Counts words in a string

Pages: 12
I am trying to count the number of words in a string. Right now I have some code that works, but it counts the number of spaces. And the number of spaces don't match the number of words, because the first word is not being counted. Also if the user puts multiple spaces it will count that as however many spaces they put. Is there any way to fix this?

Here is the code:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
void CountWords()
{
	int numspaces = 0;
	char nextChar;
	// checks each character in the string
	for (int i=0; i<int(myString.length()); i++)
	{
		nextChar = myString.at(i); // gets a character
		if (isspace(myString[i]))
			numspaces++;
	}
	cout << "\nThere are " << numspaces << " words in this string.";

}


Here is the output:
Enter a string: Testing this out.


A: Count vowels
B: Count non-vowels
C: Display string length
D: Display string in reverse
E: Display string in uppercase
F: Display string in lowercase
G: Count words in string
H: Enter another string
I: Exit

Enter your choice g

There are 2 words in this string.


As you can see I put 3 words...also if I put another space at the end it would give me one more word. Even though that space as nothing following it.
Make this change
cout << "\nThere are " << numspaces+1<< " words in this string.";

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
int CountWords(std::string strString)
{
  int nSpaces = 0;
  unsigned int i = 0;

  // Skip over spaces at the beginning of the word
  while(isspace(strString.at(i)))
    i++;

  for(; i < strString.length(); i++)
  {
    if(isspace(strString.at(i)))
    {
      nSpaces++;

      // Skip over duplicate spaces & if a NULL character is found, we're at the end of the string
      while(isspace(strString.at(i++)))
        if(strString.at(i) == '\0')
          nSpaces--;
    }
  }

  // The number of words = the number of spaces + 1
  return nSpaces + 1;
}
Last edited on
closed account (EzwRko23)
OMG.

def countWords(str: String) = str.split("\\s+").size

25 to 1 :D Sorry, really couldn't resist.
Last edited on
The same can be accomplished in C++ in one line using the boost::regex library.
closed account (EzwRko23)
No. One line gets wasted for the include. :D (just joking)

---------------------
A C++ solution added later:
1
2
3
4
5
6
7
8
9
10

#include <vector>
#include <string>
#include <boost/algorithm/string/split.hpp>

int countWords(std::string str) {
  vector< std::string > result;
  boost::algorithm::split_regex(result, str, regex( "\\s+" ));
  return result.size();
}


Definitely not a one liner, even excluding imports.
Last edited on
It is easy to create one-liners when you cheat. One-liners are overrated -- especially when they are so resource-wasteful as to create an entire new object just to throw away.

There are a zillion ways to do this, but just using the STL is simple enough, creating only one copy of the string and traversing it only four times:

1
2
3
4
#include <algorithm>
#include <cctype>
#include <functional>
#include <string> 
1
2
3
4
5
6
7
8
inline unsigned CountWords( const std::string& s )
  {
  std::string x = s;
  std::replace_if( x.begin(), x.end(), std::ptr_fun <int, int> ( std::isspace ), ' ' );
  x.erase( 0, x.find_first_not_of( " " ) );
  if (x.empty()) return 0;
  return std::count( x.begin(), std::unique( x.begin(), x.end() ), ' ' ) + !std::isspace( *s.rbegin() );
  }

Like your code, this has the advantage over the OP's code in separating on any whitespace character, not just ASCII spaces.

Oh, and if you have a fancy function to split stuff:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
#include <algorithm>
#include <cctype>
#include <functional>
#include <list>
#include <string>

template <typename Predicate>
std::list <std::string> split( const std::string& s, Predicate p )
  {
  std::list <std::string> result;
  std::string::const_iterator rtok;
  std::string::const_iterator ltok = std::find_if( s.begin(), s.end(), std::not1( p ) );
  while (ltok != s.end())
    {
    rtok = std::find_if( ltok, s.end(), p );
    result.push_back( std::string( ltok, rtok ) );
    ltok = std::find_if( rtok, s.end(), std::not1( p ) );
    }
  return result;
  }

...then you can write your one-liners:

1
2
3
4
inline unsigned CountWords( const std::string& s )
  {
  return split( s, std::ptr_fun <int, int> ( std::isspace ) ).size();
  }


:-\
Did anyone say one-liner?
1
2
3
4
size_t word_count(std::string s)
{
	return s.empty() ? 0 : std::count_if(std::find_if(s.begin(), s.end(), std::not1(std::ptr_fun<int, int>(std::isspace))), std::unique(s.begin(), s.end()), std::ptr_fun<int, int>(std::isspace)) + !std::isspace(*s.rbegin());
}

...not heavily tested...
Woah, sweet!

The very first thing I wrote was almost identical, but erred on how to handle leading whitespace.
(That and I had as argument a constant reference...)
Your function only has one error: you cannot rely on
  s.empty(),
but must use
  ((size_t)std::count_if( s.begin(), s.end(), std::ptr_fun <int, int> ( std::isspace ) ) == s.length()).

Alas, that makes a pretty long line, but it is indeed a one-liner:

 
size_t CountWords( std::string s ) { return ((size_t)std::count_if( s.begin(), s.end(), std::ptr_fun <int, int> ( std::isspace ) ) == s.length()) ? 0  : std::count_if( std::find_if( s.begin(), s.end(), std::not1( std::ptr_fun <int, int> ( std::isspace ) ) ), std::unique( s.begin(), s.end() ), std::ptr_fun <int, int> ( std::isspace ) ) + !std::isspace( *s.rbegin() ); }

I think that takes it down to three passes over the string...

+1
Most excellent! I missed a nasty bug. :o)

closed account (EzwRko23)
OMG. A whole thread about a trivial issue. And even some buggy code posted.
Whatever you write, and however youn format your code, it won't be simpler than str.split("\\s*").size. Yes, it can be a little faster, but don't optimize until you really need it.

It is easy to create one-liners when you cheat. One-liners are overrated -- especially when they are so resource-wasteful as to create an entire new object just to throw away.


Mine is less resource wasteful than whole lot of C++ code posted here (single pass instead of 3 passes - and for a large string, memory access is the bottleneck here). Thanks to one-liners, instead of concentrating how to solve trivial problems like counting words in a string, you can go on and concentrate on really hard stuff.

In many good companies you would not be able to commit such badly formatted and unreadable code (in any language). This would not pass code-review.
 
size_t CountWords( std::string s ) { return ((size_t)std::count_if( s.begin(), s.end(), std::ptr_fun <int, int> ( std::isspace ) ) == s.length()) ? 0  : std::count_if( std::find_if( s.begin(), s.end(), std::not1( std::ptr_fun <int, int> ( std::isspace ) ) ), std::unique( s.begin(), s.end() ), std::ptr_fun <int, int> ( std::isspace ) ) + !std::isspace( *s.rbegin() ); };


@xorebxebx

This is just a bit of fun trying to write the program in one line in C++ which was never designed for non-verbose code. Three passes are only necessary to fit in with the one-line requirement.

No one is pretending C++ is brief. If you don't like C++ why don't you leave? I don't know of any regular here that appreciates your anti-C++ rhetoric.


@xorebxebx

And while we are on the subject of your contribution.

1) It is in the wrong language.

2) Creating an array to generate a word count is absurd.

3) Anyone doing that should be lucky not to get the sack. I wouldn't hire them in the first place.

xorebxebx
I know you are an industry flunky, but that does not mean everything that spouts out of you is correct.

First off, yours is not a single pass, unless your split algorithm is really smart (meaning it has a tradeoff on memory usage).

Secondly, three passes over an array is significantly cheaper than memory allocation and copying -- something you should know.

Third off, having a bug in the initial attempt doesn't validate any argument, either for or against -- and it is a logical fallacy to present it as such.

Fourth off, C++ wasn't designed to look pretty on one line. Format that function nicely and it would pass code reviews easily:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
size_t CountWords( std::string s )
  {
  // If the string contains only whitespace, then there are no words.
  //
  // Otherwise:
  //   number of words := number of whitespace sequences - leading whitespace + trailing whitespace
  //
  return ((size_t)std::count_if( s.begin(), s.end(), std::ptr_fun <int, int> ( std::isspace ) ) == s.length())
       ? 0
       : std::count_if(
           std::find_if(
             s.begin(),
             s.end(),
             std::not1( std::ptr_fun <int, int> ( std::isspace ) )
             ),
           std::unique(
             s.begin(),
             std::replace_copy_if(
               s.begin(),
               s.end(),
               s.begin(),
               std::ptr_fun <int, int> ( std::isspace ),
               ' '
               )
             ),
           std::ptr_fun <int, int> ( std::isspace )
           )
         + !std::isspace( *s.rbegin() );
  }
(Properly formatting it made me notice one more thing that was missing... which is why I originally said four passes...)

Finally, you are the one who started this "trivial issue". But there you have erred on arrogance again: it is not a trivial issue. Counting patterns in data is a very complex subject -- again something you should know.

So spend your hatred elsewhere.


Oh, BTW, I fixed a bug in your code. (Pay attention to the regexp):
 
def countWords(str: String) = str.split("\\s+").size


Alas. I've had enough of this now.
Isn't this a one-liner?
CountWords ( "foo bar" );
;^)
This problem depends on a number of assumptions, such as what is a "word". Regardless, I'm surprised no one had posted something like this:
1
2
3
4
5
6
7
int count_words( string s ) {
    int word_count( 0 );
    stringstream ss( s );
    string word;
    while( ss >> word ) ++word_count;
    return word_count;
}

Last edited on
+1 moorecm

That's a great solution.

It is easy to generalise your solution to count streams of arbitrary length with no additional resource overhead:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
// general case, stream interface
inline size_t word_count(std::istream& is)  // can pass an open std::ifstream() to this if required
{
	size_t c = 0;
	for(std::string w; is >> w; ++c);
	return c;
}

// simple string interface
inline size_t word_count(const std::string& str)
{
	std::istringstream iss(str);
	return word_count(iss);
}

// EDIT: - just thought I'd add this ;o)

int main()
{
	std::ifstream ifs("monster-humungus-file-way-to-large-to-split-into-an-array.txt");

	size_t words = word_count(ifs);
	// ... etc ...
}
Last edited on
I like it.
closed account (EzwRko23)

Secondly, three passes over an array is significantly cheaper than memory allocation and copying -- something you should know.


Oh really? First - there is no copying in my code. substring() used behind the scenes by the split function does not copy anything. Second, for a large string, an additionall pass is much more expensive than additional memory allocation - because of cache issues. Don't judge Java code by C++ performance characteristic. Memory allocation is slow in C++ (~80-300 cycles per small object for typical allocators), but fast in Java (< 20 cycles per small object), because Java allocates from continuous memory region - it just increases a pointer.


Creating an array to generate a word count is absurd.


As long as it is unnoticeable to the user, it is not. When it is, it can be trivially replaced by a better option. You are doing it the wrong way. First write something that works, then optimize if needed. You should listen to the people that know programming better than you (D. Knuth said optimisation is the root of all evil). Doing the other way round is a recipe for bloated unmaintainable code, noone likes to work with.



Finally, you are the one who started this "trivial issue".


I only laughed at one of the solutions of this trivial issue - a 25 line long code for something that should take 5 seconds to write and should not be really longer than a few (like one) short lines of code. The OP question was just to count the words, not how to count them in one pass, without creating temporary objects etc.


Anyway: a one-liner that doesn't copy anything, doesn't allocate a large array*, and does only a single pass:

 
def countWords(s: String) = "[^\\s+](\\s+|$)".r.findAllIn(s).size


Short as in Perl, fast as in C. :P

*) findAllIn returns a lazy iterator and does matching on demand


Counting patterns in data is a very complex subject

Counting is trivial. Detecting patterns is not. But this is done by regexp here (and regexp implementation is pretty trivial).



Last edited on
D. Knuth said optimisation is the root of all evil

Actually, he said that 95% of the time premature optimization is the root of all evil, and he was quoting C. A. R. Hoare, IIRC. There's a big difference there. It's one of the biggest misquotes ever (along with "play it again, Sam"), so I thought I'd clear that up.
Pages: 12