How to count words effectively?

Hi, I made this small example for counting words, but it doesn't work well when I put in two sentences or more than one space between words, since the entire counting works with spaces, dots and exclamation marks. Can somebody tell me some smart way of fixing this? You don't have to post code, just what you think I should do. I know the problem is that I count each space as "a word" (I know, stupid, but it's my first word counting example). I should also solve the problem of several spaces in a row, which gives an incorrect word count, and the problem of the empty space after the end of a sentence. Help!

I have a file called text.txt with a sentence inside. I translated some of the text to English so that you guys could understand it, so if there's some problem it's because of the translation. :)
#include<iostream>
#include<cstdio>
#include<fstream>
#include<string>

int main(){
	std::ifstream in("text.txt");
	if(not in){
		std::perror("text.txt");
	}
	else {
		std::string text;
		unsigned int numWords = 0;
		getline(in, text);
		std::cout<<"Text: \""<<text<<"\""<<std::endl;
		for(int i=0; i<text.size(); i++){
			if(text[i] == ' ' || text[i] == '.' || text[i] == ',' || 
                           text[i] == '!' || text[i] == '?'){
				++numWords;
			}
		}
		in.close();
		std::cout<<"In this textual file \"text.txt\" there exist "<<numWords;
		std::cout<<" words!"<<std::endl;
	}
	return 0;
}
Well, it seems I got it now.
In the if statement I added this code:
&& (text[i-1] != ' ' && text[i-1] != '!' 
&& text[i-1] != '.' && text[i-1] != '?' )

and I also extended the condition that was already inside the if statement to:
text[i] == ' ' || text[i] == '.' || text[i] == ',' 
|| text[i] == '!' || text[i] == '?' || text[i] == '\0'

and in the for loop I changed the condition so that the loop goes up to the last element of the string, '\0':
for(int i=0; i<=text.size(); i++){
Just for now it seems to do the work...
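
For reference, the "check the previous character" idea can be sketched as a small function. This is only a sketch under assumptions: isSeparator and countWordEnds are names made up for this example, and the separator set is the one from the if statement above. One thing it does differently is start at i = 1, because text[i-1] is not a valid index when i is 0.

```cpp
#include <string>

// Hypothetical helper: true for the characters treated as word separators above.
bool isSeparator(char c) {
	return c == ' ' || c == '.' || c == ',' || c == '!' || c == '?';
}

// Hypothetical function: counts the positions where a word ends, i.e. a
// separator (or the end of the string) directly after a non-separator.
unsigned int countWordEnds(const std::string& text) {
	unsigned int numWords = 0;
	// i starts at 1, so text[i - 1] is always a valid index
	for (std::string::size_type i = 1; i <= text.size(); i++) {
		bool atBoundary = (i == text.size()) || isSeparator(text[i]);
		if (atBoundary && !isSeparator(text[i - 1])) {
			++numWords;
		}
	}
	return numWords;
}
```

Called as countWordEnds(text) on the line read from text.txt, repeated signs such as "!!!" or several spaces in a row count only once, because only the first separator after a word has a non-separator in front of it.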
hmmm..

don't you think these are a lot of conditions? first you look for spaces and other characters and then you check that the previous one was not a space or other character.

what if we increment numWords as soon as a new word starts..!!??

something like:
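for(int i=0; i<text.size();)
{
	//while we are finding spaces, tabs, new lines etc. keep incrementing i
	while(i<text.size() && text[i] == ' ')
		i++;

	//only trailing spaces were left, so there is no new word
	if(i >= text.size())
		break;

	//ok, we got a new character after the spaces, tabs etc.
	numWords++;

	//pseudocode: while we are finding word characters keep incrementing i
	while(i<text.size() && /* text[i] is part of a word */)
		i++;
}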


does this make any sense??
certainly we can optimize it also..
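
The same "count where a word starts" idea can also be written as a single loop with a flag instead of two nested loops. A minimal sketch, assuming that only ' ' separates words as in the loop above; countWordStarts and inWord are names invented for this example:

```cpp
#include <string>

// Hypothetical function: counts words by counting the places where a word starts.
// Only ' ' is treated as a separator here.
int countWordStarts(const std::string& text) {
	int numWords = 0;
	bool inWord = false;  // true while we are inside a word
	for (std::string::size_type i = 0; i < text.size(); i++) {
		if (text[i] == ' ') {
			inWord = false;   // a space ends the current word, if there was one
		} else if (!inWord) {
			inWord = true;    // first character after spaces: a new word starts
			numWords++;
		}
	}
	return numWords;
}
```

With this version there is nothing to "jump over": the letters after the first one are simply ignored because inWord is already true.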
Yes, I can see your example IS better. It's better to count the beginnings of words than spaces, dots, exclamation marks and other signs. The reason I did it that way is that I didn't know how to skip the remaining characters of the word. Like, if I find the first letter of a word, OK, but how do I JUMP over the other letters of the same word without counting them? That is why I chose counting spaces and such, and checking whether the character before is also a space, dot, exclamation mark and so on, so that if there are for example three signs together (!!!) I don't count that as more words. I don't know if you understand me, and I don't blame you if you don't, since this is a pretty inefficient way to count words. :D
Having said that, I do have a question for you, writetonsharma!
In the last while statement, I don't understand what should go inside. Since I'm inside a word now, do I check for characters other than ., ?, ! etc.? Because those signs are not part of the word. And if that is the case, do I use logical NOT like this:
s is the name of the string in this case...

while(s[i] != '.' && s[i] != '!' && s[i] != ' ' && s[i] != '?')

I did it like this and the variable i was incremented forever = endless loop!
Please explain. And thank you for your post!
Is it like this:
#include<iostream>
#include<string>

int main(){
	std::string s = "Some random text!!! This is cool. Is it?";
	int numWords = 0, i = 0;
	while(i < s.size()){
		if(s[i] != '.' && s[i] != '!' && s[i] != '?' 
                    && s[i] != ' ' && s[i] != ','){
			numWords++;
			// check the bounds first, then look at the character
			while(i < s.size() && s[i] != '.' && s[i] != '!' && s[i] != '?' 
                                  && s[i] != ' ' && s[i] != ','){
				i++;
			}
		}
		else {
			i++;
		}
	}
	std::cout<<numWords<<std::endl;
	//...
	return 0;
}
"I don't know if you understand me"

i understood you fully and that's why i gave a new solution. :)

"In the last while statement, I don't understand what should go inside. Since I'm inside a word now, do I check for characters other than ..."

the thing is straightforward.. keep the while loop going as long as you are finding letters.. so it could be like this:

while((text[i] >= 'a' && text[i] <= 'z') || (text[i] >= 'A' && text[i] <= 'Z'))
i++;


while(s[i] != '.' && s[i] != '!' && s[i] != ' ' && s[i] != '?')

this loop runs until it finds one of these characters. if the string ends in the middle of a word, none of them is ever found, and since nothing checks i against the size of the string, i just keeps growing.. so most likely that is why it was infinite.

see, your words are made of letters.. so apart from letters everything is a separator for you.. correct??

so you can write your condition using 'a', 'z', 'A' and 'Z'.. correct??

i changed your program a bit.. see if that works for you.

#include<iostream>
#include<string>

int main()
{
	std::string s = "Some random text!!! This is cool. Is it?";

	int numWords = 0, i = 0;
	int size = s.size();

	while(true)
	{
		//skip everything that is not a letter (spaces, punctuation, digits...)
		while(i < size && !((s[i] >= 'a' && s[i] <= 'z') || (s[i] >= 'A' && s[i] <= 'Z')))
		{
			i++;
		}

		//no letters left, so there are no more words
		if(i >= size)
			break;

		//a new word starts here, so count it
		numWords++;

		//now skip the current word to reach the separators; they will be taken care of by the first while loop
		while(i < size && ((s[i] >= 'a' && s[i] <= 'z') || (s[i] >= 'A' && s[i] <= 'Z')))
		{
			i++;
		}
	}

	std::cout<<numWords<<std::endl;

	//...

	return 0;
}
I can see that calculating the size of the string once is better than doing it again and again, OK. I didn't know you could do things like this:
s[i] >= 'a'
Before you posted this I was thinking of converting the character to its ASCII number and then doing the same thing, just with numbers.
However, this works only for the English alphabet. How do I extend it to others? Do I just add more conditions like:
... && s[i] == 'š' && ... and so on?
That's a lot of conditions. Any suggestions?
'a' is equivalent to its ASCII value.. you don't need to convert explicitly.

... && s[i] == 'š' && ...

for other languages you need to know the range of the character set, e.g. for English we know it ranges from a to z and A to Z.
similarly for other languages, if you know the character set range it will be easy..

but there is a twist..
your application should use unicode for other languages.. because most character sets other than English can't fit in the 256 values of a single char (unless you stick to one single-byte code page). you need to use wchar_t.
let's say you want to store š. its unicode code point is U+0161 (353 in decimal). can it fit into a char type.. NO.
so for that you need the wchar_t data type, which is wider than char (2 bytes on Windows, typically 4 bytes on Linux).

refer to this post for details:
http://cplusplus.com/forum/windows/9797/
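
One possible alternative to wchar_t, sketched under a big assumption: that the text is UTF-8 encoded (the thread doesn't say which encoding text.txt uses). In UTF-8, š is stored as the two bytes 0xC5 0xA1, and every byte of a non-ASCII character has a value of 0x80 or more, so a byte-wise loop can treat all such bytes as part of a word. isWordByte and countWordsUtf8 are hypothetical names, and the sketch does not tell non-ASCII letters apart from non-ASCII punctuation.

```cpp
#include <string>

// Hypothetical helper: ASCII letters are word characters, and so is every byte
// >= 0x80, which in UTF-8 only occurs inside multi-byte (non-ASCII) characters.
bool isWordByte(unsigned char b) {
	return b >= 0x80 || (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z');
}

// Hypothetical function: counts runs of word bytes in a UTF-8 encoded string.
int countWordsUtf8(const std::string& s) {
	int numWords = 0;
	std::string::size_type i = 0;
	while (i < s.size()) {
		if (isWordByte(static_cast<unsigned char>(s[i]))) {
			numWords++;  // a new word starts here
			// skip the rest of the word
			while (i < s.size() && isWordByte(static_cast<unsigned char>(s[i]))) i++;
		} else {
			i++;         // separator, keep looking
		}
	}
	return numWords;
}
```

For example, countWordsUtf8("\xC5\xA1uma i more!") sees the two bytes of š as part of the first word. The cast to unsigned char matters because plain char may be signed, in which case bytes like 0xC5 would show up as negative numbers.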
You could use isalpha, like this:
#include<iostream>
#include<string>
#include<cctype>

int main(){
	std::string s = "Some random text!!! This is cool. Is it?";
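	int numWords = 0;
	std::string::size_type i = 0;
	while(i < s.size()){
		// cast to unsigned char: isalpha is undefined for negative char values
		if(std::isalpha(static_cast<unsigned char>(s[i]))){
			numWords++;
			while(i < s.size() && std::isalpha(static_cast<unsigned char>(s[i]))) i++;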
		}
		else {
			i++;
		}
	}
	std::cout<<numWords<<std::endl;
}
