Most Common Word in a Document

I hope this is the right place to post this.....

I am currently having to write a program that opens a file and figures out which word appears the most. It has to be able to deal with punctuation and numbers so that they don't affect the count. In other words, I have to make sure "wow.", "wow!", and "wow?" are all counted as just "wow", and if it's something like "S20" I need to delete the numbers and keep the "S".
I think I can figure that part out, but the part I'm having trouble with is that if the string is something like "h3llo", I have to delete the 3, then keep the "h" and "llo" as two separate words.

Here is what I have so far...(I know I'm an amateur, please be kind)

#include <iostream>
#include <string>
#include <cstring>
#include <fstream>
#include <cctype>
using namespace std;

const int MAX = 20;

string words[MAX];
int count[MAX];
int counter=0;

string Format(string entry)
{
	int size = entry.length();

	for(int chr = 0; chr < size; chr++)
	{
		if (isdigit(entry[chr]))
			entry[chr] = ' ';
		if (isupper(entry[chr]))
			entry[chr] = tolower(entry[chr]);
		if (ispunct(entry[chr]))
			entry[chr] = ' ';
	}
	return entry;
}

void WordCounter(string word)
{
	for (int i = 0; i < counter; i++)
		if(word == words[i])
			count[i]++;
	words[counter] = word;
	count[counter] = 1;
	counter++;
}

int main(int argc, char* argv[])
{

	ifstream Input("3.txt");
	string word, word2;


	if(!Input)
		cout << "There is something wrong with the file!" << endl;

	while (Input >> word)
	{
		word2 = Format(word);
		WordCounter(word2);
	}

	for (int i = 0; i < counter; i++)
		cout << words[i] << " " << count[i] << endl;


	return 0;
}


This is the contents of the file that I'm currently using to test this, but I will need it to be able to process other files...

How are you?
Fine! Thanks. And you?
I'm fine, too.



I can't use any STL code, and so far all the output is just there to show me what the program is doing.

Here is the current output...

how 1
are 1
you  2
fine  2
thanks  1
and 1
you  1
i m 1
fine  1
too  1



EDIT: I forgot to mention that if there is more than one most common word I have to state both of them. So for this file I would need to state something like
you: 2
fine: 2
I think I can figure that part out, but the part I'm having trouble with is that if the string is something like "h3llo", I have to delete the 3, then keep the "h" and "llo" as two separate words.


If you go through the whole thing once replacing every number and punctuation mark with a space, outputting everything to somewhere convenient, you can then go through it all again from that convenient place. "h3llo" will have become two separate words, as you wanted, and there will also be no punctuation in it.
Metalman488 wrote:
I can't use any STL code. [...]
#include <iostream>
#include <string>
#include <cstring>
#include <fstream>
#include <cctype> 


What counts as "STL" code in this context?

Note: The word "Standard" in the original Standard Template Library does not refer to the C++ standard. Most of its contents were included in the first C++ standard in 1998, and those facilities are now properly called part of the C++ Standard Library. IMO, nowadays "STL" usually means "C++ Standard Library" - but this surely isn't how you used it.
If you go through the whole thing once replacing every number and punctuation mark with a space, outputting everything to somewhere convenient, you can then go through it all again from that convenient place. "h3llo" will have become two separate words, as you wanted, and there will also be no punctuation in it.

I've modified the code a bit since you replied, but I'm still having this issue...
So now I have a string array (words[]) that holds all of the words after punctuation and numbers have been removed, but it would still hold something like "h3llo" as "h llo". What code would you suggest for that?
I'm thinking I could make a function that accepts things like "h llo" and uses isspace() somehow to split it into two words, but how would I return those words and add them to the words[] array?
And I'm not going to know which files it's going to be used with, so for all I know a string in the file could be "how1are2you", which would be split into 3 words, so I wouldn't know how many strings the function I proposed would need to return.
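One possible shape for such a function, as a sketch only: let the caller pass in the array to fill and have the function return how many words it stored, so the number of pieces never has to be known in advance. SplitWords, out, and capacity are illustrative names, not code from this thread, and the sketch assumes the digits and punctuation have already been replaced with spaces.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not from the thread): copies each space-separated
// word of `text` into `out`, storing at most `capacity` words.
// Returns the number of words actually stored.
int SplitWords(const std::string& text, std::string out[], int capacity)
{
    int found = 0;
    std::string current;                        // the word being built

    for (std::string::size_type i = 0; i < text.length(); i++)
    {
        // isspace() expects a value representable as unsigned char
        unsigned char c = static_cast<unsigned char>(text[i]);
        if (std::isspace(c))
        {
            if (!current.empty() && found < capacity)
                out[found++] = current;         // whitespace ends the current word
            current.clear();
        }
        else
        {
            current += text[i];
        }
    }

    if (!current.empty() && found < capacity)   // last word has no trailing space
        out[found++] = current;

    return found;
}
```

With this shape, the caller loops from 0 up to the returned count and passes each entry to WordCounter(), whether the input produced one word or several.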
What counts as "STL" code in this context?


I think it's the stuff from the link below.
http://www.geeksforgeeks.org/the-c-standard-template-library-stl/
if it's something like "S20" I need to delete the numbers and keep the "S".

Should I'm then not become Im instead of I am ?
So now I have a string array (words[]) that holds all of the words after punctuation and numbers have been removed, but it would still hold something like "h3llo" as "h llo". What code would you suggest for that?


I would suggest that instead of doing that, you insert them into a stringstream (remember to insert a space after each) and then extract them out again. The stringstream will take care of the spaces for you.


#include <string>   
#include <iostream>   
#include <sstream>      

using std::cout;
int main () {

  std::stringstream ss;

  ss << "beans on" <<" " << "toast";

  std::string x;
  ss>>x;
  cout << x; 
  cout << '\n';

  ss>>x;
  cout << x; 
  cout << '\n';
 
  ss>>x;
  cout << x; 
  cout << '\n';

  return 0;
}
I would suggest that instead of doing that, you insert them into a stringstream (remember to insert a space after each) and then extract them out again.

This.
for each word in input
  for each letter in word
    lowercase letter
    replace with space if not valid character
  append word to ss

for each word in ss
  accumulate statistics



PS. Your program has no safety guard against 'MAX <= counter' situation.
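A guard for that case might look like the following sketch. AddWord and its parameters are made-up names for illustration (the thread's code uses the globals words, count, and counter instead); the point is only that a new word is refused once the arrays are full instead of being written past the end.

```cpp
#include <string>

// Hypothetical guarded insert: bumps the count of a known word, or appends
// a new word if there is still room. `used` is the number of slots in use.
// Returns false only when the word is new and the arrays are already full.
bool AddWord(const std::string& word, std::string words[], int counts[],
             int& used, int capacity)
{
    for (int i = 0; i < used; i++)
    {
        if (word == words[i])
        {
            counts[i]++;
            return true;          // already known, nothing to append
        }
    }

    if (used >= capacity)         // the 'MAX <= counter' situation
        return false;

    words[used] = word;
    counts[used] = 1;
    used++;
    return true;
}
```

The caller can then decide what to do on false, for example print a warning and stop reading.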

Should I'm then not become Im instead of I am ?


I think I am supposed to have it become "I" and "m". Two separate "words".


I would suggest that instead of doing that, you insert them into a stringstream (remember to insert a space after each) and then extract them out again. The stringstream will take care of the spaces for you.


That seems to work pretty well, but I'm still having trouble figuring out how to insert them into the words[] array afterward.
Something like this, perhaps:

#include <iostream>
#include <string>
#include <fstream>
#include <cctype>
#include <sstream>
#include <iomanip>

// return a string with characters other than alpha characters
// in the line replaced with spaces, and with all upper case characters
// converted to lower case
std::string format( std::string line )
{
    std::string result ;

    for( char& c : line ) // for each character in line
    {
        // if it is an alpha character, add it to result
        if( std::isalpha(c) ) result += std::tolower(c) ; 

        else result += ' ' ; // otherwise, add a space to result
    }

    return result ;
}

// reads words one by one into the array up to arr_size words
// return the number of words that were read in
int get_words( std::istream& stm, std::string words_array[], int arr_size )
{
    int num_words = 0 ; // number of words read so far

    std::string line ;

    // for each line in the input stream
    while( num_words < arr_size && std::getline( stm, line ) )
    {
        // create an input string stream to read from the formatted line
        std::istringstream str_stm( format(line) ) ;

        std::string word ;
        while( num_words < arr_size && str_stm >> word ) // for each word in the line
        {
            words_array[num_words] = word ; // add it to the words array
            ++num_words ; // and increment the count
        }
    }

    return num_words ;
}

int main()
{
    const int MAX = 1000 ; // maximum number of words that we can read
    std::string words[MAX] ;

    std::ifstream file( __FILE__ ) ; // this file: modify as required. eg.
                                     // std::ifstream file( "3.txt" ) ;

    const int nwords_read = get_words( file, words, MAX ) ;

    // print out the words that were read
    for( int i = 0 ; i < nwords_read ; ++i )
        std::cout << std::setw(4) << i+1 << ". " << words[i] << '\n' ;

    // take it up from there
}

http://coliru.stacked-crooked.com/a/dd7f1599eb84fd67
What about deleting the punctuation and number as the first step, and then going in a loop like this?

unsigned int i = 0;
while (i < MAX && std::cin >> arr[i]) { ++i; } // advance i so each word gets its own slot; MAX is assumed to be the size of arr


This would discard the whitespaces and neatly separate "h" and "llo" from each other.
Just my two cents, I already see some good code snippets in here.
This is my most up to date code.....
#include <iostream>
#include <string>
#include <cstring>
#include <fstream>
#include <cctype>
#include <sstream>
using namespace std;

const int MAX = 500;

string words[MAX];
int count[MAX];
int counter=0;
stringstream hold;

void Splitter(string);
void WordCounter(string);

string Format(string entry)
{
	int size = entry.length();

	for(int chr = 0; chr < size; chr++)
	{
		if (isdigit(entry[chr]))
		{
			entry[chr] = ' ';
			Splitter(entry);
			entry = ' ';
		}
		if (entry[chr] == '\'')
		{
			entry[chr] = ' ';
			Splitter(entry);
			entry = ' ';
		}
		if (isupper(entry[chr]))
			entry[chr] = tolower(entry[chr]);
		if (ispunct(entry[chr]))
			entry[chr] = ' ';
	}
	return entry;
}

void Splitter(string chop)
{
	int size = chop.length();
	string str[size];
	hold << chop;

	for (int i = 0; i < size; i++)
	{
		hold >> str[i];
		WordCounter(str[i]);
	}

}

void WordCounter(string word)
{
	for (int i = 0; i < counter; i++)
		if(word == words[i])
		{
				count[i]++;
		}
	words[counter] = word;
	count[counter] = 1;
	counter++;
}

int main(int argc, char* argv[])
{

	ifstream Input("3.txt");
	string word, word2;


	if(!Input)
		cout << "There is something wrong with the file!" << endl;

	while (Input >> word)
	{
		word2 = Format(word);
		WordCounter(word2);
	}

	int MostCommon, MCindex;

	cout << "Counter: " << counter << endl;

	MostCommon = count[0];
	for (int i = 0; i < counter; i++)
	{
		if (count[i] > MostCommon)
		{
			MostCommon = count[i];
			MCindex = i;

		}
		cout << "_" << words[i] << "_" << " " << count[i] << endl;
	}


		cout << endl << "Most Common: "  << words[MCindex] << ' ' << MostCommon << endl;


	return 0;
}


The problem I'm having now is that if there is more than one most common word (multiple words with same highest value) I have to state all of them, and this just states the first one it finds.
For example, this file has "you" and "fine", which both have 2 occurrences, more than any other word, but the program only states that "you" is the most frequent with 2 occurrences. I need it to state that "you" AND "fine" are the most common words with 2 occurrences each. Any ideas?
Don't just check for being the most common word found so far. Also check for the word being equally as common as the most common word.

vector<string> commonWords;

if (count[i] > MostCommon)
{
  MostCommon = count[i];
  commonWords.clear();
  commonWords.push_back(words[i]);
}
else if (count[i] == MostCommon)
{
  commonWords.push_back(words[i]);
}

Do two steps:
1. Find the maximum count.
2. Print all words that have that count.

Don't just check for being the most common word found so far. Also check for the word being equally as common as the most common word.

It's a good idea, but remember that I can't use anything from the STL library and that includes vectors.
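The same idea can be written without a vector, as a rough sketch: keep a plain int array of indices plus a tie counter, and reset the counter to zero wherever the vector version calls clear(). FindMostCommon, counts, used, and tieIndex are hypothetical names, and the caller is assumed to supply a tieIndex array with room for `used` entries.

```cpp
// Hypothetical one-pass tie tracking with plain arrays.
// Stores in tieIndex[] the index of every entry whose count equals the
// largest count, and returns how many indices were stored.
// Assumes counts are non-negative and tieIndex has room for `used` entries.
int FindMostCommon(const int counts[], int used, int tieIndex[])
{
    int best = -1;                  // lower than any real count
    int ties = 0;

    for (int i = 0; i < used; i++)
    {
        if (counts[i] > best)
        {
            best = counts[i];
            ties = 0;               // plays the role of commonWords.clear()
            tieIndex[ties++] = i;   // plays the role of push_back
        }
        else if (counts[i] == best)
        {
            tieIndex[ties++] = i;   // another word tied for the lead
        }
    }
    return ties;
}
```

Afterwards, a loop from 0 up to the returned value can print words[tieIndex[k]] for each tied word.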


Do two steps:
1. Find the maximum count.
2. Print all words that have that count.


These both make sense if I can figure out how to do it.... Would I need to make MCindex an array so that it can hold more than one index value? Then just do a loop to print them out with words[MCindex[i]]?

EDIT: I currently have this, but the output is pretty off...
for (int i = 0; i < counter; i++)
	{
		if (count[i] > MostCommon)
		{
			MostCommon = count[i];
			MCindex[i] = i;
		}
		else if (count[i] == MostCommon)
			MCindex[i++] = i;


It's outputting this...

Most Common: are 2

Most Common: how 2

Most Common: you  2

Most Common: thanks  2
string is STL. string is typedef for basic_string<char>


If you can't store the most common words as you go, then go round twice. The first time round, you establish the value of MostCommon.

The second time round, output every word whose count value equals that value.

string is STL


Well I know I'm not allowed to use the stuff listed on this site...
http://www.geeksforgeeks.org/the-c-standard-template-library-stl/

But I'm fairly certain I can use strings.
I expect so, yes. You're basically writing C code, with string and helper IO.
I currently have this, but the output is pretty off...

Record the index of the max item only:
MCindex=0;  // start by assuming the first item is most common
for (int i = 1; i < counter; i++)
	{
		if (count[i] > count[MCindex])
		{
			MCindex = i;
		}
	}

Your Format() function doesn't handle non-printing characters, which aren't numbers or punctuation. You should view the characters as "alpha" and "other":
string Format(string entry)
{
	int size = entry.length();

	for(int chr = 0; chr < size; chr++)
	{
		if (isalpha(entry[chr])) {
			entry[chr] = tolower(entry[chr]);
		} else {
			entry[chr] = ' ';
		}
	}
	return entry;
}

WordCounter is wrong. If it finds a word then it needs to return. Right now it will always add the word, regardless of whether it's found:
void WordCounter(string word)
{
	for (int i = 0; i < counter; i++)
		if(word == words[i])
		{
			count[i]++;
			return;
		}
	words[counter] = word;
	count[counter] = 1;
	counter++;
}


I don't understand what Splitter() is doing. I'd do the stringstream method:
        while (Input >> word) {
            word2 = Format(word);
            istringstream ss(word2);
            while (ss >> word) {
                WordCounter(word);
            }
        }



Record the index of the max item only:


The only problem is that if there is more than one max item ("fine" and "you" in this case) I have to state both of them.

As for the other suggestions, thank you very much for those; they definitely helped clean things up a bit.