How can I count the unique words in a program?

CSCI-15 Assignment #2, String processing. (60 points) Due 9/23/13

You MAY NOT use C++ string objects for anything in this program.

Write a C++ program that reads lines of text from a file using the ifstream getline() method, tokenizes the lines into words ("tokens") using strtok(), and keeps statistics on the data in the file. Your input and output file names will be supplied to your program on the command line, which you will access using argc and argv[].

You need to count the total number of words, the number of unique words, the count of each individual word, and the number of lines. Also, remember and print the longest and shortest words in the file. If there is a tie for longest or shortest word, you may resolve the tie in any consistent manner (e.g., use either the first one or the last one found, but use the same method for both longest and shortest). You may assume the lines comprise words (contiguous lower-case letters [a-z]) separated by spaces, terminated with a period. You may ignore the possibility of other punctuation marks, including possessives or contractions, like in "Jim's house". Lines before the last one in the file will have a newline ('\n') after the period. In your data files, omit the '\n' on the last line. You may assume that the lines will be no longer than 100 characters, the individual words will be no longer than 15 letters and there will be no more than 100 unique words in the file.

Read the lines from the input file, and echo-print them to the output file. After reaching end-of-file on the input file (or reading a line of length zero, which you should treat as the end of the input data), print the words with their occurrence counts, one word/count pair per line, and the collected statistics to the output file. You will also need to create other test files of your own. Also, your program must work correctly with an EMPTY input file – which has NO statistics.

Test file looks like this (exactly 4 lines, with NO NEWLINE on the last line):

the quick brown fox jumps over the lazy dog.
now is the time for all good men to come to the aid of their party.
all i want for christmas is my two front teeth.
the quick brown fox jumps over a lazy dog.


Copy and paste this into a small file for one of your tests.

Hints:

Use a 2-dimensional array of char, 100 rows by 16 columns (why not 15?), to hold the unique words, and a 1-dimensional array of ints with 100 elements to hold the associated counts. For each word, scan through the occupied lines in the array for a match (use strcmp()), and if you find a match, increment the associated count, otherwise (you got past the last word), add the word to the table and set its count to 1.

The separate longest word and the shortest word need to be saved off in their own C-strings. (Why can't you just keep a pointer to them in the tokenized data?)

Remember – put NO NEWLINE at the end of the last line, or your test for end-of-file might not work correctly. (This may cause the program to read a zero-length line before seeing end-of-file.)

This is not a long program – no more than about 2 pages of code.

Here is my solution:

#include<iostream>
#include<iomanip>
#include<fstream>
using std::cout;
using std::ifstream;
using std::ofstream;
using std::endl;
using std::cin;
using std::getline;

void totalwordCount(ifstream&, ofstream&);
void countLines(ifstream&, ofstream&);
void longestWord(ifstream&, ofstream&, char);
void shortestWord(ifstream&, ofstream&, char);
void uniquewordCount(ifstream&, ofstream&, char);

// Read and print the total number of words from the file.
void totalwordCount(ifstream &inputFile, ofstream &outputFile)
{
	char totalwords[100][16]; // Holds the total number of words.
	char *token;
	int totalCount = 0; // Counts every word.
	// Read every word in the file.
	while(inputFile >> totalwords[99])
	{
		totalCount++; // Increment the total number of words.
		// Tokenize each word and remove spaces, periods, and newlines.
		token = strtok(totalwords[99], " .\n"); 
		while(token != NULL)
		{
			token = strtok(NULL, " .\n");
		}
	}
	// Display the total number of words.
	outputFile << "Total number of words in file: " << totalCount << endl;
}

// Read and print the total number of lines in the file.
void countLines(ifstream &inputFile, ofstream &outputFile)
{
	static char lines[100]; // Holds the total number of lines.
	int lineCount = 0; // Counts every line.
	// Read every line in the file.
	while(inputFile.getline(lines,100))
	{
		lineCount++; // Increment the total number of lines.
	}
	// Display the total number of lines.
	outputFile << "Total number of lines in file: " << lineCount << endl;
}

// Find and print the longest word in the file.
void longestWord(ifstream &inputFile, ofstream &outputFile, char words[100][16])
{
	char *longest[16]; // Holds the longest word.
	int length; // Holds the length of each word.
	
	// Search every line in the file.
	while(!inputFile.eof())
	{	
		inputFile >> words[99][15]; // Read every word.
		// If one word is longer than or equal to the other, print either one.
		if(strlen(longest[15]) >= length)
		{
			length = strlen(longest[15]);
		}
	}
	// Print the longest word.
	outputFile << "Longest Word: " << longest[15] << endl;
}

// Find and print the shortest word in the file.
void shortestWord(ifstream &inputFile, ofstream &outputFile, char words[100][16])
{
	char *shortest[16]; // Holds the shortest word.
	int length; // Holds the length of each word.
	
	// Search every line in the file.
	while(!inputFile.eof())
	{	
		inputFile >> words[99][15]; // Read every word.
		// If one word is shorter than or equal to the other, print either one.
		if(strlen(shortest[15]) <= length)
		{
			length = strlen(shortest[15]);
		}
	}
	// Print the shortest word.
	outputFile << "Shortest Word: " << shortest[15] << endl;
}

void uniquewordCount(ifstream &inputFile, ofstream &outputFile, char words[100][16])
{
	int counter[100]; // Holds the associated counts.
	char *tok;
	int uniqueCount = 0; // Counts the total number of unique words
	// Read every unique word in the file.
	while(!inputFile.eof())
	{
		inputFile >> words[99];
		// If there is a match, increment the associated count.
		if(strcmp(words[99], tok) == 0)
		{
			counter[99]++;
		}
	}
	// Display each unique word and its associated count.
	for(int i = 0; i < 100; i++)
	{
		cout << words[i] << ":" << counter[i] << endl;
	}
}

// Call every function.
int main(int argc, char *argv[])
{
	ifstream inputFile;
	ofstream outputFile;
	char words[100][16];
	char inFile[12] = "string1.txt";
	char outFile[16] = "word result.txt";
		
	// Get the name of the file from the user.
	cout << "Enter the name of the file: ";
	cin >> inFile;
	
	// Open the input file.
	inputFile.open(inFile);
	
	// Open the output file.
	outputFile.open(outFile);
	
	// If successfully opened, process the data.
	if(inputFile)
	{
		// Loop through each function.
		while(!inputFile.eof())
		{	
			totalwordCount(inputFile, outputFile);
			countLines(inputFile, outputFile);
			longestWord(inputFile, outputFile, words);
			shortestWord(inputFile, outputFile, words);
			uniquewordCount(inputFile, outputFile, words);
		}
		// Close the input file.
		inputFile.close();
		// Close the output file.
		outputFile.close();
		return 0;
	}
	else
	{
		// Display the error message.
		cout << "There was an error opening the input file.\n";
	}
}


Question: In the uniquewordCount() function, I am having trouble counting the total number of unique words and the number of occurrences of each word. In the shortestWord() and longestWord() functions, I am having trouble printing the longest and shortest words in the file. I think I got the countLines() function correct, but it is not printing the total number of lines. Is there anything that I need to fix in those functions?
I would map each unique word to its number of occurrences. A map associates unique keys with values. It's efficient because std::map is typically implemented as a balanced binary search tree: at each node, the keys in the left subtree are less than the keys in the right subtree, which allows a binary search over the keys in O(log n) steps (log base 2). So if your set of words has 1,000,000 entries, checking whether a word is in the map, or fetching a word to read or modify its associated value, takes about log2(1,000,000) steps, which is roughly 20.

Pseudocode:
map <string, unsigned> wm
while get and split line
   ++lines
   for (word : line)
        ++word_count
        if (wm.has(word))
            wm(word)++
        else
            wm.put(word, 1)

unique_words = wm.size()
shortest = key in wm.keys with min length
longest  = key in wm.keys with max length
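
For reference, the pseudocode could be sketched in real C++ roughly like this. It relies on std::map and std::string, so it only illustrates the idea. The names WordStats and countWords are made up for this sketch, and ties for shortest and longest go to the first word found.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical container for the results; the names are illustrative only.
struct WordStats {
    std::map<std::string, unsigned> counts; // word -> number of occurrences
    unsigned lines = 0;
    unsigned totalWords = 0;
    std::string shortest;
    std::string longest;
};

// Reads lines from 'in', splits them on whitespace, and drops a trailing period.
WordStats countWords(std::istream& in) {
    WordStats stats;
    std::string line;
    while (std::getline(in, line)) {
        ++stats.lines;
        std::istringstream words(line);
        std::string word;
        while (words >> word) {
            if (word.back() == '.') word.pop_back(); // word is never empty here
            if (word.empty()) continue;              // the token was just "."
            ++stats.totalWords;
            ++stats.counts[word]; // operator[] starts a new key at 0
            if (stats.shortest.empty() || word.size() < stats.shortest.size())
                stats.shortest = word;
            if (word.size() > stats.longest.size())
                stats.longest = word;
        }
    }
    return stats;
}
```

The number of distinct words is then stats.counts.size().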


EDIT:
I just realized that I didn't read your instructions fully. I think the professor expects you to not use something like map, as he doesn't even allow use of string.

Sorry.
This is a pretty significant homework assignment.

Yes, you need to keep a pairing of (word, number of occurrences). That's why you need two arrays.

I have just glanced at your code, but here are some obvious issues:


(1) main()
You have a number of functions to do individual things, each one reading the file to the end.

However, in your main function, you have a loop that tries to pass the same file to each function in succession. This won't work. Remember, after the totalwordCount() call on line 139 returns, the file is at EOF. The calls on lines 140 through 143 (countLines() through uniquewordCount()) have no hope of working.


(2) Looping on EOF
// Bad! Don't do this!
while (!file.eof())  // While not at EOF
{
    file >> whatever;         // Try to read something
    do_something( whatever );  // Do something with it 
}

Do you see the problem? The read on line 4 (file >> whatever) may fail because it tried to read something and found EOF. But the loop ignores that and goes ahead to line 5, which calls do_something() with a garbage whatever. That'll mess up your counts.

You should be reading the input file like this:
char line[ 101 ];  // 100 characters in input line plus null terminator!

while (inputFile.getline( line, 101 ))
{
    // do something with 'line'
}
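
To see the difference concretely, here is a small sketch that counts words both ways. The function names are invented for this example, and the 16-character buffer follows the assignment's 15-letter limit.

```cpp
#include <iomanip>
#include <istream>

// Counts words with the broken pattern: test eof() first, read second.
int countWithEofLoop(std::istream& in) {
    char word[16];
    int count = 0;
    while (!in.eof()) {
        in >> std::setw(16) >> word; // may fail, but the result is never checked
        ++count;                     // so a failed read is still counted
    }
    return count;
}

// Counts words using the read itself as the loop condition.
int countWithReadLoop(std::istream& in) {
    char word[16];
    int count = 0;
    while (in >> std::setw(16) >> word) {
        ++count; // only reached when a word was actually read
    }
    return count;
}
```

For the input "a b c" followed by a newline, the eof() version reports 4 words and the other version reports 3. Without the trailing newline both report 3, which is why the assignment warns about the newline on the last line.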


Dang. I've got to go for a bit. I'll come back and edit this post to finish posting the help you need.

[edit] Mmm. Ice cream with the kids!


(3) Newline baloney
It is easy enough to ignore blank lines. The instant you try to strtok() anything out of it you'll get a NULL, so it doesn't cost you anything. So ignore all the stuff about newlines and impress your professor.
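
A quick sketch of that behavior, using the same delimiter set as your code. countTokens is a made-up name for this example.

```cpp
#include <cstring>

// Counts the words in one line. strtok() writes '\0' over the delimiters,
// so 'line' must be a writable buffer, not a string literal.
int countTokens(char* line) {
    const char* delimiters = " .\n";
    int count = 0;
    for (char* token = std::strtok(line, delimiters);
         token != NULL;
         token = std::strtok(NULL, delimiters)) {
        ++count;
    }
    return count;
}
```

On an empty line, or a line holding only spaces, the first strtok() call already returns NULL, so the loop body never runs and the count stays at 0.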


(4) Functions
It is a good idea to break a problem into smaller problems, but what you have done is broken a big problem into a bunch of separate problems. The problem with that is that the subproblems are not necessarily disjoint.

Let's list the things you have been asked to do:

  1 - count the total number of words (ambiguous!)
  2 - count the number of unique words
  3 - count the number of times each word appears
  4 - count the number of lines
  5 - find the longest word
  6 - find the shortest word

The first item is actually ambiguous. Does your professor mean the total number of individual words found in the file? Or does he mean the total number of different words found in the file? You can ask him, or just give both answers in your output.

You have also been told that you need to keep

  - A list of every word you find in the file
  - A list that counts the number of times each word occurs

That helps a lot with the first few items.

  1 (total number of individual words) - print the sum of all elements in counts list (but don't do that. There's a simpler way)
  1 (total number of different words) - print the number of words in your list of words
  2 - print the number of words in your list with a matching count of 1
  3 - print each word in the list and its count
  4 - no help
  5 - find and print the longest word in the list
  6 - find and print the shortest word in the list

So, it looks like your primary problem is to construct the list of words and the matching counts. Along the way you should also be counting the number of lines in the file and number of times strtok() gives you a word.

(Remember, if you add a word to the words list, you must also add a count of 1 to the counts list at the same index.)

So, here are the functions I suggest you need:

  int FindWordInList( const char* word, const char words[][16], int size );
    Returns the index in the list of the word if the word is in the list.
    Returns -1 (or the current number of words in the list, your choice) if the word is not in the list.

  int CountNumberOfUniqueWords( const int counts[], int size );
    Unique words have a count of 1. It doesn't matter here what the word actually is -- we only care if it exists with a count of 1.

  void PrintWordsAndCounts( const char words[][16], const int counts[], int size );
    Does what it says.

  int FindLongestWord( const char words[][16], int size );
    Returns the index of the longest word.

  int FindShortestWord( const char words[][16], int size );
    Returns the index of the shortest word.


(5) main(), redux
Now, your main function should be doing this:

  ifstream inputFile( ... );
  if (!inputFile) ...;

  number of lines = 0
  number of words in list = 0
  number of words in file = 0

  while (inputFile.getline( s, 101 ))
  {
      increment number of lines
      
      start strtok()ing
      while (strtok()'s result is not NULL)
      {
          increment number of words in file

          if word is found in list:
              update counts[]
          otherwise:
              append word to words[]
              append 1 to counts[]
              increment number of words in list

          print stuff you have been asked to print here
      }
  }

  PrintWordsAndCounts(...);

  print statistics:
    print number of words in file
    print number of words in words list
    print CountNumberOfUniqueWords(...)
    print longest word
    print shortest word 



(6) Misc
You don't need line 9 (using std::getline;). The getline() function you should be using is a member function of the ifstream class.

You don't need lines 11 through 15 (the function prototypes). Prototypes tell other functions what your function looks like before your function is defined. There is no point in both prototyping and defining your function before it is used.

int foo();

int foo() { cout << "fooey!\n"; return -7; }

int main()
  {
  cout << foo() << endl;
  }

See how line 1 (the prototype, int foo();) is totally superfluous?


Well, that's all you're getting out of me for now. Hope this helps.