Word Stats (frequency, order)

I have been completing an assignment where I need to make a console application that loads a text file called banned.txt into a bannedText array. I have done this.

I then had to load the words within a 'text1.txt' file and compare them with the banned words. If text1 contained a banned word, the banned word was filtered out.

I was also given text2.txt, text3.txt and text4.txt. I needed to do the same with them all.

Here is what I have so far
http://i.gyazo.com/53ab68417ffc274a99767113d1c7c2b2.png

When I select '2', this is shown; the banned words filter perfectly.
http://i.gyazo.com/9fdb7086f8d54394cdcec8f071e606e5.png

The Problem
I have been asked to do the following.

• Calculate the 10 most frequent words from the text files. Do this for each individual file and for the files as a whole.
• List the top 10 words alphabetically.
• Calculate how many times each banned word was found, both as a whole word and as a sub-string.



I currently have no idea how to go about this.
Anyone have any ideas?
bump

anyone know how to get the frequency of words from a text file?
Problem:
# Read every word in the file and count them all. The 10 highest counts win.
# You have to consider case-sensitivity: are "Word" and "WORD" the same word?

std::map would be nice for this, but there's the question of sorting by value, which a map can't do directly. boost::multi_index_container could help, but this works too -

store everything in a map, this is useful for the frequency part
copy everything to a vector of pairs, this is useful for the sorting part
sort the vector based on frequency, i.e. pair::second
display results

#include <iostream>
#include <map>
#include <vector>
#include <string>
#include <fstream>
#include <algorithm>
#include <cctype>

using namespace std;

void Lower(string &word)
{
	// Cast to unsigned char: passing a negative char to tolower is undefined behaviour.
	for_each(word.begin(), word.end(), [](char &c) { c = tolower(static_cast<unsigned char>(c)); });
}

bool LessThan(const pair<string, size_t> &left, const pair<string, size_t> &right)
{
	if (left.second == right.second) return left.first < right.first;
	return left.second < right.second;
}

int main()
{
	map<string, size_t> words;
	ifstream inf("testdata.txt");
	string word;
	while (inf >> word)
	{
		Lower(word);
		++words[word]; // operator[] value-initializes a new entry to 0 before the increment
	}

	// for (auto m : words) cout << m.first << "\t" << m.second << endl; // display original data
	vector<pair<string, size_t>> freq(words.begin(), words.end());
	sort(freq.begin(), freq.end(), LessThan);
	// for (auto m : freq) cout << m.first << "\t" << m.second << endl; // display new sorted data
	cout << "The 10 most frequently used words are:" << endl;
	for (size_t i = 0; i < 10 && i < freq.size(); ++i) // guard against files with fewer than 10 distinct words
	{
		cout << (freq.crbegin() + i)->first << " " << (freq.crbegin() + i)->second << " times" << endl;
	}
}
FWIW, I run into this problem pretty frequently at work. If you can get the words into individual lines of a file then a unix system can do it easily:
sort | uniq -c | sort -n | tail -10
sort - sort the lines
uniq -c - collapse identical lines and precede them with the number of times they occur
sort -n - now sort the result numerically.
tail -10 - print the last 10 lines of the sorted result: that's the 10 most frequent lines.
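Putting the whole pipeline together: a sketch assuming GNU coreutils, with a tr step first to split the raw file into one word per line and lowercase it (the filename is a placeholder; substitute text1.txt etc.):

```shell
tr -s '[:space:][:punct:]' '\n' < testdata.txt \
  | tr '[:upper:]' '[:lower:]' \
  | sort | uniq -c | sort -n | tail -10
```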
@tipaye

Looks VERY complicated, I don't understand half of it. I shall try to understand it and implement it though, thanks for the reply.

and @dhayden

what is a unix system?