Organize terms

I need to sort the 5 million terms in a .txt file by number of occurrences. What software can I use, or how can I do it in C++?

Hello victorio,

That would depend on what a "term" is and if you have enough memory to work with the file.

A small sample of the input file would help.

The first thing I would do is write the code to open the file and get reading from it working.

Once you get this working post the code so everyone has a place to start.

Since we don't know how much you know, some responses could be beyond your level. Be prepared.

Andy
You can read the text in, a term at a time, and count in a map how many of each term there are.

What does "organise" mean? What do you want to do with this information?
5 million is nothing to a modern computer.
A brute-force hash and count would do this in a few seconds on a cheap laptop.

For reference, std::sort can sort 6 million items in < 2 seconds single-threaded on the old, weak laptop I use at work.

We need details if you need to do something exotic, but do not let a number in the low millions worry you. It will most likely fit entirely in RAM in one block, and can be processed at astonishing speeds. If it's too large to put into RAM, it may require tens of seconds instead, but even so, are you concerned about the time it will take, or something else?
Guys, I just need to sort the .txt by number of occurrences, with the most frequent word first. The number of occurrences should appear next to the word.

Here is an example:
more
rice 
nice 
apple
more 
rice
more
.
.
.


And it should return:

more  3
rice  2 
nice  1
apple 1
.
.


Do you know any software for this?
#include <iostream>
#include <fstream>
#include <sstream>
#include <map>
#include <vector>
#include <algorithm>
#include <string>
using namespace std;

struct PR
{
   string data;
   int count;
};


int main() 
{
   map<string,int> freq;
// ifstream in( "data.txt" );
   istringstream in( "more \n"
                     "rice \n"
                     "nice \n"
                     "apple\n"
                     "more \n"
                     "rice \n"
                     "more \n" );

   for ( string s; in >> s; ) freq[s]++;                             // frequency table

   vector<PR> pr;
   for ( auto e : freq ) pr.push_back( { e.first, e.second } );      // put into data/count container

   sort( pr.begin(), pr.end(), []( PR a, PR b ){ return a.count > b.count; } );     // sort descending

   for ( auto p : pr ) cout << p.data << '\t' << p.count << '\n';                   // output
}


more	3
rice	2
apple	1
nice	1
Easy one-liner :)
 
awk '{ sum[$1]++ } END { for ( i in sum ) { print i,sum[i] } }' file.txt
@lastchance Printed just 1 occurrence.
victorio wrote:
Printed just 1 occurrence.


I'm not sure what you are trying to say, @victorio.

My code printed the output given in my post, with a stringstream replacing the input file, so that you could try it in cpp.sh.

If you have changed your input file format then please let us know.
You could also add an overloaded operator < to struct PR.
Later you could either use a std::vector like now or insert the elements in a std::set (but this could be less efficient).
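
A rough sketch of that suggestion follows. The tie-break on the word is an assumption added here: without it, two different words with the same count would compare as equivalent and a std::set would keep only one of them. The helper names sort_by_count and to_ordered_set are made up for illustration.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

struct PR
{
    std::string data;
    int count;
};

// Order by descending count; break ties alphabetically so that two
// different words never compare as equivalent.
bool operator<(const PR& a, const PR& b)
{
    if (a.count != b.count) return a.count > b.count;
    return a.data < b.data;
}

// Option 1: keep the vector and sort it; std::sort uses operator< by default.
void sort_by_count(std::vector<PR>& pr)
{
    std::sort(pr.begin(), pr.end());
}

// Option 2: copy into a std::set, which keeps its elements ordered by operator<.
std::set<PR> to_ordered_set(const std::vector<PR>& pr)
{
    return std::set<PR>(pr.begin(), pr.end());
}
```

With operator< in place, the lambda passed to sort is no longer needed. The set version does one node allocation per element, which is probably why it could be slower than sorting the vector once.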
I used a .txt for input terms. @lastchance.
victorio wrote:
I used a .txt for input terms. @lastchance.

I'll take your word for it, but what file did you use, what did you call it, how did you amend the code, and what exactly do you mean by "Printed just 1 occurrence"?

If I put the contents of the stringstream in a file instead then it gives the same output as before.
I used this code, and it returned only words with 1 occurrence.

#include <iostream>
#include <fstream>
#include <sstream>
#include <map>
#include <vector>
#include <algorithm>
#include <string>
using namespace std;

struct PR {
	string data;
	int count;
};

//int main() {
//	map<string, int> freq;
//	ifstream in("termos1_full.txt");
//
//	for (string s; in >> s;)
//		freq[s]++;                             // frequency table
//
//	vector<PR> pr;
//	for (auto e : freq)
//		pr.push_back( { e.first, e.second });   // put into data/count container
//
//	sort(pr.begin(), pr.end(), []( PR a, PR b ) {return a.count > b.count;}); // sort descending
//
//	for (auto p : pr)
//		cout << p.data << '\t' << p.count << '\n';                   // output
//}

My .txt has 5 million words:
em
o
governo
publicada
hoje
revela
um
dado
supreendente
recusando
.
.
.
@victorio,

I used the following code
#include <iostream>
#include <fstream>
#include <sstream>
#include <map>
#include <vector>
#include <algorithm>
#include <string>
using namespace std;

struct PR
{
   string data;
   int count;
};


int main() 
{
   map<string,int> freq;
   ifstream in( "data.txt" );

   for ( string s; in >> s; ) freq[s]++;                             // frequency table

   vector<PR> pr;
   for ( auto e : freq ) pr.push_back( { e.first, e.second } );      // put into data/count container

   sort( pr.begin(), pr.end(), []( PR a, PR b ){ return a.count > b.count; } );     // sort descending

   for ( auto p : pr ) cout << p.data << '\t' << p.count << '\n';                   // output
}


with the following file data.txt:
em
o
governo
publicada
hoje
revela
um
dado
supreendente
recusando


and it gave the following output:
dado	1
em	1
governo	1
hoje	1
o	1
publicada	1
recusando	1
revela	1
supreendente	1
um	1


This is perfectly correct ... because the input file contains no repeats!!!
It returns this:
consorte	1
consorciaram	1
consonancias	1
consomes	1
consolidaram	1
conspiraram	1
consolados	1
consoladores	1
consoladora	1
consolado	1
consolada	1
consolacoes	1
.
.
.
Well, @victorio, I don't see any repeats amongst those words, either.

I really don't understand what you expect. If you expect me to de-conjugate all your verbs, forget it.

I think the problem is my .txt file. I tested with other files and it ran.
Now I need a similar code, but there are two words on the same line.
#include <iostream>
#include <fstream>
#include <sstream>
#include <map>
#include <vector>
#include <algorithm>
#include <string>
using namespace std;

struct PR
{
   string data;
   int count;
};


int main() 
{
   map<string,int> freq;
   ifstream in( "data.txt" );

   for ( string s; in >> s; ) freq[s]++;                             // frequency table

   vector<PR> pr;
   for ( auto e : freq ) pr.push_back( { e.first, e.second } );      // put into data/count container

   sort( pr.begin(), pr.end(), []( PR a, PR b ){ return a.count > b.count; } );     // sort descending

   for ( auto p : pr ) cout << p.data << '\t' << p.count << '\n';                   // output
}


rice nice
great bad
word work
.
.
.
Now I need a similar code, but…

From what I can see, you need exactly lastchance’s code:
#include <fstream>
#include <iostream>
#include <map>
#include <string>


int main() 
{
   std::map<std::string, int> freq;
   std::ifstream in( "data.txt" );

   for ( std::string s; in >> s; /**/ ) {
       ++freq[s];
   }

   for (const auto& [s, i] : freq ) {
       std::cout << s << "\t\t" << i << '\n';
   }
}


Output:
bad             1
great           1
nice            1
rice            1
word            1
work            1

Sorry, I need to sort by the occurrences of the pair, i.e. how many times each word pair occurs in the .txt. It should print:
rice nice        1
great bad      1
work word     1
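
One possible way to count pairs instead of single words is sketched below. It assumes each line holds exactly one pair, that the order of the two words matters ("rice nice" and "nice rice" are different pairs), and that anything after the second word on a line can be ignored. The name count_pairs is made up for this example, and the generic lambda needs C++14 or later.

```cpp
#include <algorithm>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: treat the first two words of each line as one key
// and return the (pair, count) entries sorted by descending count.
std::vector<std::pair<std::string, int>> count_pairs(std::istream& in)
{
    std::map<std::string, int> freq;
    for (std::string line; std::getline(in, line); )
    {
        std::istringstream words(line);
        std::string first, second;
        if (!(words >> first >> second)) continue;   // skip lines with fewer than two words
        ++freq[first + ' ' + second];                // normalized key: single space between words
    }

    std::vector<std::pair<std::string, int>> result(freq.begin(), freq.end());
    // stable_sort keeps the map's alphabetical order among pairs with equal counts
    std::stable_sort(result.begin(), result.end(),
                     [](const auto& a, const auto& b) { return a.second > b.second; });
    return result;
}
```

Call it with a std::ifstream opened on the file and print each entry's first and second members separated by a tab. Extracting the two words with >> rather than using the raw line as the key means extra spaces or a trailing '\r' from Windows line endings do not create spurious distinct pairs.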