Word Counter using a map

So I want to count the occurrences of every word in a text file. It scans the file and for each word passes it to my insert function. Here is the code for my insert function:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
unordered_map <string, int>
insert (unordered_map <string, int> map1, string word)
{

    if (map1.find(word) != map1.end())
    {
        map1[word] ++;
    }
    else
    {
        map1.insert(unordered_map<string,int>::value_type(word,1));
    }

    return map1;
}


It works....for small text files. I want to use this for a large text file (750,000 words) but it doesn't complete in a reasonable time. Is there a way I can improve this? Thanks.
Last edited on
Your function takes a map by value (forcing a copy of every single string unless you used std::move) and returns it by value, forcing another copy (and RVO doesn't apply because a function argument cannot be RVO'd.

Pass by reference.

Also, the body of your function is full of redundancies: it's exactly equivalent to simply this:

1
2
3
4
void insert(unordered_map <string, int>& map1, const string& word)
{
    ++map1[word];
}
Last edited on
Why are you using unordered_map? Use map.
You should pass a reference of the map to the function instead. When you pass a map to parameter one of 'insert', it creates a duplicate of the map for the function to use. This isn't a problem with small amounts, but when you get towards 750,000 entries... You can see how creating a duplicate of a container all 750,000 times you call the function could drastically reduce performance

1
2
3
4
5
6
7
void insert(unordered_map<string, int> &map1, string word)
{
    if(map1.find(word) != map1.end())
        map1[word]++;
    else
        map1.insert(unordered_map<string,int>::value_type(word,1));
}


the above code uses a reference for map1.

to use it:
1
2
3
unordered_map<string, int> theOriginalMap; //the already declared map
string word; //filled in with the next word from the text
insert(theOriginalMap, word);


By passing it a reference, the function edits the already existing map instead of creating a new one.

Try that and see if it improves your performance. (i'm almost certain it will.)

EDIT:
@vlad from moscow
unordered_maps are faster than maps to access members by their key, which is what is needed to be done here.
Last edited on
for a large number of small strings, unordered_map is indeed the better choice. E.g. on my computer, reading a file with 250k words into a unordered_map<string, int> using that function (in the variant I posted) takes 0.15 seconds, while reading into a map<string, int> takes 0.22 seconds.
(but of course, there are specialized data structures for strings that should perform better, if insertion remains a bottleneck)
Last edited on
@Thumper

How can I return a map using pass by reference? I need to return a map that has put the values in it. When I try to use a return function it gives an error.
Last edited on
That's the beauty of a reference, you don't have to.
When you pass the map by reference, you're actually altering what i called 'theOriginalMap' in the function 'insert'.
Try it out.

EDIT:
@Cubbi
Off-topic, but I wonder what affect having a solid-state drive would have on that measure. *proceeds to test it out.*
what's the quickest way to put 250k words in a text file? :)
Last edited on
@Thumper Thanks a lot! We learned about different ways to pass parameters but we haven't really applied to so I guess it caught me off guard. I got it working.
@WesleySnipes
Glad to hear!
I'm curious to know how much the amount of time it took for your program to execute decreased.
@Thumper
For a ~500,000 word text file it went from taking too much time for me to wait to taking .073 seconds.
Well.. That's pretty significant then.. Wow! :)
Glad it worked out for you.
Topic archived. No new replies allowed.