Memory usage and algorithm efficiency

How can I check how much memory my program is using? I have an assignment which asks me to make sure I'm not exceeding some amount of memory but I don't know how I'd do that on Windows 7.

Also, my program parses through a text file and puts each word in a vector so that I can later go through the vector and "locate" where the nth occurrence of a word is by keeping track of a counter and incrementing it each time I see the word in the vector. Is a vector a bad data structure for this? I'm aiming to keep the efficiency below O(n^2).
On Windows 7, Resource Monitor will tell you how much memory any program is using.
closed account (D4S8vCM9)
Vector is a good choice for your case, since it is (in typical implementations) backed by an array, so its performance at locating the nth element is constant (see http://www.cplusplus.com/reference/vector/vector/operator%5B%5D/).

To get even more performance, make it large enough up front (reserve enough capacity before reading). The file size could be a good starting point for the estimate.
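Something like this, as a minimal sketch - the file name and the 100,000 estimate are just placeholders:

#include <fstream>
#include <string>
#include <vector>

int main()
{
    std::ifstream in("words.txt");   // hypothetical input file
    std::vector<std::string> words;
    words.reserve(100000);           // rough guess at the word count, e.g. derived from the file size

    for (std::string w; in >> w; )
        words.push_back(w);          // no reallocations as long as the guess was big enough
}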
@Velnias75

I have to disagree a bit - the OP wants to use a find algorithm (not locate the nth element), and for a vector that is linear, since in the worst case the algorithm might have to visit all of the elements.

Much better would be a <set> which has O(log n) efficiency for the find algorithm.

http://en.wikipedia.org/wiki/Associative_containers_%28C%2B%2B%29


However, there might be better algorithms than searching through the whole set for each word in the file - performance will worsen as the number of words in the file grows.

The following is just an idea of mine - there are probably even better algorithms out there.

1. Create an array of 26 <map>'s - one for each letter of the alphabet. A <map> stores pairs of values, the first being the lookup key and the second being the value associated with it. Maps are usually implemented as self-balancing binary search trees (typically red-black trees), so they are also quite efficient, with O(log n) lookups and insertions.

std::map<std::string,unsigned int> WordMapArray[26];


http://www.cplusplus.com/reference/map/map/insert/


2. When the next word is read from the file, see if there are any entries yet in the map that corresponds to the first letter of the word (make sure that comparisons are case insensitive). If not, add the word with a count of 1 to that map. Otherwise, see if the word is already in the map using the find algorithm: if it is, increment its count; if not, add it to the map.
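Roughly like this - only a sketch of the idea, where the file name is made up, tolower() handles the case-insensitive part, and tokens that don't start with a letter are simply skipped:

#include <cctype>
#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main()
{
    std::map<std::string, unsigned int> WordMapArray[26];   // one map per first letter

    std::ifstream in("words.txt");                           // hypothetical input file
    for (std::string word; in >> word; )
    {
        // case-insensitive: lower-case the whole word before using it as a key
        for (char& c : word)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));

        if (!std::isalpha(static_cast<unsigned char>(word[0])))
            continue;                                        // skip tokens that don't start with a letter

        // operator[] inserts the word with a count of 0 if it isn't there yet
        ++WordMapArray[word[0] - 'a'][word];
    }

    for (const auto& m : WordMapArray)
        for (const auto& entry : m)
            std::cout << entry.first << " : " << entry.second << '\n';
}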


This approach has the advantage of dividing the problem by 26 and limiting the work the find has to do to the distinct words starting with a given letter, which should be a small fraction of the total word count, with most letters having even fewer. It's hard to say what the average performance would be (often the case with algorithm analysis), but it is clearly much better than scanning the whole accumulated word list each time.

HTH & have fun !!! :D

He wants to find where the nth occurrence of a word is. I would use a map of vectors.

map<string, vector<unsigned> > // map<word, vector<occurrence> >

Then you only have the lookup time of the map to consider, and you can then just do an array lookup on the vector for the nth element. This is also going to be the most memory efficient approach because each string is only stored once.
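As a sketch (the file name, the word "fred" and the 1-based n are just placeholders):

#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main()
{
    std::map<std::string, std::vector<unsigned> > occurrences;  // word -> positions in the file

    std::ifstream in("words.txt");                              // hypothetical input file
    unsigned position = 0;
    for (std::string word; in >> word; ++position)
        occurrences[word].push_back(position);                  // operator[] creates an empty vector the first time

    // where is the 3rd occurrence of "fred"? (n is 1-based here)
    const std::vector<unsigned>& where = occurrences["fred"];
    if (where.size() >= 3)
        std::cout << "3rd \"fred\" is word #" << where[3 - 1] << '\n';
}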

Ok, maybe I misunderstood a little. Good work Zaita, I have been promising myself to learn to read properly for a long time !!!

Say one of the words is "Fred": it appears 37 times in the file, and the 17th occurrence is word number 13091 in the file. We want to be able to do this for each distinct word in the file.

Is this the situation?

You could still use a combination of Zaita's solution and mine (to achieve extra efficiency):

std::map<std::string, std::vector <unsigned int> > WordMapArray[26];

This solution should be scalable - try it on a file with 1 million words. It would be good education I think to compare 2 solutions - one with the array, the other without. Try different sized files - 50,000 & 200,000 & 500,000 & 1,000,000 and see what happens.
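If you want to time it, something along these lines would do - only a sketch, where the file name is a placeholder and you would swap the indexing block in the middle for whichever solution you are testing:

#include <chrono>
#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main()
{
    // read the whole file once so each candidate structure sees identical input
    std::ifstream in("words.txt");                       // hypothetical input file
    std::vector<std::string> words;
    for (std::string w; in >> w; )
        words.push_back(w);

    auto start = std::chrono::steady_clock::now();

    // candidate under test: single map of vectors (swap in the 26-map version to compare)
    std::map<std::string, std::vector<unsigned> > index;
    for (unsigned i = 0; i < words.size(); ++i)
        index[words[i]].push_back(i);

    auto stop = std::chrono::steady_clock::now();
    std::cout << "indexed " << words.size() << " words in "
              << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
              << " ms\n";
}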

My other mistake was that <map> entries are unique, so there is no need to use the find algorithm on them - this is done internally by the insert function. However it is still worth it to have an array of 26 maps to make much less work for the insert function.

A map only stores unique keys, but the insert function returns an iterator to the entry (together with a bool saying whether the insertion actually happened), so even when the key is already there you can use that iterator to push_back the location of the word into the vector.
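Putting the two ideas together, a sketch might look like this - it assumes the words have already been lower-cased, and the file name is made up:

#include <fstream>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main()
{
    std::map<std::string, std::vector<unsigned int> > WordMapArray[26];

    std::ifstream in("words.txt");                     // hypothetical input file
    unsigned int position = 0;
    for (std::string word; in >> word; ++position)
    {
        if (word[0] < 'a' || word[0] > 'z')
            continue;                                  // sketch assumes already lower-cased words

        std::map<std::string, std::vector<unsigned int> >& m = WordMapArray[word[0] - 'a'];

        // insert() returns a pair<iterator, bool>; the iterator points at the existing
        // entry when the word is already there, so we can push_back either way
        // without a separate find()
        auto result = m.insert(std::make_pair(word, std::vector<unsigned int>()));
        result.first->second.push_back(position);
    }
}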


Hope all goes well.


My other mistake was that <map> entries are unique, so there is no need to use the find algorithm on them - this is done internally by the insert function. However it is still worth it to have an array of 26 maps to make much less work for the insert function.


I should think using an unordered_map would be more appropriate.
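As a sketch, the map-of-vectors version with hashing would just be this (same placeholder file name; needs C++11) - average O(1) per word instead of O(log n), at the cost of losing sorted order:

#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

int main()
{
    // same word -> positions idea, but hashed instead of tree-based
    std::unordered_map<std::string, std::vector<unsigned> > occurrences;

    std::ifstream in("words.txt");   // hypothetical input file
    unsigned position = 0;
    for (std::string word; in >> word; ++position)
        occurrences[word].push_back(position);
}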
@TheIdeasMan: Your algorithm with the set would also require O(n log n) time to load the data, so its overall efficiency would be O(n log n)
theranga wrote:
Your algorithm with the set would also require O(n log n) time to load the data, so its overall efficiency would be O(n log n)


The whole process does have to be done n times (once per word) - so that is what you were saying with O(n log n), but it's not as simple as that. I was talking about the efficiency of the find algorithm, not the overall efficiency, which is very different. The frequency of the words is unknown and variable - it depends on the file being analysed.

And I proposed using an array of 26 maps, not a set.

So, for some scenarios:

1. File size - 1 million words. Data structure - one <set> of all the words.

The amount of time taken varies for each word added, and depends on whether the word is already in the set or not, and on how many words there are already.

The time taken to analyse a word varies because:
- The set starts with 1 word and increases up to its maximum, so n varies;
- If the word is not in the set - this takes the longest, but the lookup run by the insert function is O(log n); if the word is already there, it takes somewhat less time on average to find it, though it is still O(log n);
- The frequency of the words means there will be far fewer words in the set than in the file - say there were 500 distinct words in the file - so the maximum set size is 500, not 1,000,000.

2. To make the math easy, say there are 500 distinct words in the file and they all have the same frequency (2,000). Furthermore, the distribution across first letters is uniform (~20 distinct words per letter). It's very hard to guess what these figures might actually be, but the logic is the same. Also, we now use the data structure:

std::map<std::string, std::vector <unsigned int> > WordMapArray[26];

The ideas mentioned in item 1 still apply, but now each map has only ~20 words in it, each with a frequency of 2,000. So this is much better than a single map with 500 items.

3. The same as scenario 2, but the distributions are more realistic. Now, for words whose first letter is much less frequent, the map will have far fewer items - maybe none - so this provides quite a saving, though it is offset somewhat by letters that begin many more words.

It would be interesting to compare cire's suggestion of an unordered_map, which uses hashing for its lookups. I guess it depends on the hashing strategy - I would not be the least surprised if it was better than my little idea.

Hope all goes well.
ok