Sorting a huge text file

Hi,

I have huge text files that I would like to sort... they range from 90,000 KB to 120,000 KB. I have access to a machine with 32 GB of RAM... what would be the best way to do the sorting?

Thank you...
Well, I have written the following code:

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <algorithm>
using namespace std;

int main()
{
    ifstream indata;
    ofstream outdata;
    string line;
    vector<string> myvector;

    indata.open("file1.txt");
    outdata.open("file2.txt");

    if (!indata)
        return 0;

    while (getline(indata, line))
        myvector.push_back(line);

    sort(myvector.begin(), myvector.end());

    for (vector<string>::iterator it = myvector.begin(); it != myvector.end(); ++it)
        outdata << *it << endl;

    indata.close();
    outdata.close();

    return 0;
}

It works perfectly with small text files... I'm looking for better code for huge text files... Actually, the above code is still running on my huge text file... I really don't know whether it can handle it or not, but in any case, I would really appreciate it if someone could suggest a better way...

Thank you...
This is pretty much what I was going to suggest you try, before refreshing the page and noticing your second post.

I really don't know if it can handle it or not but in any case, I would really appreciate if someone can give me a better way...

For performance your current code is already very good, in my opinion.

What you could do is reserve() memory in myvector.
This should improve performance a bit, because repeated push_back() calls won't keep triggering memory reallocations.

http://www.cplusplus.com/reference/vector/vector/reserve/

vector<string> myvector;

myvector.reserve(HOW_MANY_LINES);

// where HOW_MANY_LINES is approximately the
// maximum number of lines you expect to read


Other minor suggestions: clean up your code a little, and be sure to turn on compiler optimizations:

ifstream indata("file1.txt");   // the constructor opens the file,
ofstream outdata("file2.txt");  // so separate open() calls are not needed
string line;
vector<string> myvector;

myvector.reserve(1000); // you can probably set this much higher, if you have 32 GB of RAM

// ...

for (vector<string>::const_iterator it = myvector.begin(); it != myvector.end(); ++it)
    outdata << *it << endl;

indata.close();
outdata.close();


See your IDE's or compiler's documentation about how to turn on optimizations.
With GCC, you pass the -O3 argument to g++.
With Visual Studio, you build in Release mode.
Thank you very much for your reply... I will try your suggestions...
You could use ifstream::pos_type for random access. That is the only way I see it. Otherwise, I would be looking for some sort of hashing scheme.
Just as a side note, for very large files, std::vector is not the best choice. Use a std::deque instead. It won't cost you significant performance (if any), and will work better with system memory.