Parse raw data fast with specific format

I am searching for ways to read a text file fast, really fast. I have thought of many solutions but could not arrive at an optimal one. Let me describe my problem and then I shall write down the things I have already tried.

Problem statement:

Say I have a 10 GB text file and the format of that file is:

___PART_1___
*1 abc
*2 def
...
<5 million lines of this format>
*5000001 blah

___PART_2___
1 *1:1 *2:2 <value1>
2 *3:1 *4:3 <value2>
3 *4:2 *4:4 <value3>
<another 10 million lines of this format>
In _PART_1_, there are two columns: ID and NAME. In _PART_2_, there are four columns: serial number, data1, data2, some_value.

What we want from this huge file is the data before the colons in the data1 and data2 columns. In this case, we would want:

From the 1st row of _PART_2_, extract *1 and *2, and get the corresponding names from _PART_1_, which are abc and def in this case. From the 2nd row of _PART_2_, extract *3 and *4, and get the corresponding names from _PART_1_, whatever they are.

That is all information we want.

Things to keep in consideration before we jump to conclusions:

In _PART_1_, IDs might not be unique or consecutive, and there could be any number of lines; 5 million is just a number.

In _PART_2_, it is certain that there will be an entry in _PART_1_ for the data before the colons in the data1 and data2 columns of _PART_2_.

Tried so far:

Number 1 : I tried keeping _PART_1_ in a map, but since the number of entries is quite big, the balancing alone would take much time, so I settled on unordered_map (and will write a good hashing function for it as well). Then, whenever I reach _PART_2_, I tokenize that line, get the second/third token, tokenize those again and get the data. Finally, I look them up in the unordered_map. I used boost::tokenizer for the tokenizing.

Number 2 : Instead of boost::tokenizer, I worked with regex_search as well, but it also seems to be slow.

Number 3 : I mapped the file into memory using mmap, but since the file is huge, my program sometimes ran out of memory.

Code snapshot, not full code:

#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>
#include <boost/algorithm/string.hpp>
#include <boost/tokenizer.hpp>

typedef boost::tokenizer<boost::char_separator<char> > tokenizer;
typedef std::unordered_map<std::string, std::string> m_unordered;
typedef std::unordered_map<std::string, std::string>::iterator m_unordered_itr;

int main() {
    m_unordered un_name_map;
    m_unordered_itr un_name_map_itr;
    boost::char_separator<char> space_sep{" "};
    std::ifstream myfile("file.txt");
    if (myfile.is_open()) {
        std::string line;
        bool part1_starts = false;
        bool part2_starts = false;
        while (std::getline(myfile, line)) {
            if (line.find("___PART_1___") != std::string::npos) {
                part1_starts = true;
                continue;
            }
            if (line.find("___PART_2___") != std::string::npos) {
                part2_starts = true;
                part1_starts = false;
                continue;
            }
            if (line.empty()) {
                continue;
            }
            if (part1_starts) {
                // ___PART_1___ row: "<id> <name>" -> store it in the map
                tokenizer tok{line, space_sep};
                tokenizer::iterator it = tok.begin();
                std::string index = *it++;
                std::string value = *it;
                un_name_map.insert(un_name_map.end(), {index, value});
            }
            if (part2_starts) {
                tokenizer tok{line, space_sep};
                tokenizer::iterator it_start = tok.begin();
                // Ignore the first token (serial number) and advance
                std::advance(it_start, 1);

                // Split the second token, which is my second column of ___PART_2___
                std::vector<std::string> strs;
                strs.reserve(2);
                boost::split(strs, *it_start, boost::is_any_of(":"));
                un_name_map_itr = un_name_map.find(strs[0]);
                if (un_name_map_itr != un_name_map.end()) {
                    std::cout << "1. Name from the map is " << un_name_map_itr->second << std::endl;
                }

                // Split the third token, which is my third column of ___PART_2___
                // Similar code as above.
            }
        }
    }
}


I am sure there are better ways to arrive at the solution described above, and I look forward to all of them. The only thing I care about is 'SPEED'. I will be happy to write more details if needed.
You should have learnt to use [code][/code] tags by now.

> Only and only thing I care about is 'SPEED'.
Your average instruction execution time is measured in nanoseconds, and your average seek time of a spinning hard disk is measured in milliseconds (that's 6 to 8 orders of magnitude slower). With a 10G file, your hard disk is going to be doing a lot of seeking, which means your code is going to be doing a lot of waiting.

This measures your lower bound on how fast you can read the file with trivial processing. Consider it a benchmark to measure your Number 1, 2, 3, ... approaches against.
std::ifstream myfile("file.txt");
if (myfile.is_open()) {
    std::string line;
    int nlines = 0;
    while ( std::getline (myfile,line) ) {
        nlines++;
    }
    std::cout << nlines << std::endl;
}

While it's doing that, use whatever process monitoring tools you have to see how much time is spent in your code. It won't be much.

So how often do you need to process a 10 GB file? Where is it coming from, and who needs the results of your analysis? To me, it seems like a run-once-a-day kind of problem.

If the benchmark above takes say 10 minutes, there is nothing you could ever do to make it run in 10 seconds. Accept that it's going to take 10 minutes and tell the users to take a coffee break.

Similarly, if the benchmark is 10 minutes, your #1 approach is 11 minutes and your #2 approach is 12 minutes (but the code is a lot cleaner for you to maintain), then users are not going to change their work pattern. It's still a coffee break interval to them.

If the ballpark is an hour, then they'll go to lunch instead.

They won't care about +/-10% around whatever baseline time it takes.

They WILL care about you being late delivering s/w because you're too obsessed with saving milliseconds on a program which takes minutes or hours to run.
Don't read 1 line at a time from your file. Read a large amount (like a meg) at a time. Process what you read, and when you consume it, read the next chunk. That will speed up your program significantly.

Allocate char buffer[<BIG SIZE>] and read from the file into buffer. Use an istream based on memory (see my post at the bottom of this thread: http://www.cplusplus.com/forum/general/226786/ )

You could write your own istream that is based on the memory istream, and every time a read encounters underflow, your class reads more from the file and continues. I am sure there are other corner cases you have to account for, but the general case should be easy to visualize.
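Something along these lines, as a minimal sketch (the class name ChunkedFileBuf and the 1 MB chunk size are just placeholders, not anything from the linked post). It reads the file in large blocks with fread and hands them to the stream layer in underflow(), so the existing std::getline loop can stay untouched:

#include <cstdio>
#include <istream>
#include <streambuf>
#include <vector>

class ChunkedFileBuf : public std::streambuf {
public:
    explicit ChunkedFileBuf(const char* path, std::size_t chunk = 1 << 20)
        : file_(std::fopen(path, "rb")), buf_(chunk) {}
    ~ChunkedFileBuf() { if (file_) std::fclose(file_); }

protected:
    int_type underflow() override {              // get area exhausted: refill it
        if (!file_) return traits_type::eof();
        std::size_t n = std::fread(buf_.data(), 1, buf_.size(), file_);
        if (n == 0) return traits_type::eof();
        setg(buf_.data(), buf_.data(), buf_.data() + n);
        return traits_type::to_int_type(*gptr());
    }

private:
    std::FILE* file_;
    std::vector<char> buf_;
};

// Usage: the parsing loop is unchanged, it just reads from this stream instead.
// ChunkedFileBuf buf("file.txt");
// std::istream in(&buf);
// std::string line;
// while (std::getline(in, line)) { /* tokenize as before */ }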
Reading the whole file into memory is an option too; most machines have 10 GB of memory -- my personal one has 32, my work server has over 100.

Then split the file into N chunks and process it with N threads. Since it is one big chunk already, this is just pointer offsets into it. If the lines are all the same length it's easy enough; if not, you can go to where you want to split byte-wise and adjust the pointers, looking forward/backward for an end-of-line character to fine-tune it.

Then just hammer it with high performance code in the worker threads.
There may be a way to process part 1 and part 2 together (unclear) via threading, or you may have to delay part 2 and do some sort of lookup.

I see no reason this task should take more than 5 - 10 min tops.
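Rough sketch of the chunking, assuming the whole file already sits in one std::string or mapped buffer (the function names here are made up, and the worker just counts lines as a stand-in for the real PART_2 parsing, which would need the PART_1 table built first):

#include <atomic>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

std::atomic<std::size_t> total_lines{0};

// Stand-in worker: count lines in [begin, end); the real one would parse rows.
void process_range(const char* begin, const char* end) {
    std::size_t n = 0;
    for (const char* p = begin; p != end; ++p)
        if (*p == '\n') ++n;
    total_lines += n;
}

void process_in_parallel(const std::string& data, unsigned nthreads) {
    std::vector<const char*> cuts;
    cuts.push_back(data.data());
    for (unsigned i = 1; i < nthreads; ++i) {
        std::size_t pos = data.size() * static_cast<std::size_t>(i) / nthreads;
        pos = data.find('\n', pos);          // nudge the cut to an end-of-line
        pos = (pos == std::string::npos) ? data.size() : pos + 1;
        cuts.push_back(data.data() + pos);
    }
    cuts.push_back(data.data() + data.size());

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < nthreads; ++i)
        workers.emplace_back(process_range, cuts[i], cuts[i + 1]);
    for (auto& t : workers) t.join();
}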
Ah, premature optimisation disease.
http://wiki.c2.com/?PrematureOptimization

> Ah, premature optimisation disease.

It's OK - it happens to lots of programmers!

;)
> In _PART_1_, There are two columns, ID and NAME
Is that a numeric ID? If there are 5 million lines then are the IDs restricted to 1 to 5,000,000? If they are numeric and not too sparse then you could use a vector indexed by ID instead of an unordered map.
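For example, if the IDs really are of the form *<small integer>, something like this sketch (names invented, just for illustration) replaces the hash lookup with direct indexing:

#include <cstddef>
#include <string>
#include <vector>

std::vector<std::string> names;          // names[id] holds the NAME for that ID

void add_part1_entry(const std::string& id_token, const std::string& name) {
    std::size_t id = std::stoul(id_token.substr(1));   // "*123" -> 123
    if (id >= names.size()) names.resize(id + 1);
    names[id] = name;                                   // later lookup: names[id]
}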

If you must use an unordered map then be sure to construct it with plenty of buckets. If you expect around 5 million items then I'd suggest 1 million or 2 million buckets.
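For instance (the counts here are only a guess at your data size):

#include <string>
#include <unordered_map>

std::unordered_map<std::string, std::string> un_name_map(2000000);  // ~2 million initial buckets
// Alternatively, on an already-constructed map:
//   un_name_map.reserve(5000000);   // room for ~5 million elements before inserting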

I believe that stream I/O is still relatively slow. Consider switching to C's FILE I/O. Set a large buffer size (maybe 1MB). Be sure to put big buffers on the output as well as the input.
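A sketch of that approach, using setvbuf with a 1 MB buffer and fgets for line reading (the buffer sizes and helper name are placeholders):

#include <cstdio>
#include <vector>

void read_with_big_buffer(const char* path) {
    std::FILE* in = std::fopen(path, "rb");
    if (!in) return;
    std::vector<char> iobuf(1 << 20);                       // 1 MB stdio buffer
    std::setvbuf(in, iobuf.data(), _IOFBF, iobuf.size());   // must be set before the first read
    char line[4096];
    while (std::fgets(line, sizeof line, in)) {
        // tokenize 'line' here instead of using std::getline on an ifstream
    }
    std::fclose(in);
}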

To get a better lower bound on the speed, copy the file using the command line tools and time how long it takes. You may find that this is much faster than salemc's program that uses iostreams. If you can process a file in 2-3 times the time it takes to simply copy it then that's probably fast enough.

If that isn't fast enough then get more RAM and go back to memory-mapping the input file. You might still need FILE I/O for the output.

> That is all information we want.

Exactly what is the output you need? Must it come in the same order as the data in PART 2? I'm asking because sometimes a subtle restriction can be exploited to optimize a program.

> In _PART_1_, ID's might not be unique
What should the output be if there are multiple entries in PART 1 for an ID?