I need to load text files faster...

Hello,

I have text files consisting of tens of millions or hundreds of millions of four-column lines. I have a program that plots the data in these files, but loading them is exceptionally slow. Right now, the program loads the data in two steps.

First, it transfers 2 of the 4 columns to a new temporary data file, since the user only needs to look at two of the columns.

Second, the program loads each column from the temporary file into a vector, basically a std::vector<double> x and std::vector<double> y.
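
Roughly, step two looks like this (a simplified sketch, not my exact code; the function name is just for illustration, and the real version also handles the chunked loading described below):

#include <fstream>
#include <string>
#include <vector>

// Simplified sketch of step two: read the two-column temporary file
// into one vector per column.
void loadTempFile(const std::string& path,
                  std::vector<double>& x, std::vector<double>& y)
{
    std::ifstream in(path.c_str());
    double xi, yi;
    while (in >> xi >> yi)   // one x/y pair per line
    {
        x.push_back(xi);
        y.push_back(yi);
    }
}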

Right now, 100 million lines takes about 15 minutes to run through the two steps above. I realize that one obvious thing I can do to cut down on this time is to skip step one, since the data is being loaded into vectors anyway. I could do this, but I'd like to avoid it because the system the program runs on might not have enough memory to hold all the data in the vectors. That's why I create the temporary file: if the system can't hold all the data in memory at once, the program loads data from the temporary file into the vectors dynamically as the requested range moves outside what is currently loaded.

Any ideas? Thanks!
One of the most common mistakes people make with large vectors is to let the runtime constantly reallocate the vector. Every time that happens, the runtime has to allocate new memory for the larger vector, copy the old vector into the new space, and then release the old space. This is very time-consuming.

If you have an idea how large the vector needs to be, use vector::reserve to allocate a sufficiently large vector.
http://www.cplusplus.com/reference/vector/vector/reserve/?kw=vector%3A%3Areserve
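
Something along these lines (just a sketch; it assumes you can estimate the number of lines up front, e.g. from the file size):

#include <cstddef>
#include <vector>

int main()
{
    std::size_t estimatedLines = 100000000;  // rough upper bound for your data
    std::vector<double> x, y;
    x.reserve(estimatedLines);               // one allocation up front
    y.reserve(estimatedLines);
    // ... parse the file and push_back; no reallocation happens
    // until the reserved capacity is exceeded.
}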


What is different about the 2 columns that are read in vs. the 2 columns that are written to the temporary file?

Also, it won't save you any time or space, but it might be a little bit better to encapsulate the x and y coordinates into a pair (or your own Point class). You could then store your data in:
std::vector<std::pair<double, double> > points;
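
Filling and reading it back would look something like this (sketch; t and value stand for whatever you parse from each line):

#include <utility>
#include <vector>

int main()
{
    std::vector<std::pair<double, double> > points;
    double t = 1.005, value = 42.0;               // placeholders for one parsed line
    points.push_back(std::make_pair(t, value));   // one entry per line
    double x = points[0].first;                   // instead of x[i]
    double y = points[0].second;                  // instead of y[i]
    (void)x; (void)y;                             // silence unused-variable warnings
}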

I/O is slow. Unless you get rid of the write, I'm not sure you can speed things up appreciably. Can't you just skip writing the temporary files, and if you need the backup data just read the original file again, but grab the other columns?
Thanks for the responses. I did some simple benchmarking and the biggest time consumer is the read and rewrite operation (which makes sense).

> One of the most common mistakes people make with large vectors is to let the runtime constantly reallocate the vector. Every time that happens, the runtime has to allocate new memory for the larger vector, copy the old vector into the new space, and then release the old space. This is very time-consuming.
>
> If you have an idea how large the vector needs to be, use vector::reserve to allocate a sufficiently large vector.
> http://www.cplusplus.com/reference/vector/vector/reserve/?kw=vector%3A%3Areserve


I'll try this; putting the data into the vectors still takes an appreciable amount of time, and this might speed it up.

> I/O is slow. Unless you get rid of the write, I'm not sure you can speed things up appreciably. Can't you just skip writing the temporary files, and if you need the backup data just read the original file again, but grab the other columns?


This is how I originally had things. The problem is that the data is not serialized very well... each column in the data file may consist of a different number of digits. For example, there is a time column and some of its members may look like this:

1
2
3
.00001
1.005
1093.29031


This means that I can't easily seek to specific lines in the data file, since the lines have a variable number of characters. To get to a certain line, I have to call a readLine() function over and over, and that is very expensive time-wise. So, let's say I have a 100-second file recorded at a 100 kHz sample rate. That means the file has 10 million data points. If I want to access the 9 millionth data point, for example, I have to call readLine() nine million times. Following the example, if I partition the large file into, say, 5 files, then to get the 9 millionth data point I only need to open the last of the five files and call readLine() one million times.

That was my rationale, at least.
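
To be concrete, getting to line n boils down to something like this (a sketch using std::getline in place of my actual readLine(), but the cost is the same idea):

#include <fstream>
#include <string>

// Because lines have variable length, reaching line n means reading and
// discarding every line before it; there is no way to seek directly to it.
std::string readLineN(const char* path, unsigned long n)
{
    std::ifstream in(path);
    std::string line;
    for (unsigned long i = 0; i <= n; ++i)
        std::getline(in, line);
    return line;   // the n-th line (0-based)
}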
Take out the write operation and see how long it takes to run when you just read the data and do the calculations, without writing anything.

Writing takes more time than reading, so if it takes 5 minutes to read and 10 to write, that would suggest everything is about normal. You can improve performance by using an SSD.

Next, make a new program that reads the file and writes it to a new file, with nothing else being done.
It should finish faster, and the difference tells you how much performance you can gain in total by tweaking your code.
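
Something as simple as this would do (a minimal sketch; the file names are placeholders):

#include <fstream>
#include <string>

// Copy the file line by line, doing nothing else. The time this takes is
// roughly the floor set by disk I/O alone; compare it against the current
// 15 minutes to see how much there is to gain.
int main()
{
    std::ifstream in("data.txt");    // placeholder input name
    std::ofstream out("copy.txt");   // placeholder output name
    std::string line;
    while (std::getline(in, line))
        out << line << '\n';
}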

If you have 2 hard drives, you can get a little better performance by reading from one and writing to the other. I don't know how much, but a single drive can't do both at the same time. I would expect 10-20%.

