Am I doing this wrong? (working with massive data files)

Hello,

I am currently creating an application in Qt Creator that will plot data and perform some manipulations/searches on the data. As stated in the title, these data sets are huge; they usually consist of ~5 million data points. The files containing the data are in a plain text format with 2 columns for the x- and y-coordinate of a point. I'll explain what I've done so far and my reasoning for doing so. It'd be great to get some feedback on whether this method I'm using is a good way of doing things. I've come up with it entirely on my own, so I can understand if it is not the best method :)

The way I've been managing these large data sets so far is by first breaking each massive file into several smaller, cropped files. I do this because it is much faster to access rows in a small file than in a large one. For example, if I kept the single large file and wanted to look at the last data point, I would have to scan through every preceding row (millions of row-reads) just to reach it. With the file broken down, if I want to access the last data point I only need to identify which of the smaller files contains it and then scan over the lines of that one file.

Now, once the files are cropped I need to actually work with the data. To do this, I read a segment of the data into two vectors, one holding the x-coordinates and the other the y-coordinates of the points. Once the data is read in, I only need to work with those vectors. Whenever I need to look at a part of the data that is out of range of the two vectors, I just read the new range from my text files into the vectors.
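To give a rough idea, the windowed read looks something like this (the file name and point range are made up here, but it's the general shape of what I do):

```cpp
#include <fstream>
#include <string>
#include <vector>

// Read points [first, last) of one cropped file into x/y vectors.
void load_window(const std::string& file_name, long first, long last,
                 std::vector<double>& x, std::vector<double>& y)
{
    std::ifstream file(file_name);
    x.clear();
    y.clear();
    double xv, yv;
    for (long i = 0; i < last && file >> xv >> yv; ++i)
        if (i >= first) {
            x.push_back(xv);
            y.push_back(yv);
        }
}
```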

That should give a general idea of what I'm doing, but if I need to provide more details I am happy to do that. Also, I should add that the program is working right now but is somewhat slow. Just loading a file of the size stated above takes about 1 minute, and moving the plot range quickly is pretty choppy. Ideally, I'd like to make it faster.

Any thoughts?

Thanks!
> they usually consist of ~5 million data points.
> The files containing the data are in a plain text format with 2 columns for the x- and y-coordinate of a point.

How much memory would it take if you read all the 5 million points into a vector?
Can't you spare 50 or 100 MB for this?
http://coliru.stacked-crooked.com/a/26449d000899201f
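(5 million points at two 8-byte doubles each is roughly 80 MB.) Something along these lines, assuming the two columns are whitespace-separated; the struct and file name are placeholders:

```cpp
#include <fstream>
#include <iostream>
#include <vector>

struct Point { double x, y; };

int main()
{
    std::ifstream file("data.txt");   // placeholder file name
    std::vector<Point> points;
    points.reserve(5000000);          // rough estimate to avoid repeated reallocations

    Point p;
    while (file >> p.x >> p.y)        // two whitespace-separated columns per line
        points.push_back(p);

    std::cout << points.size() << " points, roughly "
              << points.size() * sizeof(Point) / (1024 * 1024) << " MB\n";
}
```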

Hint: For fast lookups, sort the vector first; then binary search with std::lower_bound() and friends
http://en.cppreference.com/w/cpp/algorithm/lower_bound
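For instance, a lookup sketch assuming the points are stored in a single vector sorted by x (the Point struct and function name are just illustrative):

```cpp
#include <algorithm>
#include <vector>

struct Point { double x, y; };

// First point whose x-coordinate is not less than x0.
// Assumes 'points' is sorted by x.
std::vector<Point>::const_iterator
find_from(const std::vector<Point>& points, double x0)
{
    return std::lower_bound(points.begin(), points.end(), x0,
                            [](const Point& p, double x) { return p.x < x; });
}
```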
Thanks for taking the time to respond.

> How much memory would it take if you read all the 5 million points into a vector?
> Can't you spare 50 or 100 MB for this?


I'm not really sure; the computers running this program should all have at least 4 GB of RAM. To be honest (thinking about it now, this sounds pretty naive), when I started writing this program I just assumed that the total number of data points couldn't be stored in a single vector; it just seemed too large. It may well be that they all fit comfortably in one. The problem is that I don't really understand how a computer's memory management works, so I don't know whether the system will have that much memory to spare.

> Hint: For fast lookups, sort the vector first; then binary search with std::lower_bound() and friends
> http://en.cppreference.com/w/cpp/algorithm/lower_bound


I forgot to mention that the data I'm plotting is a time-series sampled at a constant rate, so the data comes in already sorted. In that case, unless I'm misunderstanding the second link you posted, I don't think a fast lookup is necessary. I'll keep that in mind for the future, though, and please correct me if I am wrong about this.
If the data comes in already sorted, you wouldn't need to sort it yourself.

A fast lookup (binary search) isn't required to move the plot range quickly; since the points are sampled at a constant rate, the position in the vector can be computed directly.
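For example, something like this gives the index directly (t0 and sample_rate stand in for whatever your data actually uses):

```cpp
#include <cstddef>

// Index of the sample closest to time t, for a series that starts at time t0
// and is sampled sample_rate times per second. Parameter names are illustrative.
// The caller should still clamp the result to the vector's size.
std::size_t index_at(double t, double t0, double sample_rate)
{
    double i = (t - t0) * sample_rate + 0.5;   // round to the nearest sample
    return i < 0.0 ? 0 : static_cast<std::size_t>(i);
}
```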
As JLBorges pointed out, the best thing is to read the data into a vector and then process it from there. If you know the approximate size of the data then you can pre-size the vector. It will still expand as needed but this will help.

Let's take a step back: do you need random access to the data, or is it serial? In other words, do you read a point, process it, and move on to the next point (or maybe read a few points and then move on), or do you need access to all the points at once? If the access is serial then you should process it serially rather than trying to access it in bits and pieces.

Another point: I'm a little surprised that it takes a full minute to read 5 million data points. The reading part should happen as fast as the disk can read the data. If you're using C++ streams then you might want to give the file a larger streambuf.
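For example (the buffer size here is an arbitrary choice, and pubsetbuf() generally has to be called before the file is opened to have any effect):

```cpp
#include <fstream>
#include <vector>

int main()
{
    std::vector<char> buf(1 << 20);   // 1 MB buffer; the size is an arbitrary choice
    std::ifstream file;
    file.rdbuf()->pubsetbuf(buf.data(),
                            static_cast<std::streamsize>(buf.size()));  // before open()
    file.open("data.txt");            // placeholder file name
    // ... read the points as usual ...
}
```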

A final possibility: create an auxiliary program to read the input data and write it into a binary file instead of a text file. Then you can seek around the file all you want. A vector would still be faster, but this is an option.
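A rough sketch of that approach, assuming the binary file is written and read back on the same machine (so the raw double layout matches); names are placeholders:

```cpp
#include <cstdint>
#include <fstream>

struct Point { double x, y; };

// One-off converter: text in, raw binary out.
void convert(const char* text_name, const char* bin_name)
{
    std::ifstream in(text_name);
    std::ofstream out(bin_name, std::ios::binary);
    Point p;
    while (in >> p.x >> p.y)
        out.write(reinterpret_cast<const char*>(&p), sizeof p);
}

// Random access: read the i-th point from a file opened with std::ios::binary.
Point read_point(std::ifstream& bin, std::uint64_t i)
{
    Point p;
    bin.seekg(static_cast<std::streamoff>(i * sizeof(Point)));
    bin.read(reinterpret_cast<char*>(&p), sizeof p);
    return p;
}
```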