I'm running tests these days that require reading many, many files. Reading is not the most expensive part of each test, but it still takes a considerable amount of time and I'd like to reduce that.
The file format is very simple, but there is some variation in notation that I have to be able to handle:
Line 1: <int: number of lines coming -1> <int OR float, conceptually int> <OPTIONAL int OR float, conceptually int>
Line 2 to Line N+2: <int: counter/junk> <float> <float> <int OR float, conceptually int>
So, as you see, the problem is that even though some values are conceptually ints, they are given in float form (e.g. 300.000), and on top of that there's an OPTIONAL parameter that, if given, can be an int or a float (but should be interpreted as an int).
Right now I'm using filestreams, strings and stringstreams to extract data, but I feel there may be faster ways since the format is rather fixed/predictable and the number of lines is known from the first line.
Any ideas on how to use this to speed up the reading? Or is the fstream/stringstream way as fast as other methods?
How are you storing the data in memory? (Or if we could see something of how the data is being received.)
A number of existing STL implementations are pretty slow when handling translations from string to int, say...
You might be able to get some better speed by being careful how you read the data. Conversion to float is expensive no matter how you do it. For integer fields, read them as integer then just skip the stuff after the decimal point:
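// needs #include <limits>
f >> int_value;                                                  // leading counter
f.ignore( std::numeric_limits<std::streamsize>::max(), ' ' );    // skip to the next field
f >> float_value_1 >> float_value_2;
f >> int_value;                                                  // integer part of e.g. 300.000
f.ignore( std::numeric_limits<std::streamsize>::max(), '\n' );   // drop ".000" and the rest of the line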
Is ignoring cheaper than just reading it into a dummy value? Right now I read, e.g., f >> int_value_a >> int_value_b >> int_value_b;
Here the first _b reads the ".000" part and is then overwritten with the integer part of the second number. In my previous topic someone said that the >> operator doesn't do anything if there's nothing to read, so the code fragment above would read both the (int int) and (float float) formats correctly.
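One thing worth testing before relying on the chained read above: an int extraction stops at the '.', and the next int extraction then fails on ".000" and sets failbit instead of skipping it. A minimal sketch of an alternative is to read the token as a double and convert it. The helper name read_as_int and the choice to round rather than truncate are assumptions, not something from the thread.

```cpp
#include <cassert>
#include <cmath>
#include <istream>
#include <sstream>

// Hypothetical helper: reads one numeric token ("300" or "300.000")
// and stores it as an int. Returns false if no number could be read.
bool read_as_int(std::istream& in, int& out) {
    double tmp = 0.0;
    if (!(in >> tmp)) return false;
    out = static_cast<int>(std::lround(tmp));  // assumption: round to nearest
    return true;
}

// Small self-check showing both behaviours on the same input.
void demo() {
    std::istringstream chained("300.000 7");
    int a = 0, b = 0;
    chained >> a >> b;             // a becomes 300, then '.' stops the second read
    assert(a == 300 && chained.fail());

    std::istringstream via_double("300.000 7");
    int c = 0, d = 0;
    const bool ok = read_as_int(via_double, c) && read_as_int(via_double, d);
    assert(ok && c == 300 && d == 7);
    (void)ok;
}
```

This costs a float conversion per field, so it trades some speed for handling both notations with one code path.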
In any case, some more info on the usage of data:
1) The first line is important data, but those are just two/three numbers.
2) All following lines are "temporary" data: they are coordinates, which are just transformed into a distance matrix. Right now I read them into two vectors (x and y), create the (permanent) distance matrix and discard the vectors. As the size is known, I can use .resize() and operator[] instead of .push_back().
3) I can't forgo temporary storage entirely because of the n² nature: I need the first and the last number at the same time.
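The build-then-discard step described in 2) and 3) could look roughly like the sketch below. It assumes Euclidean distance and a flat row-major matrix, neither of which is stated above, and make_distance_matrix is a made-up name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Builds a flat, row-major n x n matrix from temporary coordinate vectors.
// Assumption: the distance is Euclidean; swap in the real metric if not.
std::vector<double> make_distance_matrix(const std::vector<double>& x,
                                         const std::vector<double>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    std::vector<double> d(n * n, 0.0);       // sized once, no push_back
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            const double dist = std::hypot(x[i] - x[j], y[i] - y[j]);
            d[i * n + j] = dist;             // fill both halves, since the
            d[j * n + i] = dist;             // matrix is symmetric
        }
    }
    return d;
}
```

The x and y vectors can go out of scope as soon as the function returns, so only the matrix stays in memory.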
What I'm mostly hoping to get rid of is the whole filestream -> getline to string -> string to stringstream construct, which sounds incredibly expensive considering the simple format of each line.
> What I'm mostly hoping to get rid of is the whole
> filestream -> getline to string -> string to stringstream construct,
> which sounds incredibly expensive considering the simple format of each line.
There's a convenient function in the sandbox, similar to boost::lexical_cast, that uses Spirit internally. It's in a single file, so you should be able to just move it into your directory. It's called coerce_cast.
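For comparison, here is a standard-library-only sketch that avoids the getline/stringstream round trip: slurp the file into one string and walk it with std::strtod, which accepts both "300" and "300.000". The struct and function names (Problem, parse_line, parse_text, load_file) are invented for illustration, and reading the first header value as "number of data lines minus one" is an assumption based on the format description. Whether this is actually faster should be measured on the real files.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

struct Problem {                 // hypothetical container
    std::vector<int> header;     // 2 or 3 values from line 1
    std::vector<double> x, y;    // coordinates
    std::vector<int> w;          // last column, conceptually int
};

// Parses up to max_vals numbers from the line starting at p, then moves p
// past the terminating '\n' (or to the end of the buffer). Returns the count.
int parse_line(const char*& p, double* out, int max_vals) {
    assert(out != nullptr && max_vals > 0);
    int count = 0;
    while (count < max_vals) {
        while (*p == ' ' || *p == '\t' || *p == '\r') ++p;  // blanks, not '\n'
        if (*p == '\n' || *p == '\0') break;
        char* end = nullptr;
        const double v = std::strtod(p, &end);
        if (end == p) break;                 // not a number: stop here
        out[count++] = v;
        p = end;
    }
    while (*p != '\0' && *p != '\n') ++p;    // discard the rest of the line
    if (*p == '\n') ++p;
    return count;
}

bool parse_text(const std::string& text, Problem& out) {
    const char* p = text.c_str();            // null-terminated buffer
    double v[4];
    const int h = parse_line(p, v, 3);       // third header value is optional
    if (h < 2 || v[0] < 0) return false;
    for (int i = 0; i < h; ++i) out.header.push_back(static_cast<int>(v[i]));
    // Assumption: the first header value is the number of data lines minus one.
    const std::size_t rows = static_cast<std::size_t>(out.header[0]) + 1;
    out.x.reserve(rows);
    out.y.reserve(rows);
    out.w.reserve(rows);
    while (*p != '\0') {
        const int n = parse_line(p, v, 4);   // counter, x, y, last column
        if (n == 0) continue;                // tolerate blank lines
        if (n != 4) return false;
        out.x.push_back(v[1]);
        out.y.push_back(v[2]);
        out.w.push_back(static_cast<int>(v[3]));
    }
    return true;
}

bool load_file(const char* path, Problem& out) {
    std::ifstream f(path, std::ios::binary);
    if (!f) return false;
    const std::string text((std::istreambuf_iterator<char>(f)),
                           std::istreambuf_iterator<char>());
    return parse_text(text, out);
}
```

Calling load_file("some_file.txt", problem) fills the header and the coordinate vectors in a single pass over the buffer, with no per-line string or stringstream objects.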