Open ~500 files with ofstream

Hi,

I have a single text file with approx. 5 GB of coordinate and height data. Now I want to sort and filter this data by certain criteria and write the results into ~500 different text files. Since I do not have the possibility to hold the whole data set in RAM, I need to process the data directly from file to file. Until now I have had a single ofstream for output, and I always open, write, and close the right file. The big problem is the .open() operation, which always takes an extremely long time, especially once the files are >20 MB. Consequently I would like to have all the files open permanently. The question is how to do this?
Can anyone provide an example of a list of files or similar, so I can access
File1
File2
File3
File4
....
by a command like:

std::string filename = "File" + std::to_string(n);

?

Thank you
Think about how you will decide which file to write each record to as you process the data. Create an array or map of ofstream based on that (see the sketch at the end of this post).

Note that you may run into an operating-system limit on the number of simultaneously open files. UNIX-based systems will let you raise the limit on a per-process basis.
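
On a POSIX system, for example, the soft limit can be raised with setrlimit(). A minimal sketch; the target of 600 descriptors is just an illustrative value, and the hard limit still caps it:

#include <sys/resource.h>
#include <cstdio>

int main()
{
    struct rlimit lim ;
    if( getrlimit( RLIMIT_NOFILE, &lim ) == 0 )
    {
        // raise the soft limit on open file descriptors,
        // but never beyond the hard limit lim.rlim_max
        if( 600 <= lim.rlim_max ) lim.rlim_cur = 600 ;
        else lim.rlim_cur = lim.rlim_max ;

        if( setrlimit( RLIMIT_NOFILE, &lim ) != 0 ) std::perror( "setrlimit" ) ;
    }
}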

You mentioned sorting the data. 5 GB is a lot of data, so sort it once into an intermediate file before you write your processing code. That will make debugging much easier.

Finally, avoid using endl (e.g. file[300] << "This is slow" << endl;) because it flushes the stream. Buffering will be critical to performance.
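
Something along these lines, perhaps. A minimal sketch; the stream_for() helper and the "File" + number naming scheme are made up for illustration:

#include <fstream>
#include <map>
#include <string>

// look up the stream for file number n, opening it on first use
std::ofstream& stream_for( std::map<int, std::ofstream>& files, int n )
{
    auto it = files.find(n) ;
    if( it == files.end() )
        it = files.emplace( n, std::ofstream( "File" + std::to_string(n) ) ).first ;
    return it->second ;
}

int main()
{
    std::map<int, std::ofstream> files ; // one persistent stream per output file

    stream_for( files, 300 ) << "this stays buffered" << '\n' ; // '\n' instead of endl: no flush
    // all streams are flushed and closed when the map is destroyed
}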
> I would like to have all the files open permanently.

As far as possible, limit the lifetime of an object to the period when the object is actually required.
In this case: even if the implementation allows it, do not keep hundreds of files open simultaneously.


> Until now I have had a single ofstream for output, and I always open, write, and close the right file.
> The big problem is the .open() operation, which always takes an extremely long time

This is implementation-dependent: creating a fresh object of type std::ofstream for each file could be faster.

#include <iostream>
#include <fstream>
#include <string>

int main()
{
    std::ifstream input_file( __FILE__ ) ; // this file

    int n = 0 ; // output file sequence number
    const std::string file_name_prefix = "output_" ;
    const std::string file_name_suffix = ".txt" ;

    std::string line ;
    while( std::getline( input_file, line ) )
    {
        // a different ofstream object for each file
        std::ofstream output_file( file_name_prefix + std::to_string(++n) + file_name_suffix ) ;
        output_file << line << '\n' ;
    }
}


If that is not fast enough, work with two (just two) std::ofstream objects which you use in a round-robin manner:
while writing to one std::ofstream, asynchronously close and reopen (for the next file) the other std::ofstream.
See: std::async() http://en.cppreference.com/w/cpp/thread/async
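
A rough sketch of that idea; the output file names and the count of ten files are made up for illustration:

#include <fstream>
#include <future>
#include <functional>
#include <string>

int main()
{
    const int nfiles = 10 ; // illustrative; the real count would be ~500
    std::ofstream streams[2] ;

    // close a stream (if open) and reopen it for the named file
    auto reopen = []( std::ofstream& f, std::string name )
    { if( f.is_open() ) f.close() ; f.open(name) ; } ;

    streams[0].open( "output_1.txt" ) ; // the first file is opened synchronously
    std::future<void> pending ;

    for( int i = 1 ; i <= nfiles ; ++i )
    {
        std::ofstream& current = streams[ (i-1) % 2 ] ;
        std::ofstream& idle    = streams[ i % 2 ] ;

        // start reopening the idle stream for the next file in the background
        if( i < nfiles )
            pending = std::async( std::launch::async, reopen, std::ref(idle),
                                  "output_" + std::to_string(i+1) + ".txt" ) ;

        current << "data for file " << i << '\n' ; // write while the reopen proceeds

        if( pending.valid() ) pending.wait() ; // the idle stream is ready for the next pass
    }
}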
> even if the implementation allows it, do not keep hundreds of files open simultaneously.

Certainly, if you can close some of the files early, then you should do it; but opening and closing a file for every single write will most likely kill your performance. If you need ongoing access to the files, then I'd say leave them open, especially if the OS allows it.

Keep in mind that hard disks are a MILLION times slower than RAM. Various caching mechanisms mitigate this, but they don't eliminate it.