Create file to access in program

In the program I've written, I stream data into a vector<struct> from files up to 3 GB in size. Each time I run the program it has to read the file and stream the data all over again. I'm still learning C++, but could I create a library, maybe, to store this data as a vector<struct> and have the program use that file?

Not sure what steps to take.

Thanks
you can store it as a file (prefer binary), and read that, and load up your vector of structs very efficiently.

you can make it static data in a library, but I recommend against this unless you have an exotic reason to do so. This makes changing the data much harder, and is less reusable, and so on.

as a new coder, you need to understand the pointer and file problem.
do you understand pointers yet?

if you do, the thing is: if you write a struct that contains a pointer to a file, you just write the POINTER to the file, not the DATA IT POINTS TO. And this can be hidden by your own classes or by the STL containers: writing a std::string to a file does NOT write the string's text, it writes the string OBJECT, which has a pointer but no DATA in it. You have to handle this in your own code. One way is to use C-style storage in your struct, so you can read and write the file very fast, and promote the items inside it up to C++ containers when you need to (on demand, rather than all at once, perhaps). Another way is to write a more complicated read and write that promotes as it reads.
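
A minimal sketch of the difference (the struct names and field sizes here are just illustrations):

#include <string>

// BROKEN for block I/O: writing this struct to a file writes the
// string OBJECT (its internal pointer and bookkeeping), not the
// characters it points to.
struct BadRecord
{
    std::string name;   // text lives elsewhere on the heap
    double value;
};

// SAFE for block I/O: all the data lives inside the struct itself,
// so one raw write captures everything.
struct GoodRecord
{
    char name[32];      // fixed-size, in-place storage
    double value;
};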

I understand pointers well enough to do what you're saying, at least. I hadn't actually thought of that. Just a couple of questions, if you don't mind.

How would I store this binary data inside the file, and how would I get the pointer to point to the correct data inside the binary file?
if your struct has no pointers,
you can just write it to the file directly, no fuss. in fwrite-style pseudocode:
fwrite(&the_struct, sizeof(the_struct), number_of_structs, file);
for a vector of them that is
fwrite(&vec[0], sizeof(vec[0]), vec.size(), file);

if it has pointers, … let me see what you have and I will get you started with it.

I HIGHLY recommend you do a small play-program to learn what this is all about, and work it into a bigger program after learning it.

personally, I go with C-data structs that can just be block read and written, and dodge the issue.
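
A minimal sketch of that round trip with std::fstream (the struct and file name are stand-ins, and this assumes the struct is trivially copyable, no pointers):

#include <cstddef>
#include <fstream>
#include <vector>

struct Data { unsigned int date; double recent; };  // stand-in for the real struct

// write the whole vector as one raw block
void save(const std::vector<Data>& v, const char* name)
{
    std::ofstream out(name, std::ios::binary);
    out.write(reinterpret_cast<const char*>(v.data()), v.size() * sizeof(Data));
}

// read it back: size the vector from the file length, then one raw read
std::vector<Data> load(const char* name)
{
    std::ifstream in(name, std::ios::binary | std::ios::ate);  // open at end
    const std::size_t n = std::size_t(in.tellg()) / sizeof(Data);
    std::vector<Data> v(n);
    in.seekg(0);
    in.read(reinterpret_cast<char*>(v.data()), n * sizeof(Data));
    return v;
}
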
Simple struct with a few members...

struct Data
{
     unsigned int date, time, mil;
     double recent, back, front;
     unsigned int totals;
     unsigned short green, red, hour, min, seconds;
}


I have no idea where to start to get this into a file that my program can analyze directly, without streaming into the vector. If you could point me in the right direction, I could figure everything else out.
@lumbeezl,

Personally, I'd suggest you look into boost::iostreams or boost::interprocess for their two different ways of handling memory mapped files. Either is portable across operating systems.

Basically the idea is this:

You open the file, map it, and you get a pointer back. That pointer is attached to virtual memory. The operating system treats the file as virtual memory, so it could be said the file behaves as if it is already in RAM.

You don't read from the file. You operate upon the file as if it were one large block of RAM, 3 GB in size. If you know pointers, and your file is just an array of these Data structs, you'd basically treat the file itself as that array.

It's also faster than reading data from a file. Much faster.

For one, you have no read step to complete before the data is available.
Second, the OS handles virtual memory extremely efficiently, so you can actually pull data from the disk faster than with any of the streaming or other I/O methods.

If your data is larger than available RAM, there are "views" available in the two libraries, where you map a portion of the data (say, the first 1 GByte), then map a view of the 2nd GByte, then the 3rd (and so on).
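
A minimal sketch of mapping one such view with boost::iostreams (the file name is hypothetical; note the offset must be a multiple of the mapping alignment, i.e. the page granularity):

#include <boost/iostreams/device/mapped_file.hpp>
#include <cstddef>
#include <cstdint>
#include <iostream>

int main()
{
    namespace io = boost::iostreams;

    io::mapped_file_params params;
    params.path   = "data.bin";             // hypothetical 3 GByte file
    params.offset = std::int64_t(1) << 30;  // start of the 2nd GByte
    params.length = std::size_t(1) << 30;   // view covers 1 GByte
    io::mapped_file_source view(params);    // read-only view

    if (view.is_open())
        std::cout << "mapped " << view.size() << " bytes\n";
}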

I've used this technique for years in various forms (though the boost library is not as old as my work with this concept; I did it long ago using "native" OS functions in Windows, UNIX and Linux).

I've used both boost methods (I preferred the one from interprocess, but both have merit).

Be aware, this is a bit of a "raw" method. It invokes a powerful feature of the operating system, and isn't absolutely perfect, but you may be convinced that ALL file I/O should work this way if you get accustomed to it.

Quick case in point. I had a thread somewhere that posed this option as a replacement for streaming in strings from a 40 MByte file (maybe larger). No one was interested, arguing it was the iostreams, not the file handling. I agree, but with a caveat: when combined, the results are inarguable. I never finished posting to that thread, but I did conduct the experiment.

I fashioned a class to represent the file in VM using boost. I created a 20 GByte file of data built from concatenated source code (lots of source code).

I then timed the read using the simple "while( infile >> s )", where infile was a stream opened on that file and s is a string, which I discarded, because I just wanted to time the reading of the file.

I performed the same work with my class based on memory mapped files, and again a simple loop like that one, but reading from the file into a string, breaking the read on linefeed/carriage return.

The result is basically this. On a typical Haswell PC in Windows, the fastest the standard stream could read strings amounted to about 150 MBytes per second, no matter whether the drive was an SSD (able to sustain > 400 MBytes per second) or an HD (maxed out at around 180). In all cases, 100% usage of a CPU core limited the read speed to 150 MBytes per second.

The memory mapped file version pulled that in at 900 Mbytes per second.

You might notice, if you're thinking, what disk runs at 900 MBytes per second? Few do. However, when you run timings, you run the task several times. I have 64 GBytes of RAM, so the 20 GByte file fit into the OS file cache. I toss out the timing of the first run, and time only the subsequent runs on average.

What that means is that a memory mapped file used to load data is more efficient. If you have a drive that can sustain > 300 MBytes/sec on modern hardware, you might never reach that speed using standard streams.

Some of the other, older file operations are faster than streams (fread, for example), but they still aren't as fast as mapped files. fread pulled in the data (without proper separation into strings) at no more than 600 MBytes/sec on the same hardware. The memory mapped approach pulled that in at 900 MBytes/sec, properly separated into strings.




@jonnin

I'll definitely practice that and thanks for the tips. I'm just not sure how the program will be able to access the different elements (and their members) after the file is written.

@Niccolo

I'd love to be able to map it because the speed is awesome. I don't know anything about boost libraries or memory mapping, so I'll have to do some studying. The idea seems simple enough, though. I have the same issue with streaming data; it just takes a lot of time. Thanks for the tips.
You might know a little more about boost than you realize.

Ever heard of shared_ptr?

That started independently, made its way into boost, was "the" shared_ptr for a while, then from boost became part of C++11.

Tried std::filesystem?

That was in boost before it became part of the C++ standard library.

Several came from boost.

You might need a little help getting up and going, but it isn't really all that tough - except for the first time you've ever used it.

Much of boost is written as header-only libraries, which is the simple part.

Part of boost must be built. They have their own command line oriented approach, but it depends on your platform and compiler(s) (so...what do you use?).

If you decide to experiment with it, post back here...I'll dig up the more recent work I've done with memory mapped files for some pointers (....now that's funny).

I'll definitely practice that and thanks for the tips. I'm just not sure how the program will be able to access the different elements (and their members) after the file is written.

you read the (entire?) file every time you start the program. Reading it will put the data right back into the format you are used to. Alternately, you can read any single record (one struct) from the file if you do not need the whole thing. Or you can read parts of it at a time if you don't want the program to use 4 GB of memory, and want to, say, limit it to half a GB at a time. Unclear what you need.
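
A minimal sketch of that single-record read, seeking straight to record i (struct and file name are stand-ins; assumes the file is a packed array of trivially copyable structs):

#include <cstddef>
#include <fstream>

struct Data { unsigned int date; double recent; };   // stand-in

// pull in only the i-th struct, without reading the rest of the file
Data read_record(const char* name, std::size_t i)
{
    std::ifstream in(name, std::ios::binary);
    in.seekg(std::streamoff(i * sizeof(Data)));      // jump to record i
    Data d{};
    in.read(reinterpret_cast<char*>(&d), sizeof d);  // read that one record
    return d;
}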

memory mapped is better, if you want to make that large a leap all at once.
There used to be pre-built boost packages out there as well; I haven't kept up. That may be Windows-only, but it's easier to build on unix anyway.
I do need the entire file for the analysis and I do have it read the entire file when I start the program. I was just thinking that I could store the data (permanently possibly) in a file and then have the program access it as needed without streaming it into the vector every time it runs.

From the little I've read so far, it seems that memory mapping will allow me to store the data, but I'll still need to be able to manipulate it once it's mapped; not sure if that's possible.

Just trying to figure out how that data would be stored and how the program would identify which data are in their respective elements of the vector and struct members.

If this is confusing, I could give some examples of what I'd need it to do if there's any more advice you could give.
Some examples would be helpful, yes.

As to memory mapped files writing (storing) data, yes, it does that. There are some caveats. Technically, the file has to be at least the same size as the data mapped. Put another way, it takes some effort to start with an empty file and expand it (more than it does to just write to a stream, but certainly something that can be done).

Once the file is large enough to store all of the data, however, memory mapping works to read from and write to the file automatically. Whatever is written to the RAM mapped to the file ends up being written to the file. It's as if you could connect a large array (as an example) to a file directly, without having to perform reading and writing using a file object or handle.

Let's say we're talking about this:

struct Data
{
     unsigned int date, time, mil;
     double recent, back, front;
     unsigned int totals;
     unsigned short green, red, hour, min, seconds;
}


Note, it's missing a semicolon (a direct copy of your post).

I assume that since you've discussed loading this into a vector, what you have is basically this struct repeated (obviously with various data assigned) millions of times (I'm guessing about 45 million if these are 64 bit integers, perhaps 55 million or so if they're 32 bit integers - depends on the compiler even targeting a 64 bit application).

So, a file would be set to 3 Gbytes in size (accommodating the 40-60 million structs - it's a simple command), which would then be mapped as 3 Gbytes (assuming you have that RAM on the target machine and we're building a 64 bit application).
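
To pin the count down rather than guess, a quick throwaway check (with 4-byte ints and 8-byte doubles on a typical 64 bit build, sizeof(Data) pads out to 56 bytes, which puts 3 GBytes at roughly 57 million structs):

#include <iostream>

struct Data
{
     unsigned int date, time, mil;
     double recent, back, front;
     unsigned int totals;
     unsigned short green, red, hour, min, seconds;
};

int main()
{
    std::cout << "sizeof(Data) = " << sizeof(Data) << '\n'   // 56 here
              << "structs in 3 GBytes: " << (3ull << 30) / sizeof(Data) << '\n';
}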

At that point, you would treat the address of the memory as a container, perhaps an array (it's nasty to use a std::vector in this situation). You might want the first 64 bit integer to contain the total number of entries (you could then re-use this on subsequent executions as a persistent container). There could be other "header" data as well if you need.

Then, just after that "header" data, you establish a pointer which is declared as a Data *, and use it as one large array of 40-60 million structures. Read/write to them at will, they'll be written to disk, read from disk as required.

That would be it. You need do nothing else. Well...you have to be sure you don't run past the end of the file, like any container, but when your application closes and this map object is closed, the file has the data you wrote as if it were one large array (and, that header I assume you'd use).

You might finish thinking it's a kind of magic. Files without writing to a file, just writing to a large RAM buffer cast as an array pointer.
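
A minimal sketch of that header-plus-array layout with boost::iostreams (file name and sizes are placeholders; new_file_size creates the file at the requested size if it doesn't exist yet):

#include <boost/iostreams/device/mapped_file.hpp>
#include <cstdint>
#include <iostream>

struct Data
{
     unsigned int date, time, mil;
     double recent, back, front;
     unsigned int totals;
     unsigned short green, red, hour, min, seconds;
};

int main()
{
    namespace io = boost::iostreams;

    io::mapped_file_params params;
    params.path          = "data.bin";            // hypothetical file
    params.new_file_size = 1u << 20;              // 1 MByte is plenty for this demo
    params.flags         = io::mapped_file::readwrite;
    io::mapped_file fmap(params);                 // read/write mapping

    // first 64 bits act as the "header": the element count
    std::uint64_t* count = reinterpret_cast<std::uint64_t*>(fmap.data());
    // just past the header, one large array of Data (still 8-byte aligned)
    Data* array = reinterpret_cast<Data*>(fmap.data() + sizeof *count);

    *count = 3;                                   // writes go straight through to the file
    for (std::uint64_t i = 0; i < *count; ++i)
        array[i].totals = static_cast<unsigned int>(i);

    std::cout << *count << " records written through the map\n";
}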

Something along these lines, perhaps (uses boost):

#include <iostream>
#include <type_traits>
#include <fstream>
#include <boost/iostreams/device/mapped_file.hpp>
#include <cassert>

struct data
{
     unsigned int date, time, mil;
     double recent, back, front;
     unsigned int totals;
     unsigned short green, red, hour, min, seconds;

     friend std::ostream& operator << ( std::ostream& stm, const data& d )
     {
         return stm << "data{" << d.date << ',' << d.time << ", ... "
                    << d.recent << ',' << d.back << ", ... "
                    << d.totals << ", ... " << d.red << ", ... }" ;
     }
};

static_assert( std::is_trivially_copyable<data>::value ) ;

bool write( std::ofstream& file, const data& d )
{
    static constexpr auto NBYTES = sizeof(data) ;

    const auto ptr_bytes = reinterpret_cast< const char* >( std::addressof(d) );
    return bool( file.write( ptr_bytes, NBYTES ) ) ;
}

void create_test_file( const char* file_name, const unsigned short n )
{
    std::ofstream file( file_name, std::ios::binary ) ;
    for( unsigned short s = 0 ; s < n ; ++s )
    {
        const double d = s + double(s) / 5 + double(s) / 50 ;
        const data dat{ s, (unsigned short)(s*3), s, d, d*2+5.5, d, s, s, s, s, s, s } ;
        std::cout << s << ". writing " << dat << '\n' ;
        write( file, dat ) ;
    }
}

int main()
{
    const char* const file_name = "data.bin" ;
    const unsigned short N = 10 ;

    std::cout << "creating a test file\n" ;
    create_test_file( file_name, N ) ;

    // https://www.boost.org/doc/libs/1_50_0/libs/iostreams/doc/classes/mapped_file.html#mapped_file_source
    boost::iostreams::mapped_file_source fmap(file_name) ; // a read only map
    if( fmap.is_open() )
    {
        // in practice, this assertion would be superfluous
        assert( std::size_t( fmap.alignment() ) >= std::alignment_of<data>::value ) ;
        std::cout << "\ntest file is memory mapped\n" ;

        // treat the mapped memory as an array of N objects of type data
        const data* array = reinterpret_cast<const data*>( fmap.data() ) ;

        std::cout << "\naccessing elements in mapped memory\n" ;
        for( unsigned short i = 0 ; i < N ; ++i )
            std::cout << i << ".  access " << array[i] << " at address " << array+i << '\n' ;
    }
}

http://coliru.stacked-crooked.com/a/10d272e1ff147060
@JLBorges has a good take on this above.

One thing I'd suggest when extending the size of the file, to save time: try ::ftruncate (Linux) or boost::interprocess::winapi::set_end_of_file. Both, on their respective operating systems, expand the file without spending the time writing to it. It works just as well (the new memory reads back zero-filled), but because nothing is actually written it happens in a snap.
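
For the Linux side, a minimal sketch of pre-sizing a file that way (file name and size are placeholders):

#include <fcntl.h>     // ::open
#include <unistd.h>    // ::ftruncate, ::close

int main()
{
    const off_t size = off_t(3) << 30;                 // grow to 3 GBytes
    const int fd = ::open("data.bin", O_RDWR | O_CREAT, 0644);
    if (fd != -1)
    {
        ::ftruncate(fd, size);  // extends instantly; the new bytes read as zero
        ::close(fd);
    }
}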