Data type for multiple imported files

Hello everyone,

I am interested in importing multiple (relatively large) files. The question is what type of structure should I use to store them.
They all contain numbers of type double, millions of them. I will need to manipulate them, so I was thinking of using a map of vector<double> (pairing each filename with a vector that contains that file's numbers). Since they are huge, do you think putting them all in one vector is better, and then manipulating them with iterators? Or should I use something other than a vector, like a deque?

Typically, I need to import 5-15 files.
Any advice is appreciated.
Thanks.
A few key questions, which you didn't address, are:
1) How are the numbers organized?
2) How are you going to access all these numbers? Sequentially, randomly? Do you need to search them?
3) Are the numbers homogeneous? I.e., can they be intermixed? If not, putting them all in one vector makes no sense.
Ok, sorry.

1) They make up one column of a file. Importing them is not a problem, just storing them.
4.279151
4.282768
4.259093
4.242323
4.242323
4.262381
4.273561

2) I will access them in the order they appear in a file, so it's really important that their order doesn't change. I will do mathematical operations on them, but store the results separately. I do not need to search them.

3) They must not be mixed. The idea is to put them in one vector (as one big file) but access the individual files using iterators or pointers (not sure if that's easy).

Thanks.
> I will need to manipulate them, so I was thinking to use a map of vector<double>
> (pairing the filename with a vector that contains the file).

Yes. You may want to call shrink_to_fit() on the vector after a file is fully read.
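For instance, a minimal sketch (the function name and file format are hypothetical): read a one-column file into a vector, then ask the vector to release the excess capacity left over from its geometric growth.

```cpp
#include <fstream>
#include <string>
#include <vector>

// hypothetical helper: read every double in a one-column file into a vector
std::vector<double> read_file( const std::string& path )
{
    std::vector<double> values ;
    std::ifstream file(path) ;
    double v ;
    while( file >> v ) values.push_back(v) ;
    values.shrink_to_fit() ; // non-binding request to trim capacity down to size
    return values ;
}
```

Note that shrink_to_fit() is a non-binding request; an implementation is allowed to ignore it.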


> Since they are huge, do you think putting them all in one vector is better, and then manipulate with iterators?

Manipulate with positions; iterators get invalidated when the vector has to grow by relocating the sequence.
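A small sketch of the difference (the function is hypothetical): an index into the vector stays valid after the vector grows, whereas an iterator captured before the growth may be dangling.

```cpp
#include <cstddef>
#include <vector>

// grow the vector, then access the original element by position;
// an iterator captured before the push_backs could have been invalidated
// by a reallocation, but a position (index) remains valid
double element_after_growth( std::vector<double> seq, std::size_t pos, std::size_t extra )
{
    for( std::size_t i = 0 ; i < extra ; ++i ) seq.push_back(0) ; // may relocate the sequence
    return seq[pos] ;
}
```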

No. AFAIK there wouldn't be a perceptible improvement in performance from better locality of reference, as the vector would be really huge either way; the increase in code complexity would not be matched by a concomitant performance benefit.

Ideally, measure the performance with prototypes for each of the design options.
I still haven't made my decision...
Should I use vector of vectors, or map of vectors?
Files are big, several will be imported and there will be lots of for loops.
Is there any difference in performance?
> Should I use vector of vectors, or map of vectors? Is there any difference in performance?

AFAIK, it should not make any measurable difference in performance.

Measure it. For instance, for computing the (compensated) sum of numbers in the file:
#include <iostream>
#include <type_traits>
#include <vector>
#include <random>
#include <ctime>
#include <map>
#include <string>
#include <iomanip>

template < typename ITERATOR > double compensated_sum( ITERATOR begin, ITERATOR end )
{
    static_assert( std::is_floating_point< decltype(+*begin) >::value, "expected floating point type" ) ;
    // http://en.wikipedia.org/wiki/Kahan_summation_algorithm
    double sum  = 0 ;
    double compensation = 0 ;
    for( ; begin != end ; ++begin )
    {
        const double term = *begin - compensation ;
        const double temp = sum + term ;
        compensation = ( temp - sum ) - term ;
        sum = temp ;
    }

    return sum ;
}

std::vector<double> make_vector( std::size_t n )
{
    static std::mt19937 rng( std::time(nullptr) ) ;
    static std::uniform_real_distribution<double> distribution( 0, 1000 ) ;

    std::vector<double> result ;
    while( result.size() < n ) result.push_back( distribution(rng) ) ;
    return result ;
}

int main( int argc, char* argv[] )
{
    // note: will throw (and abort) on invalid command line arguments  
    const std::size_t nfiles = argc > 1 ? std::stoull(argv[1]) : 8 ;
    const std::size_t numbers_per_file = argc > 2 ? std::stoull(argv[2]) : 4000000 ;
    std::cout << nfiles << " files, " << numbers_per_file << " numbers per file\n\n"  << std::fixed << std::setprecision(2) ;

    {
        std::cout << "one vector per file (map file name to vector):\n-------------------------\n" ;
        std::map< std::string, std::vector<double> > data ;
        for( std::size_t i = 0 ; i < nfiles ; ++i )
            data[ "file #" + std::to_string(i) ] = make_vector(numbers_per_file) ;

        const auto start = std::clock() ;
        std::vector< std::pair< std::string, double> > result ;
        for( const auto& pair : data ) result.emplace_back( pair.first, compensated_sum( pair.second.begin(), pair.second.end() ) ) ;
        const auto end = std::clock() ;

        for( const auto& pair : result ) std::cout << pair.first << ": " << pair.second << '\n' ;
        std::cout << (end-start) * 1000.0 / CLOCKS_PER_SEC << " milliseconds\n\n" ;
    }


    {
        std::cout << "a single big vector (map file name to offsets):\n-------------------------\n" ;
        std::vector<double> data ;
        std::map< std::string, std::pair<std::size_t,std::size_t> > offsets ;
        for( std::size_t i = 0 ; i < nfiles ; ++i )
        {
            const auto file_data = make_vector(numbers_per_file) ;
            offsets[ "file #" + std::to_string(i) ] = { data.size(), data.size() + file_data.size() } ;
            data.insert( data.end(), file_data.begin(), file_data.end() ) ;
        }

        const auto start = std::clock() ;
        std::vector< std::pair< std::string, double> > result ;
        for( const auto& pair : offsets )
            result.emplace_back( pair.first, compensated_sum( data.begin()+pair.second.first, data.begin()+pair.second.second ) ) ;
        const auto end = std::clock() ;

        for( const auto& pair : result ) std::cout << pair.first << ": " << pair.second << '\n' ;
        std::cout << (end-start) * 1000.0 / CLOCKS_PER_SEC << " milliseconds\n\n" ;
    }

    {
        std::cout << "one vector per file (vector of vectors):\n-------------------------\n" ;
        std::vector< std::vector<double> > data ;
        for( std::size_t i = 0 ; i < nfiles ; ++i ) data.push_back( make_vector(numbers_per_file) ) ;

        const auto start = std::clock() ;
        std::vector<double> result ;
        for( const auto& vec : data ) result.push_back( compensated_sum( vec.begin(), vec.end() ) ) ;
        const auto end = std::clock() ;

        for( std::size_t i = 0 ; i < result.size() ; ++i )
            std::cout << "file #" << i << ": " << result[i] << '\n' ;
        std::cout << (end-start) * 1000.0 / CLOCKS_PER_SEC << " milliseconds\n\n" ;
    }
}

12 files, 2000000 numbers per file

one vector per file (map file name to vector):
-------------------------
...
210.00 milliseconds

a single big vector (map file name to offsets):
-------------------------
...
200.00 milliseconds

one vector per file (vector of vectors):
-------------------------
...
230.00 milliseconds

http://coliru.stacked-crooked.com/a/52ddd230aee1a8b8


> Files are big, several will be imported and there will be lots of for loops.

The way these loops are written could have a critical influence on performance.


If the data sets from each file can be processed independently, running the computations concurrently would reduce the elapsed wall-clock time (at the cost of some increase in the use of processor time).

For instance:
#include <iostream>
#include <type_traits>
#include <vector>
#include <random>
#include <ctime>
#include <map>
#include <string>
#include <iomanip>
#include <future>
#include <chrono>
#include <thread>

template < typename ITERATOR > double compensated_sum( ITERATOR begin, ITERATOR end )
{
    static_assert( std::is_floating_point< decltype(+*begin) >::value, "expected floating point type" ) ;

    // http://en.wikipedia.org/wiki/Kahan_summation_algorithm
    double sum  = 0 ;
    double compensation = 0 ;
    for( ; begin != end ; ++begin )
    {
        const double term = *begin - compensation ;
        const double temp = sum + term ;
        compensation = ( temp - sum ) - term ;
        sum = temp ;
    }

    return sum ;
}

std::vector<double> make_vector( std::size_t n )
{
    static std::mt19937 rng( std::time(nullptr) ) ;
    static std::uniform_real_distribution<double> distribution( 0, 1000 ) ;

    std::vector<double> result ;
    while( result.size() < n ) result.push_back( distribution(rng) ) ;
    return result ;
}

int main( int argc, char* argv[] )
{
    const std::size_t nfiles = argc > 1 ? std::stoull(argv[1]) : 8 ;
    const std::size_t numbers_per_file = argc > 2 ? std::stoull(argv[2]) : 1000000 ;
    std::cout << nfiles << " files, " << numbers_per_file << " numbers per file\n\n" << std::fixed << std::setprecision(2) ;

    std::cout << "one vector per file (map file name to vector):\n-------------------------\n" ;
    std::map< std::string, std::vector<double> > data ;
    for( std::size_t i = 0 ; i < nfiles ; ++i )
        data[ "file #" + std::to_string(i) ] = make_vector(numbers_per_file) ;

    const auto start_wc = std::chrono::steady_clock::now() ;
    const auto start_p = std::clock() ;

    std::vector< std::pair< std::string, std::future<double> > > result ;
    const auto fn = compensated_sum<std::vector<double>::const_iterator> ;
    for( const auto& pair : data )
        result.emplace_back( pair.first, std::async( std::launch::async, fn, pair.second.begin(), pair.second.end() ) ) ;
    for( auto& pair : result ) pair.second.wait() ;

    const auto end_p = std::clock() ;
    const auto end_wc = std::chrono::steady_clock::now() ;

    for( auto& pair : result ) std::cout << pair.first << ": " << pair.second.get() << '\n' ;
    std::cout << (end_p-start_p) * 1000.0 / CLOCKS_PER_SEC << " milliseconds (processor)\n" ;
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end_wc-start_wc).count() << " milliseconds (elapsed)\n" ;
}

6 files, 8000000 numbers per file

one vector per file (map file name to vector):
-------------------------
...
370.00 milliseconds (processor)
137 milliseconds (elapsed)

http://coliru.stacked-crooked.com/a/bd91b798f1aa9166
Many thanks JLBorges, it is very useful.
I think I will create a special class for the data type with overloaded operators (for now it will be a map pairing a string with a vector<double>). Then, later on, I may create another class with a different approach and compare them.
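As a starting point, such a class could be a thin wrapper over the map. This is only a hypothetical sketch (the class name, members, and the choice of operator are illustrative, not a prescription):

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// hypothetical wrapper over std::map< std::string, std::vector<double> >
class data_set
{
    std::map< std::string, std::vector<double> > files ;

public:
    void add( const std::string& name, std::vector<double> values )
    { files[name] = std::move(values) ; }

    // access one file's numbers by name (throws std::out_of_range if absent)
    const std::vector<double>& operator[] ( const std::string& name ) const
    { return files.at(name) ; }

    // example overloaded operator: scale every number in every file in place
    data_set& operator*= ( double factor )
    {
        for( auto& pair : files ) for( double& v : pair.second ) v *= factor ;
        return *this ;
    }
};
```

Keeping the map private this way would also let a later version swap in the single-big-vector representation without changing the calling code.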
Thanks once again.