Use less RAM for vector

I have a text file that's about 3.2 GB that I stream into a vector, and if there are too many programs running, the program crashes and throws a bad_alloc. I understand that it's because more RAM is required (when I close most of the other programs it runs fine), but is there a way to reduce the amount of RAM that is needed to load the data into the vector?
You can read a smaller part of the file instead of the whole thing, or read the file into multiple smaller vectors that you stream together.

If the trouble is actually running out of memory, you need solution 1: read less of the file and work on it in chunks (roughly as sketched at the end of this post).

If the trouble is fragmented memory (no single 3 GB block available, even though you might have 5 GB free in scattered bits all over your RAM), you can try solution 2.

Solution 3 is viable too:
attempt to allocate the memory;
if it fails, tell the user you can't open the file and to go buy more RAM, then exit the program.

Solution 4... you can get really wonky and use smaller characters. E.g. if this is a simple text file in American English, you can probably store each character in 5-6 bits, getting significant savings over 3 GB in exchange for some inefficiency (highly not recommended). 5 bits is not quite half the size.
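For solution 1, a rough sketch of chunked reading (untested; the file name, the chunk size, and the "processing" are placeholders for whatever you actually do with the data):

#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    // read the file a chunk at a time instead of all 3.2 GB at once
    std::ifstream in("data.txt", std::ios::binary);  // placeholder file name
    std::vector<char> buffer(64 * 1024 * 1024);      // 64 MB working buffer

    while (in)
    {
        in.read(buffer.data(), buffer.size());
        std::streamsize got = in.gcount();           // bytes actually read this pass
        // ... work on buffer[0] .. buffer[got - 1] here, then loop for the next chunk ...
        std::cout << "processed " << got << " bytes\n";
    }
    return 0;
}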
@jonnin, I don't think virtual memory can be fragmented by other programs like that, although the program itself could fragment its own memory.

@lumbeezi, What exactly are you doing with all that text in memory at once?
@jonnin a vector's memory is guaranteed to be contiguous, but you could still be right in that if the structure lumbeezi created has dynamic memory inside of it (e.g. every element has a std::string), then there could be memory fragmentation causing more problems. Regardless, 3.2 GB is a lot of RAM for a text file :)

lumbeezi, why don't you give us more information? You told us the size of the text file, but nothing else. Our suggestions are just guesswork.
- Give example of what's in your text file.
- How many elements are in the vector?
- What is the data type of each element?
- What do you do with the text/data?

My guess is the same as the others: you might not need to load 3.2 GB all at once. Another possibility is changing the format of the text file. For example, "60000" stored as text is 5 bytes (40 bits) in ASCII/UTF-8, but as a binary number 60000 fits in 16 bits: 2.5 times smaller.
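A rough illustration of that idea, assuming the values really do fit in 16 bits (the file name is made up):

#include <cstdint>
#include <fstream>

int main()
{
    std::uint16_t value = 60000;   // fits in 16 bits (max 65535)
    std::ofstream out("data.bin", std::ios::binary);
    // 2 bytes on disk and in memory, versus 5 bytes for the text "60000"
    out.write(reinterpret_cast<const char*>(&value), sizeof value);
    return 0;
}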
@Ganado, I'm sure jonnin knows that vectors are contiguous! My point is that with virtual memory the address space that each program sees is unaffected by other programs. But there's also the swap file, so I'm not sure why other programs would stop it from running. As far as I know, they should just slow it down by forcing it to use the disk.

I wonder why the OP didn't read the text into a string if it's char data? Not that it would help, but still. And depending on how he's allocating the space in the vector (or even a string), it could be using a lot more memory than the 3.2 GB. I'm not sure if that matters.

If he really needs to have it all in memory at once, a variation of jonnin's compression idea is to use Huffman coding based on the frequency of the letters in English (or whatever the language is). https://en.wikipedia.org/wiki/Huffman_coding
Yes, I'm sure he does. Didn't mean to insult anyone's intelligence/knowledge there.
As for compression, the problem is: wouldn't you still need to decompress it to actually use it? I could be wrong, but there's no point in my guessing anymore about OP's specifics.
Yeah, it's just a possibility. But you certainly don't need to decompress the whole thing to use it. With Huffman coding you can "decompress" a letter at a time as you go along.
Compressing the memory contents kind of defeats the purpose of loading the data into memory, as decompression will surely take longer than reading from storage, plus you have to give up on random access, to some extent. It's a lot of added complexity for little to no gain.
A more practical alternative, if it's really necessary for the entire contents to be in the address space at once, is to map the file to memory. Let the OS handle what needs to stay in memory and what doesn't.
Did not expect all the replies. I appreciate the help.

It's a csv/txt file I use for data analysis. I need access to every element in the data set, and it takes a lot of memory to load the file, so I wasn't sure if there was a more efficient way of doing it.

I have a vector<Struct> that I stream the file into. The struct has a variable for each column of the txt file and also extra elements that are filled in when some functions run calculations on the data.

Here's an excerpt:
#include <iostream>
#include <vector>
#include <algorithm>
#include <fstream>
#include <cmath>
#include <numeric>
#include <string>
#include <iomanip>

using namespace std;

class DataSet
{
public:
    struct Data
    {
        double date, time, mil, first, second, dot, amount;
        unsigned int uW, dW;
        unsigned int sell, buy;
    };
};

int main()
{
    return 0;
}



I have about 72,000,000 elements in the vector once I stream it all in. I actually have to run the program on Ubuntu because Windows throws a bad_alloc every single time.

Here's a few lines from the file:

20171031 060000 0020000;25411;25;1132.75;2
20171031 060000 3360000;1023.3;259.5;569.7;3
20171031 060000 3360000;69.75;26.95;29.7;1
20171031 060000 8050000;241;2699.2;1469.15;7
An efficient way to do this is to use a "memory map" (in Win32 parlance, sometimes known as a "file map"). In effect, you point to the file and say you want to read it as if it were all loaded in memory. The memory-mapping function, behind the scenes, takes care of loading pieces of it as necessary. To you, it simply looks like it's all there, ready to be used; the function and the OS take care of actually loading it behind the scenes, with some clever caching as needed. Done right, it can be really fast; fast and simple! Perfect!

This would mean that you'd have to think differently about how to actually process the data; creating 72 million instances of your class up front is the problem you're now trying to avoid.
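Since the program is being run on Ubuntu anyway, a minimal sketch of that approach with POSIX mmap might look roughly like this; the file name and the line-counting loop are just placeholders for whatever processing is actually needed, and error handling is kept to a bare minimum:

#include <fcntl.h>      // open
#include <sys/mman.h>   // mmap, munmap
#include <sys/stat.h>   // fstat
#include <unistd.h>     // close
#include <cstdio>

int main()
{
    int fd = open("data.txt", O_RDONLY);   // placeholder file name
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    // map the whole file read-only; the OS pages pieces in and out as needed
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    const char* data = static_cast<const char*>(p);

    // the file's bytes are now addressable as data[0] .. data[st.st_size - 1]
    long lines = 0;
    for (off_t i = 0; i < st.st_size; ++i)
        if (data[i] == '\n') ++lines;
    std::printf("%ld lines\n", lines);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}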
struct Data
{
double date, time, mil, first, second, dot, amount;
unsigned int uW, dW;
unsigned int sell, buy;
};

I'll sort of repeat what I said earlier too...
Do all of those NEED to be doubles, or will a float do for any of them?
Do all of those NEED to be int, or will a short or char do?
Is the struct alignment wasting a ton of space that can be cleaned up with compiler settings?
(A slimmed-down sketch follows at the end of this post.)

72 million of those is less than impressive.
Let's see... 7 doubles at 8 bytes plus 4 unsigned ints at 4 bytes is 72 bytes a pop.
Call it a bit over 5 billion bytes, roughly 5 GB of memory, that does not have to be in a single block if it won't fit (it would be ideal, but it's not required). This isn't even 1/2 the memory on even an outdated/poor machine of 16 GB. It's not even interesting to a typical 64 GB machine, which is the current standard.

I'll ask again... how much memory do you have? I am 100% sure I could allocate a vector of those on my home machine and then fire up 3 games on top of it.
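To make that concrete, a hypothetical slimmed-down layout (only lumbeezi can judge whether the float precision and the smaller integer ranges are acceptable; the field choices below are guesses):

#include <cstdint>
#include <iostream>

// hypothetical slimmed-down layout -- only valid if the ranges and precision fit the data
struct DataSmall
{
    std::uint32_t date;                  // e.g. 20171031 fits comfortably in 32 bits
    std::uint32_t time;                  // e.g. 060000
    std::uint32_t mil;
    float first, second, dot, amount;    // if ~7 significant digits are enough
    std::uint16_t uW, dW;
    std::uint16_t sell, buy;
};

int main()
{
    // roughly 36 bytes here versus 72 for the all-double version, so
    // ~72 million elements drop from about 5 GB to around 2.6 GB
    std::cout << sizeof(DataSmall) << '\n';
    return 0;
}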

I have a vector<Struct> that I stream the file into. The struct has a variable for each column of the txt file and also extra elements that are filled in when some functions run calculations on the data.

Depending on what you're actually trying to accomplish, it may not be necessary to load the entire file into memory at one time. My reading of the above quote gives me the impression that you may be able to load a single record into a single structure instance, call your calculation function, save just the results of the calculation, and repeat until all the results are calculated.
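Something along these lines, perhaps; the field names and the "calculation" are made up, since we don't know what is actually computed:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::ifstream in("data.csv");          // placeholder file name
    std::string line;
    double total = 0;                      // stand-in for whatever result is actually needed
    long rows = 0;

    while (std::getline(in, line))         // only one record lives in memory at a time
    {
        std::istringstream fields(line);
        std::string timestamp, a, b, c, count;
        std::getline(fields, timestamp, ';');   // e.g. "20171031 060000 0020000"
        std::getline(fields, a, ';');
        std::getline(fields, b, ';');
        std::getline(fields, c, ';');
        std::getline(fields, count, ';');

        total += std::stod(c);             // do the per-record work, keep only the result
        ++rows;
    }
    std::cout << rows << " rows, total = " << total << '\n';
    return 0;
}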


It's not even interesting to a typical 64 GB machine, which is the current standard.

64 GB is NOT the standard for the average layman. I only know an artist or two that use 64 GB (Photoshop and other video editing software take up lots of memory). I only recently got a 16 GB desktop. Most laptops have 8 GB.

Judging by the small snippet of data, it looks like you at least partially have repeated information (the first two columns). If that data is bigger than a pointer (64 bits in a 64-bit program), then it could save you some space to store a pointer to one allocated instance of that data instead of copying it each time. But this won't save you orders of magnitude, by the looks of it.
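One way to do that kind of sharing is to keep a table of the unique values and store a small index into it per record instead of a copy (or a raw pointer); everything in this sketch is made up:

#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

std::vector<std::string> unique_timestamps;                      // one copy of each distinct value
std::unordered_map<std::string, std::uint32_t> timestamp_index;  // value -> position in the table

std::uint32_t intern(const std::string& ts)
{
    auto it = timestamp_index.find(ts);
    if (it != timestamp_index.end())
        return it->second;                   // already seen: reuse the 4-byte index
    std::uint32_t idx = static_cast<std::uint32_t>(unique_timestamps.size());
    unique_timestamps.push_back(ts);
    timestamp_index[ts] = idx;
    return idx;
}

int main()
{
    std::uint32_t a = intern("20171031 060000 0020000");
    std::uint32_t b = intern("20171031 060000 0020000");   // same string, same index
    std::cout << a << ' ' << b << ' ' << unique_timestamps[a] << '\n';
    return 0;
}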

I have about 72,000,000 elements in the vector once I stream it all in

I'm repeating what jlb said (whom I agree with), but do you need to stream it all in? Or can you do whatever calculations are necessary with just one piece of data at a time? Or write the results of the calculations to another file instead of putting it all in RAM?
I guess I am used to hardcore gamers; most have 32 or 64 GB. My mistake.


16 GiB of RAM is currently the standard for gaming PCs. Hardly outdated. 64 GiB is excessive for the vast majority of people.
Aye. Even the top Coffee Lake desktop CPU (8086K) has a memory max of 64 GiB (which probably means 4*16 with two modules per channel). However, most vendors and users are likely to settle for two modules (i.e. one per channel) and merely ponder whether it is necessary to spend money on 2*8, when they only recently advanced from 2*2 to 2*4.

Scalable Xeon Platinums can manage (theoretically) 1.5 TiB of RAM, but do the boards that can host 8 of those beasts actually have room for 12 TiB of RAM modules?

The traditional way to cope with insufficient physical RAM has been swap. Ain't it sweet when a 96 GiB machine starts to swap in order to load all the data into "memory"?
Hmm, just my (year old or so?) video card has
Display Memory: 24498 MB (1070 series).

Guess I am a little out of touch.

Still wonder what the OP has in his box though.... that would still be nice to know...

Well, a video card is a tangential subject, but wow, I've never heard of a GTX 1070 having ~24 GB; most have closer to 8192 MB (8 GB).
https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-1070/specifications
I'm pretty sure only Quadros and Teslas have memory sizes in those ranges. According to Wikipedia the consumer-grade GPU with the most VRAM they make is the TITAN X, with 12 GiB. In some cases you can find GPUs with unofficial memory configurations in certain markets (e.g. a GTX 1060 6 GB with 5 GB of VRAM: https://www.youtube.com/watch?v=PUIxyLRnzDc )

What you're looking at is almost certainly the memory that's being shared between the host and the GPU.
Probably 8 GB dedicated VRAM + half of host RAM as "shared".

@lumbeezl:
Your data sample seems to contain
timestamp ; float ; float ; float ; integer

About 5 values. Yet, your struct holds 11 values. Surely the timestamp is not 7 values by itself?
Besides, it is not exactly clear which of the 11 are from input and which are computed.

What are the operations that require all data to remain in memory?