Writing a large number of integers to a file

Hello,
I need to write a large number of integers to a file: an inverted index that maps each line number to the position of the integer given as that line's value.

The problem is that I can have up to several million lines to write, and the file is quite big even after compression.

Right now I am using something like this:

std::ofstream lineFile;
lineFile.open(outfname + ".map");
for (auto br : list)
    lineFile << br.line << "\n";

lineFile.close();


and then invoke plzip to compress. Is there any way I can represent this file in a more compact way?

I could use 4 bits to represent each of the 0-9 symbols plus the \n, and maybe the EOF too. Is that a good idea?

Thank you
Can you use a binary file?

Text is terrible. The value 123 fits in a single byte in binary but takes 3 bytes as text, and for larger values this grows at an alarming rate -- an 8-byte integer can take up to 20 bytes of text.

Also, consider bzip2 if you are on Unix. If you are on Windows, consider the 7-Zip command line.

Also, don't write line by line to the file. Write all the data into a big buffer, then write the buffer all at once; it will run much faster. If the buffer gets too large (it takes a very large data set, several GB, to begin to fill memory on even a modest modern PC), you can split it and flush it in chunks.

Binary is fixed width: you don't need spaces, end-of-line markers, or other junk. All of that can go away.
Yes, I was thinking about a binary file. But I need 10 symbols for the digits 0-9 and 1 for a delimiter (the \n I was talking about is just a delimiter for parsing the numbers back).

I am on Ubuntu, so yes; I will compare bzip2, plzip and gzip.

About the binary file, do you mean using BCD, i.e.:
0 = 0000,
1 = 0001,
...
9 = 1001,
delimiter = 1111?

For each line to write, I would take the integer, convert it to a string, iterate over each character, and write the corresponding code to the file. Is that OK, or is there a smarter solution?

I will look into data buffers, good tip.

Thanks
No, I mean a C++ binary file.

std::ofstream ofs;
ofs.open(filename, std::ios::binary);

...
ofs.write(buffer, num_bytes);

This will write the data "bytewise". That is, a 32-bit integer will take 4 bytes, no matter what is in it. You don't need to parse anything, and you don't need a delimiter. If you looked at the file in hex, it would look like this:

000012340000ab3f00009999 -- where the first number is hex 1234, the second is ab3f, and the third is 9999, each padded out to its full 4 bytes.

When you read it back, each 4 bytes go into a 32-bit int, and if you print it, the first one will print 4660, which is what hex 1234 is in human base-10 format. Hex 9999 is 39321 in human format, etc. The computer will do all the work for you here.

DO NOT convert to strings. Not here. C++ can do all the work for you. You literally need a contiguous container (pointer, array, vector, std::array, etc.) holding all of your integers, and then you can read and write it directly to the file without doing anything else. The whole operation should take a fraction of a second, even for millions of entries.


Floats / doubles ... everything in your machine is stored as bytes. You won't be able to easily see the value of a hex-represented double (the conversion is nontrivial), but it's still just bytes, and you can still read and write them this way. You can do this with *everything*, but if you do it with classes, you have to be careful to write the DATA hidden behind any pointers; if you just dump the class to a file without thinking, you will get the value of the POINTER, which is useless, of course. Generally, for complex objects like classes, you need to write a method that formats the data for file/network I/O in both directions.

I can help you if this isn't making sense. You should NOT have to "iterate" or "parse" anything here. If you think you do, you are not understanding it yet.

If this is all new to you, I suggest you make a tiny toy program and play with binary files for a few minutes. Cook up a struct with the data you want to write, dump 10 of them into a file, take a look at it with a hex editor, read it back in and print the data, etc. Get a feel for it on a small scale before you go big.

If bzip2 turns out to do best, remember -9. At most other settings it just runs slower to get worse compression than other tools, but at -9, for some types of data, it does far, far better -- it blows away XML and structured text data files. It's a try-and-see tool: sometimes it's better, sometimes not. It would likely supercompress a text file of numbers but would probably not do as well with the binary. /shrug

Yeah, I get it, all good. The only shady part for me is how to use the buffer in C++; I will look for examples and try.
My idea was to implement an ad-hoc encoding so I would use the minimum number of bits, though the results wouldn't justify the effort.

I will try the -9 flag, but I am mostly compressing binary files.

Thank you!
It depends on the data size, like I said. The 10-cent old-school solution would just be:

struct thing
{
    int num;
    double data; // whatever your data is..
};

thing* tp = new thing[5000000]; // 5 million enough?

for (...)
{
    tp[index].num  = something;
    tp[index].data = other;
}

file.write(reinterpret_cast<const char*>(tp), sizeof(thing) * 5000000);

and this will fit in the memory of most machines. It's not even a gigabyte yet.


If it gets too big for your needs, you just do the write every so often, say every 1 million records, and start filling the buffer again from 0.

You can do all this with a vector, of course. Whatever you like. Just be sure to reserve enough for all the records the first time, or you will get a mega slowdown when it tries to move all that data around.
Also, don't write line by line to a file. Write all the data to a big buffer, then write the buffer all at once.
iostreams are already buffered. As long as you don't flush the data with endl, you should be okay.
That is a good point...