Storing and reading vast sums of numbers..

Hello there, I have been doing some exploring over ways of dealing with large sums of information, hopefully leading to a fun project for myself, but at the minute I'm just concerned over some basic stuff that maybe someone could advise on:

I'll represent the issue using terms familiar to anyone who has ever played a game called Minecraft. In that, you have a world made of blocks. Millions of them. I'm hoping (one day) to build some sort of voxel-based-terrain (lets call it a very far off day then..) program, purely for my own amusement, but I have to come up with a way of storing the vast amounts of data that could be generated effectively.

So I have a certain maximum area of terrain, which we'll say is 12,000 blocks by 48,000 blocks, and can be as high as 300 blocks (a very very very very far off day then). Thats approximately 1.728^11 blocks (172,800 Million).

Obviously i'm not going to load up that many blocks, and I'm not asking you to help me regarding making this insane project, but the basic question I have is this:

What is the best way to store (as in file storage) this information?

I have decided to partition this data into sets (or minecraft chunks) to make it more easy to deal with. So "blocks" have chunk x and y coordinates, then the blocks have x,y and z coordinates within the chunk. At present, all my explorations are based around the size of a single chunk. There is also a last value to store which, we'll call the blocktype, which im giving a range of 0-9999. Now let me show you my test program:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
// basic file operations
#include <iostream>
#include <fstream>
#include <cmath>

using namespace std;

int chunk_x = rand() % 240;
int chunk_y = rand() % 960;
int pos_x = rand() % 50;
int pos_y = rand() % 50;
int pos_z = rand() % 300;
int blocktype = rand() % 9999;

int main () {
  ofstream myfile;
  myfile.open ("data.txt");
  myfile << chunk_x << chunk_y << pos_x << pos_y << pos_z << blocktype << "\n";
  myfile.close();
  return 0;
}


At present, this generates a file 17bytes in size. Very small, but when you multiply that by 172,800 Million, that comes to over 2,700 GigaBytes of information. Hard disk drives aren't that big! Is there any particular thing I could do to compress, or simply reduce the bytes-per-block? Storing perhaps as hexadecimal information is perhaps an option, but that only allows up to 255 numbers per two digits, so would not really reduce the characters.

Also, not forgetting, the file will also have to be read, with a decimal number string, as per the code above, its easy to tell the program "the first 3 digits are variable ##, the next 3 digits are variable ##" etc, I'm not exactly sure how one would go about reading the data if presented in other number systems...

Any advise or signposting in the right direction is appreciated!
Last edited on
What about something like this?
1
2
3
4
5
6
7
8
9
10
struct block
{	unsigned chunk_x : 8;	//	256
	unsigned chunk_y : 10;	//	1024
	unsigned chunk_z : 4;	//	16
	unsigned pos_x : 6;		//	64
	unsigned pos_y : 6;		//	64
	unsigned pos_z : 4;		//	16
	unsigned block : 13;	//  8K
	unsigned reserved : 13;	
};

This reduces the size of a block to 8 bytes. Note that you want to read and write this as binary data.
1
2
  block b; 
  myfile.write (&b, sizeof(b));


I noticed that you did not have a z coordinate for chunks. I took the liberty of adding one.

Hard disk drives aren't that big!

Hard drives are in terabytes these days.
The size of the entire map isn't extremely relevant, since you'll have to make it sparse one way or another.
Thanks very much abstract that seems to work. At the risk of going off topic, as i've been looking into structs a little bit already, these seem a lot like data types from my old Basic language, here's the example from the help page for data types:

(Note this isnt supposed to be c++ code)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
; Define the CHAIR Type

Type CHAIR
Field X
Field Y
Field HEIGHT
End Type

; Create 100 new chairs using FOR ... NEXT using the collection name of ROOM

For tempx = 1 to 10
For tempy = 1 to 10
room.chair = New Chair
room\x = tempx
room\y = tempy
room\height = Rnd(0,10) ; set a random height 0 to 10
Next
Next

; Move them all over 1 (like the description example)

For room.chair = Each chair
room\x = room\x + 1
Next


In the above example inside the for loops, several chair instances are defined, as the loop resets a new instance is created, which becomes room.chair the old room.chair however, still exists, but is disassociated from the room identifier (so can no longer be found unless a field in the struct definition uniquely identifies it).

Do structs work in exactly the same fashion? or must all structs be assigned to a unique identifier at all times?

Additionally, note the last part of the code, a for loop which scans through all the struct chair instances. How would one accomplish such a thing in C++?
OK I evidently need help defining this struct its telling me it has no idea what I'm doing, which Is sort of true, heres the code at present:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
// basic file operations
#include <iostream>
#include <fstream>
#include <cmath>

using namespace std;

struct block
{
unsigned chunk_x : 8;
unsigned chunk_y : 10;
unsigned pos_x : 4;
unsigned pos_y : 6;
unsigned pos_z : 6;
unsigned blocktype : 13;
}

block:b;
{
b.chunk_x = rand % 240;
b.chunk_y = rand % 960;
b.pos_x = rand % 50;
b.pos_y = rand % 50;
b.pos_z = rand % 300;
b.blocktype = rand % 9999;
}

int main () {
  ofstream myfile;
  myfile.open ("data_.txt");

  myfile.write (&b, sizeof(b));

myfile.close();
return 0;
}


And the errors it give me are thus:

1
2
3
4
5
6
7
1>  main.cpp
main.cpp(18): error C2470: 'block' : looks like a function definition, but there is no parameter list; skipping apparent body
main.cpp(18): error C2059: syntax error : ';'
main.cpp(19): error C2447: '{' : missing function header (old-style formal list?)
main.cpp(32): error C2065: 'b' : undeclared identifier
main.cpp(32): error C2065: 'b' : undeclared identifier
main.cpp(32): error C2070: ''unknown-type'': illegal sizeof operand


I know the first five errors are probably to do with how I've defined the instance of block, but I have no clue whats wrong with the sizeof(#) command!
You have a few problems.

Line 16 needs a ;

Lines 18-26: That looks line a function, but line 18 is not a function header.

Lines 20-25: rand is a function call and needs (). BTW, you never call srand() to initialize the random number generator.

Lines 20-25: Your random numbers exceed the number that can be represented in the specified number of bits. This is why I included numbers representing the max value that can be represented.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
#include <iostream>
#include <fstream>
#include <cmath>
using namespace std;

struct block
{
unsigned chunk_x : 8;   // 255
unsigned chunk_y : 10;  // 1024
unsigned pos_x : 4;     // 16
unsigned pos_y : 6;     // 64 
unsigned pos_z : 6;     // 64 
unsigned blocktype : 13;  // 8192
};

void generate_block (block & b)
{ b.chunk_x = rand() % 255;
  b.chunk_y = rand() % 1024;
  b.pos_x = rand() % 16;
  b.pos_y = rand() % 64;
  b.pos_z = rand() % 64;
  b.blocktype = rand() % 8192;
}

int main () 
{ block b;
  ofstream myfile;
  myfile.open ("data_.txt");

  generate_block (b);    
  myfile.write ((char *)&b, sizeof(b));
  myfile.close();
  return 0;
}

Thanks for that, that definately just about covers "storing" this data. However now I need to learn about reading it, and the first step there I believe is being able to validate that its reading the variables correctly, and the first step of that, is being able to print the original values to the screen.

From some research, I was able to gather that I need to use printf to do this. I also did a lot of googling but couldnt find a single example on how to printf a simple unsigned integer. However I did look at the help docs, here: http://www.cplusplus.com/reference/cstdio/printf/?kw=printf

Which says the command is used like "int printf ( const char * format, ... );"

To me that means you start with your variable (const char) and then tell it what (format) its in. I gather for unsigned ints I just use a %u (for unsigned). However, when altering the block generating function thus:

1
2
3
4
5
6
7
8
9
void genBlock (block & b) {
	b.chunk_x = rand() % 240;
	printf(b.chunk_x * %u);
	b.chunk_y = rand() % 1024;
	b.pos_x = rand() % 16;
	b.pos_y = rand() % 64;
	b.pos_z = rand() % 64;
	b.blocktype = rand() % 8192;
}


I get an error
'inputoutput.exe': Loaded 'ImageAtBase0x4aa10000', Loading disabled by Include/Exclude setting.
'inputoutput.exe': Unloaded 'ImageAtBase0x4aa10000'


Can anyone advise on how to make it print the variable out in decimal format so its human-readable? If anyone knows any good, detailed (but easy to follow) help on printf, feel free to share it!
Last edited on
1
2
3
4
unsigned int var=15;
const char *cp = "Hello world";
printf("%u\n", var);
printf("%s. The unsigned is %u\n", cp, var);


Output:
15
Hello world. The unsigned is 15

Ahh thanks for that, I think I understand that function a lot better now, its actually format then variable, not the way the help file said, haha.

I have expanded my program (and also added comment lines to aid reading) to start the file reading process, to read the unsigned ints from the file and attempt to process them. Of course, I've hit another error due to not enough simple examples (this time with fread), but after so many lines of added code, i expected more errors than I got!

Here is the main function:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
int main () {
	//create named struct
	block b;
	//open output filestream and assign a doc to it.
	ofstream myOutfile;
	myOutfile.open ("data_.txt");
	// loop for however many "blocks" to be written to file
	for ( int x=0; x<10; x++)
	{
		genBlock(b);
		myOutfile.write ((char *)&b, sizeof(b));		
	}	
	//end of loop, close file
	myOutfile.close();
	cout << "========File Written======" << endl;
	system("pause"); //pause before proceeding to load.


	// define myInfile as an input filestream and open the file
	ifstream myInfile;
	myInfile.open ("data_.txt");
	
	loadBlock(b,myInfile);
	myInfile.close();
	return 0;
}


It errors in the "loadblock" function of course, so here is that:
1
2
3
4
5
6
7
8
9
void loadBlock (block & b, istream& mif) {
	// load variables from the mif, and report them to the user for verification
	b.chunk_x = fread(b,unsigned,1,mif);
	b.chunk_y = rand() % 1024;
	b.pos_x = rand() % 16;
	b.pos_y = rand() % 64;
	b.pos_z = rand() % 64;
	b.blocktype = rand() % 8192;
}


Can anyone help with the correct way use fread to just read one unsigned integer from the file, and assign it to the preceeding struct element?

(Also, am i correct in thinking fread will read a given set of data, and then if called again, read the next data from the point it left off?)
You have a couple of problems with loadBlock.

1) You don't want to use fread here. fread operates on a FILE, however, mif is an istream. You want to use mif.read instead.

2) mif.read (and fread for that matter) read a specified number of bytes. What you want to do is read the entire block, just the way you wrote the block. You can't use read (or fread) to read in bit fields individually.

3) You're recalculating the remaining fields in the block (lines 4-8).

1
2
3
4
bool loadBlock (block & b, istream& mif) 
{  mif.read ((char *)&b, sizeof(b)); 
    return mif.good();
}

Note: return type of loadBlock changed to return the status of the read.



Ahh yes apologies, I simply hadn't rejigged the further lines of code (re: point 3)

A curious question arises though, regarded to the above changes to loadBlock as you have redefined it, above. It seems to me now, that it will now be reading in a string of unsinged ints. What is the best way to split them back up into their original 6 variables, to assign to the block struct's values again?
No, loadBlock doesn't read a string of unsigned ints. It reads a series of raw bytes. Forget about the description of the individual fields in the struct. read and write have no visibility to the structure of the block.

The file should be opened in binary mode. read and write will then transfer the specified number of bytes that make up a block. As I defined the struct above, that will be 8 bytes. As long as the alignment of the fields agrees between the time when you wrote the file and when you read the file, everything will line up.



Ok I read what you wrote and have thought on it (thank you for your assistance as well), but I'm just a little confused. I am curious as to how, from...

1
2
3
4
bool loadBlock (block & b, istream& mif) 
{  mif.read ((char *)&b, sizeof(b)); 
    return mif.good();
}


...that C++ is going to know which ints should go in which elements of the block struct?

From what I understand of it, between the save-to-file part of the operation and this new loadBlock function, C++ will remember the block structure defined as "b" and the order its elements were created in, as defined in genBlock() and automatically load them back in from file in the same order?

I have also attached my thusfar complete program, to verify what I have queried above:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
#include <iostream>
#include <fstream>
#include <cmath>

using namespace std;

struct block
{
unsigned chunk_x : 8;
unsigned chunk_y : 10;
unsigned pos_x : 4;
unsigned pos_y : 6;
unsigned pos_z : 6;
unsigned blocktype : 13;
};

void genBlock (block & b) {
	// assign random variables, and report them to the user for verification
	b.chunk_x = rand() % 240;
	b.chunk_y = rand() % 1024;
	b.pos_x = rand() % 16;
	b.pos_y = rand() % 64;
	b.pos_z = rand() % 64;
	b.blocktype = rand() % 8192;

	cout << "Orig - chunk_x =";
	printf("%u\n", b.chunk_x);
	cout << "Orig - chunk_y =";	
	printf("%u\n", b.chunk_y);
	cout << "Orig - block_x =";
	printf("%u\n", b.pos_x );
	cout << "Orig - block_y =";
	printf("%u\n", b.pos_y);
	cout << "Orig - block_z =";
	printf("%u\n", b.pos_z);
	cout << "Orig - block_TYPE =";
	printf("%u\n", b.blocktype);
	cout << "========End of original======";
}

bool loadBlock (block & b, istream& mif) 
{  mif.read ((char *)&b, sizeof(b)); 
	cout << "NEW - chunk_x =";
	printf("%u\n", b.chunk_x);
	cout << "NEW - chunk_y =";	
	printf("%u\n", b.chunk_y);
	cout << "NEW - block_x =";
	printf("%u\n", b.pos_x );
	cout << "NEW - block_y =";
	printf("%u\n", b.pos_y);
	cout << "NEW - block_z =";
	printf("%u\n", b.pos_z);
	cout << "NEW - block_TYPE =";
	printf("%u\n", b.blocktype);
	cout << "========End of original======" << endl;
    return mif.good();
}

int main () {
	//create named struct
	block b;
	//open output filestream and assign a doc to it.
	ofstream myOutfile;
	myOutfile.open ("data_.txt");
	// loop for however many "blocks" to be written to file
	for ( int x=0; x<1; x++)
	{
		genBlock(b);
		myOutfile.write ((char *)&b, sizeof(b));		
	}	
	//end of loop, close file
	myOutfile.close();
	cout << "========File Written======" << endl;
	system("pause"); //pause before proceeding to load.


	// define myInfile as an input filestream and open the file
	ifstream myInfile;
	myInfile.open ("data_.txt");
	
	loadBlock(b,myInfile);
	myInfile.close();
	system("pause"); //pause before proceeding to quit.
	return 0;
}
that C++ is going to know which ints should go in which elements of the block struct?

It doesn't. Remember I said above "Forget about the description of the individual fields in the struct"

You have to trust that writing out 8 raw bytes then reading back the same 8 raw bytes results in the same values in the structure.

If we look at the way the fields are allocated in block.

1
2
3
4
5
6
7
8
9
10
  +01234567+
0:|aaaaaaaa|
1:|bbbbbbbb|
2:|bbccccdd|
3:|ddddeeee|
4:|eeffffff|
5:|fffffffu|
6:|uuuuuuuu|
7:|uuuuuuuu|
  

a = chunk_x
b = chunk_y
c = pos_x
d = pos_y
e = pos_z
f = blocktype
u = unused

We write out the 8 raw bytes (64 bits), the data is written as shown above.
When we read back 8 raw bytes, we get exactly the same data. Everything lines up as it was.



OK, well i dont like to operate on faith, but I have done some testing (setting all values to 0 between write and read operations) and its definitely working and reading the values from the file to the right places. So this is all good for me now!

The only other question I have is in the way structs are generated from my earlier post:
http://www.cplusplus.com/forum/beginner/140904/#msg744681

Namely:

Obviously the write operation can run over and over as the data is being written out to a file as well, so all data is stored there.

But what happens if I were to run the read-from-file section over and over? Would it just keep overriding the same b/block instance over and over. How would I/ can i also make C++ read the data and make a new block-struct instance for each "block" read from a file?

I do see there is a "new" command in C++ and it IS used with structs, would I use something along the lines of

block * b = new (std::nothrow) block[1];
or even:
block * b = new block;
to create a new instance of the block?

I am of course, trying to find a way I can have the better part of a hundred thousand of these structs in memory, without having to figure out a way to give a unique block-struct identifying name (heretofore always "b")....
"Taking it on faith" was a figure of speech. Hopefully, the diagram showing how the bit fields were allocated convinced you that the struct occupies 8 bytes and that writing and reading the 8 byte struct returns the same data.

Yes, structs are roughly comparable to types in Basic. You have a lot more control over how structs are allocated in C++ allowing you to allocate fields down to the bit level as I showed in the diagram.

Yes, reading into b over and over would result in overwriting b each time.
So you want to read into a different instance of b each time. The classic way to do this is with an array or STL container such as a std::vector.

Let's take the array example first.
1
2
3
4
5
6
7
8
9
10
11
12
 
const int MAX_BLOCKS = 10000;
block blocks[MAX_BLOCKS];  // Allocate an array of blocks
int n = 0;
//
  while (loadBlock (blocks[n], mif)
  {  n++;  
      if (n > MAX_BLOCKS)
      { cout << "Too many blocks in data file" << endl;
         exit (1);
     }
  }


Using a vector is generally preferred since there is no predetermined lmit on the number of entries in a vector and the vector tracks the number of entries automatically. You are of course restrained by physical memory.
However, 100,000 blocks * 8 bytes = 800KB which should not be an issue.
1
2
3
4
5
  vector<block> vblocks;
  block b;
//
  while (loadBlock (b, mif)
    vblocks.push_back(b);   // push a copy of b onto the vector 


Other containers such as std::map or std::multimap might also be appropriate since a map or multimap would give you fast access to a chunck or block based on its key (chunk_x, chunk_y). I'm not going to give an example of this here as I think this might be beyond where you are right now. If you're interested in this approach, post back and I'll put together an example.

Yes, you can allocate individual blocks with new. Problem with this is that new allocates memory from the heap and returns a pointer to the block, you then have to keep track of the pointers in an array or some kind of container and deallocate the objects when you're done to avoid a memory leak. It's simpler to push a copy of the object onto the vector as I showed you above.





Topic archived. No new replies allowed.