Make class with constant size. Help!

I am learning to write to and read from a binary file. I want all objects to be the same size, which means the array sizes in the class have to be fixed at declaration so that they stay the same no matter how much data is stored in the arrays. The problem is that I get an error and it does not allow me to do this. See code below:

#ifndef PERSON_H
#define PERSON_H

class Person {
	public:
		Person() {}
		char name[20];
		char adress[30];
	protected:
	private:
};

#endif 


I always want the char arrays "name" and "adress" to be the same size in all objects which are created from this class as you can see. What alternatives are there to solve this?

The reason I want all objects to be the same size is that it lets me read each Person object's data back from the binary file by offset: if every object occupies a constant number of bytes, I always know where one record ends and the next begins.

IMO, you should never ever write classes (or even structs) in full to a file directly.

The reason for this is because the compiler might modify the layout of the class/struct to make it more efficient (a typical example of this is padding elements to 4-byte boundaries), which could make files written in one version of the program load improperly when read in another version of the program (if the compiler decides to rearrange the struct between versions).

You should only write basic data types (int, char, short, etc) directly. But even that is "iffy" because you're still subject to varying type sizes (sizeof(int) can vary on different machines) and system endianness (0x12345678 might get read back as 0x78563412).

To make it foolproof, you should only write individual bytes (chars are also OK) to the file in the order you want them written. This means that writing struct-at-a-time is not suitable.
closed account (zb0S216C)
@Disch: Is serialisation an option? Or does that have the same problems with alignment and endian-ness?

Wazzak
Could you describe what you mean by "in full directly to a file"?
What is a "padding element"?

Do you mean by "in full directly to a file" that I define the constructor directly in the class instead of defining it in a Person.cpp file, or what? If so, I did it only because I wanted to test this quickly and did not know that it was so bad.

Still, how do I solve the problem with the arrays, if I want char arrays that always are the same size in all objects?

Please do also answer the questions above about your reply Disch, I do not understand all terms.

I find it strange that you say that it is unsuitable to write entire structs at a time; according to my book it is OK (it does not mention risks), and the author even uses it in an example with the functions write() and read(). Was this in the old days? The book is from 2007. Were there totally different techniques then, and should I get a new book?

Thanks a lot for your help!

Framework wrote:
Is serialisation an option? Or does that have the same problems with alignment and endian-ness?


Serialization libs often store each element individually as text precisely to avoid these issues. But it depends entirely on the lib.

Any serialization lib that does not address these issues is not worth using... as these are the most basic of the basics for storing binary data consistently.

Zerpent wrote:
Could you describe what you mean by "in full directly to a file"?


Something like this is "bad":
struct foo
{
  int bar;
  int baz;
};

//...

foo obj;
fwrite( &obj, 1, sizeof(foo), myfile );  // cstdio ... or...
myfile.write( (char*)&obj, sizeof(foo) );  // iostream 


The better way would be to write obj.bar and obj.baz individually. And the best way would be to break obj.bar and obj.baz into individual bytes and write those bytes separately.
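A rough sketch of that member-at-a-time approach (my own helper names, not from the thread): each field is written with its own fwrite call, so struct padding never reaches the file. It is still subject to the sizeof(int) and endianness caveats discussed below.

```cpp
#include <cstdio>

struct foo
{
  int bar;
  int baz;
};

// Sketch: write each member separately instead of the whole struct.
// This removes the dependency on struct layout/padding, though it is
// still not size- or endian-proof (the byte-level helpers later in
// the thread fix that too).
void write_foo(std::FILE* f, const foo& obj)
{
  std::fwrite(&obj.bar, sizeof obj.bar, 1, f);
  std::fwrite(&obj.baz, sizeof obj.baz, 1, f);
}

void read_foo(std::FILE* f, foo& obj)
{
  if (std::fread(&obj.bar, sizeof obj.bar, 1, f) != 1) return;
  if (std::fread(&obj.baz, sizeof obj.baz, 1, f) != 1) return;
}
```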

Zerpent wrote:
What is a "padding element"?


Basic example:

struct foo
{
  char a;
  int b;
};

foo obj;
cout << sizeof(obj.a);  // output:  1
cout << sizeof(obj.b);  // likely output:  4
cout << sizeof(obj);    // since sizeof(a)+sizeof(b) is 5, you'd expect this to be 5,
                        // but it is probably actually 8


The reason for that oddity is that compilers will put "gaps" or "padding" between elements so that they land on neatly aligned memory boundaries. This allows memory accesses to be much faster at the cost of a few extra bytes of RAM. Padding might also be required by the architecture... there may be situations where data must be aligned on a certain boundary or it cannot be accessed at all.

Compilers often let you disable this padding with a #pragma directive, but that is not portable and potentially gives you a performance hit.
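For illustration, the usual form is #pragma pack, which MSVC, GCC and Clang all accept; this is a compiler-specific sketch, not standard C++:

```cpp
#include <cstdint>

struct padded           // normal layout: likely 8 bytes, due to padding after 'a'
{
  char a;
  std::uint32_t b;
};

#pragma pack(push, 1)   // compiler-specific: request 1-byte packing
struct packed           // now 5 bytes... but member access may be slower
{
  char a;
  std::uint32_t b;
};
#pragma pack(pop)       // restore the previous packing
```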

Do you mean by "in full directly to a file" that I define the constructor directly in the class instead of defining it in a Person.cpp file, or what? If so, I did it only because I wanted to test this quickly and did not know that it was so bad.


No, I mean what I showed in my first example.

If this is a class with member functions, constructors, etc... then writing in the above style is even worse because there might be other "hidden" data members that the compiler uses. If the class is polymorphic, for example (ie: has virtual functions), there's an additional "vtable" that is part of the class which you absolutely should not mess with or you risk causing your program to crash and/or introducing memory leaks.

Still, how do I solve the problem with the arrays, if I want char arrays that always are the same size in all objects?


If you make your char array have 20 or 30 elements as you did in your original post... then it will always have 20 or 30 elements for all objects (they'll always be the same size for all objects). So it's a non-issue.

I find it strange that you say that it is unsuitable to write entire structs at a time; according to my book it is OK (it does not mention risks), and the author even uses it in an example with the functions write() and read(). Was this in the old days?


It's a quick-'n'-dirty way to do it, and will work in most applications (provided you follow some rules that I'm sure the book didn't mention). But it does come with some caveats.

The rule to follow:
The struct must be a POD type. This means:
-a) It cannot have any members that are pointers
-b) It cannot have any virtual functions
-c) It cannot derive from any parent class that is not a POD type
-d) It cannot have any members that are not POD types (stl containers like strings, lists, etc are NOT POD types, and trying to read/write them in this way will fail catastrophically)

Caveats and things to be mindful of when doing this:

1) Say you build your program in version X of a compiler, save a file, then rebuild your program in version Y of the compiler and try to load it. If X and Y arrange the struct the same way in memory, you will be OK. If they don't, reading it "raw" like this means it gets read incorrectly. Your actual file format becomes dependent on the version of the compiler you're using... which makes it very flaky.

2) Once you save a file... you cannot modify your struct ever again or you risk invalidating all files saved with the old format.

3) POD TYPES. I can't stress this point enough. C++ programs tend to use non-POD types everywhere. strings, vectors, etc are all extremely common. But if you read/write structs this way, you simply can't use them. Which means you are crippling yourself .. and all you get out of it is the ability to do half-assed file I/O. Much better to do I/O properly so you don't risk having these issues, and you can use whatever types you want.

Your book is telling you to use char arrays (ugh!) instead of strings because it knows you can't do file I/O with strings this way. But that sucks because using char arrays is more error prone, more difficult to use, more prone to buffer overflows, etc, etc.

4) System endianness. Most home PC systems now are little endian, which means the value 0x12345678 gets written to the file as 78 56 34 12 (low byte first). If you save this file on a little endian system, then load the file on a big endian system, it will get loaded incorrectly (it will load as 0x78563412 -- high byte first). So doing this destroys the portability of your data.

The book is from 2007. Were there totally different techniques then and should i get a new book?


I can't speak for the whole book, but the section on binary file I/O certainly seems half-assed if it isn't telling you the dangers with this approach.

Honestly I've seen enough poorly written (and just flat out incorrect) C++ books that this kind of thing doesn't surprise me anymore. I wouldn't be able to recommend any better books.
DISCLAIMER:

I recognize that I am being quite anal. For all intents and purposes, doing what the book is telling you to do will work just fine for quick one-off programs.

But if you are making a serious program that you plan on distributing, and want to be consistent and portable.... then the book's approach is not sufficient at all (at least not IMO -- not by my personal standards).


****===The right way to do binary file I/O===****

1) Define your building blocks

Binary files are, at their core, nothing more than a series of bytes. This means that anything larger than a byte (read: nearly everything) needs to be defined in terms of bytes. For most basic types this is simple.

C++ offers a few integral types that are commonly used. There's char, short, int and long (among others).

The problem with these types is that their size is not well defined. int might be 8 bytes on one machine, but only 4 bytes on another. The only one that's consistent is char... which is guaranteed to always be 1 byte.

For your files, you'll need to define your own integral types.
Here are some basics:

u8 = unsigned 8-bit (1 byte) (ie: unsigned char)
u16 = unsigned 16-bit (2 bytes) (ie: unsigned short -- usually)
u32 = unsigned 32-bit (4 bytes) (ie: unsigned int -- usually)
s8, s16, s32 = signed version of the above

u8 and s8 are both 1 byte, so they don't really need to be defined. They can just be stored "as is". But for larger types you need to pick an endianness.

Let's go with little endian for this example, which means a 2-byte variable (u16) is going to be stored low byte first, and high byte second. So the value 0x1122 will be seen in the file as 22 11 when the file is examined in a hex editor.

An example way to safely read/write u16's with iostream:

u16 ReadU16(istream& file)
{
  u16 val;
  u8 bytes[2];

  file.read( (char*)bytes, 2 );  // read 2 bytes from the file
  val = bytes[0] | (bytes[1] << 8);  // construct the 16-bit value from those bytes

  return val;
}

void WriteU16(ostream& file, u16 val)
{
  u8 bytes[2];

  // extract the individual bytes from our value
  bytes[0] = (val) & 0xFF;  // low byte
  bytes[1] = (val >> 8) & 0xFF;  // high byte

  // write those bytes to the file
  file.write( (char*)bytes, 2 );
}


u32 would be the same way, but you would break it down and reconstruct it in 4 bytes rather than 2.
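Spelled out, the 4-byte version follows the same pattern (u8/u32 assumed typedef'd to the fixed-width types as described later in the thread):

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>

typedef std::uint8_t  u8;
typedef std::uint32_t u32;

// Little-endian u32: low byte first, exactly like the u16 version above.
u32 ReadU32(std::istream& file)
{
  u8 bytes[4];
  file.read((char*)bytes, 4);                // read 4 bytes from the file
  return  (u32)bytes[0]                      // reassemble, low byte first
        | ((u32)bytes[1] << 8)
        | ((u32)bytes[2] << 16)
        | ((u32)bytes[3] << 24);
}

void WriteU32(std::ostream& file, u32 val)
{
  u8 bytes[4];
  bytes[0] =  val        & 0xFF;             // lowest byte
  bytes[1] = (val >> 8)  & 0xFF;
  bytes[2] = (val >> 16) & 0xFF;
  bytes[3] = (val >> 24) & 0xFF;             // highest byte
  file.write((char*)bytes, 4);
}
```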


2) Define your complex types

Strings are the main one here, so that's what I'll go over.

There are a few ways to store strings.

1) You can say they are fixed width. IE: your strings will be stored with a width of 128 bytes. If the actual string is shorter, the file will be padded. If the actual string is longer, the data written to the file will be truncated (lost).
- advantages: easiest to implement
- disadvantages: inefficient use of file space if you have lots of small strings; strings have a restrictive maximum length.

2) You can use the c-string 'null terminator' to mark the end of the string
- advantages: strings of any length.
- disadvantages: cannot have null characters embedded in your strings. If your strings contain a null character when written, it will cause the file to be loaded incorrectly. Probably the most difficult to implement

3) You can write a u32 specifying the length of the string, then write the string data after it.
- advantages: strings of any length, can contain any characters (even nulls).
- disadvantages: 4 extra bytes for each string makes it ever so slightly less space efficient than approach #2 (but not really).


I tend to prefer option #3. Here's an example of how to reliably read/write strings to a binary file:

string ReadString(istream& file)
{
  u32 len = ReadU32(file);

  char* buffer = new char[len];
  file.read(buffer, len);

  string str( buffer, len );
  delete[] buffer;

  return str;
}

void WriteString(ostream& file, const string& str)
{
  u32 len = str.length();

  WriteU32(file, len);
  file.write( str.c_str(), len );
}



vectors/lists/etc could be handled the same way. You start by writing the size as a u32, then you read/write that many individual elements to the file.
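A sketch of that for a vector<u16> (my own function names; the u16/u32 helpers from earlier are repeated here so the snippet stands alone):

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

typedef std::uint8_t  u8;
typedef std::uint16_t u16;
typedef std::uint32_t u32;

// Little-endian helpers, as described earlier in the thread.
u16 ReadU16(std::istream& file)
{
  u8 b[2];
  file.read((char*)b, 2);
  return b[0] | (b[1] << 8);
}

void WriteU16(std::ostream& file, u16 val)
{
  u8 b[2] = { (u8)(val & 0xFF), (u8)((val >> 8) & 0xFF) };
  file.write((char*)b, 2);
}

u32 ReadU32(std::istream& file)
{
  u8 b[4];
  file.read((char*)b, 4);
  return b[0] | (b[1] << 8) | ((u32)b[2] << 16) | ((u32)b[3] << 24);
}

void WriteU32(std::ostream& file, u32 val)
{
  u8 b[4] = { (u8)(val & 0xFF),         (u8)((val >> 8) & 0xFF),
              (u8)((val >> 16) & 0xFF), (u8)((val >> 24) & 0xFF) };
  file.write((char*)b, 4);
}

// Length-prefixed vector: a u32 element count, then each element.
std::vector<u16> ReadVecU16(std::istream& file)
{
  u32 len = ReadU32(file);
  std::vector<u16> v;
  v.reserve(len);
  for (u32 i = 0; i < len; ++i)
    v.push_back(ReadU16(file));
  return v;
}

void WriteVecU16(std::ostream& file, const std::vector<u16>& v)
{
  WriteU32(file, (u32)v.size());
  for (u32 i = 0; i < v.size(); ++i)
    WriteU16(file, v[i]);
}
```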


3) Define your file format

This is the meat. Now that you have your terms defined, you can construct how you want your file to look. I break out a text editor and outline it on a page that looks something like this:

char[4]      header     "MyFi" - identifies this file as my kind of file
u32          version    1 for this version of the spec

u32          foo        some data
string       bar        some more data
vector<u16>  baz        some more data
...


This outlines how the file will look/behave. Say for example you look at this file in a hex editor and you see this:

4D 79 46 69 01 00 00 00  06 94 00 00 03 00 00 00
4D 6F 6F 02 00 00 00 EF  BE 0D F0


Since the file format is so clearly defined, just examining this file will tell you exactly what it contains.

First 4 bytes: 4D 79 46 69 - these are the ascii codes for the string "MyFi", which identifies this file as our kind of file (as opposed to a wav or mp3 file or something, which would have a different header)

Next 4 bytes: 01 00 00 00 - the literal value of 1, indicating this file is 'version 1'. Should you decide to revise this file format later, you can use this version number to support reading of older files.

Next 4 bytes are for our 'foo' data: 06 94 00 00 means that foo==0x9406

After that is a string ('bar'). string starts with 4 bytes to indicate the length: 03 00 00 00 indicating a length of 3. So the next 3 bytes 4D 6F 6F form the ascii data for the string (in this case: "Moo")

After that is our vector ('baz'). Same idea... start with 4 bytes to indicate length: 02 00 00 00, indicating a length of 2
Then there are 2 u16's in the file. The first one is EF BE (0xBEEF), and the second one is 0D F0 (0xF00D)
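To double-check that walkthrough, here's a little hand-parser for exactly those 27 bytes (my own helper and struct names; it hard-codes the example layout rather than being a general reader):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Decode 4 little-endian bytes into a u32.
inline std::uint32_t read_u32(const unsigned char* p)
{
  return p[0] | (p[1] << 8) | ((std::uint32_t)p[2] << 16)
       | ((std::uint32_t)p[3] << 24);
}

struct ParsedFile
{
  std::string                header;   // "MyFi"
  std::uint32_t              version;  // 1
  std::uint32_t              foo;
  std::string                bar;
  std::vector<std::uint16_t> baz;
};

// Walk the byte dump field by field, exactly as the spec above lays it out.
inline ParsedFile parse_example(const unsigned char* p)
{
  ParsedFile out;
  out.header.assign((const char*)p, 4);        p += 4;
  out.version = read_u32(p);                   p += 4;
  out.foo     = read_u32(p);                   p += 4;
  std::uint32_t slen = read_u32(p);            p += 4;   // string length prefix
  out.bar.assign((const char*)p, slen);        p += slen;
  std::uint32_t vlen = read_u32(p);            p += 4;   // vector length prefix
  for (std::uint32_t i = 0; i < vlen; ++i, p += 2)
    out.baz.push_back(p[0] | (p[1] << 8));               // little-endian u16
  return out;
}
```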




You'll find that all common binary file formats like .zip, .rar, .mp3, .wav, .bmp, etc, etc are defined this way. It leaves absolutely nothing to chance.
I really should just make these articles instead of forum posts. Gargle. Anyone want to transcribe this to an article for me? I'm too lazy to do it now.
closed account (j2NvC542)
Thank you for those amazing explanations, Disch!
Yeah thank you Disch for giving me an article as answer and for taking the time to answer all my questions. You are awesome.
I am now trying to do it myself according to what Disch shows.
When I try to define my own types like: -

u8 = unsigned 8-bit;
u16 = unsigned 16-bit;
u32 = unsigned 32-bit;


it does not accept that syntax. Was it the syntax that you showed above?
closed account (j2NvC542)
Yeah, I'd too like to know what the types, that are always the same size, are.
#include <cstdint> // C++11, but many compilers have it as <stdint.h> before that.  Try both

typedef uint8_t   u8;
typedef uint16_t  u16;
typedef uint32_t  u32;
typedef int8_t    s8;
typedef int16_t   s16;
typedef int32_t   s32;


Or you can just use uint8_t, etc directly. Though I've personally grown more familiar with 'u8', etc.

EDIT: mixed up stdint with SDL's int types. whoops.
The fixed-width types are called uint8_t, int8_t, uint16_t, int16_t, uint32_t, int32_t, uint64_t and int64_t. They are defined in <cstdint> in C++ since 2011 and in <stdint.h> in C since 1999.

EDIT: Too slow!
Okay, for added value: if your system supplies the POSIX host-to-network byte order functions (htons(), htonl(), etc), you could use them to ensure that the byte order of the output is consistent. Here's what they do: http://pubs.opengroup.org/onlinepubs/9699919799/functions/htonl.html
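Sketch of the round trip (POSIX header assumed; Windows has the same functions in <winsock2.h>):

```cpp
#include <arpa/inet.h>  // POSIX byte-order helpers: htonl/ntohl
#include <cstdint>

// htonl() converts a 32-bit value from host byte order to big-endian
// "network" order; ntohl() converts back. If you always write the
// htonl()'d value, the on-disk byte order is the same on every host.
inline std::uint32_t to_wire(std::uint32_t host_val)   { return htonl(host_val); }
inline std::uint32_t from_wire(std::uint32_t wire_val) { return ntohl(wire_val); }
```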

And, again, consider using text instead of inventing a new binary file format - XML is a popular choice (e.g. boost.serialization generates it)
Ok thank you a lot, I will try again with this.
How do the following syntaxes work?

 
val = bytes[0] | (bytes[1] << 8);

What does the | and << do in this case and why an 8 after the <<?

 
bytes[0] = (val) & 0xFF;

?

 
bytes[1] = (val >> 8) & 0xFF;

?




The only one that's consistent is char... which is guaranteed to always be 1 byte.
Oh, ¿but what is a byte? xP

¿Would u16 {Read,Write}U16(istream& file); slow down array serialization?


@OP: those are bit operations
| is bitor
<< is shift (to the left)

What they are doing is "concatenating" the bytes to form an integer (taking into account endianness).
ne555 wrote:
¿but what is a byte?

I'll bite:

The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation defined.


implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign.

number of bits for smallest object that is not a bit-field (byte)
CHAR_BIT 8


So it's a sequence of at least 8 bits.
Could anyone please explain how the syntax works in the examples I quoted from Disch's reply in my last post? I can make it work, but I want to understand the syntax and the logic behind it rather than just copy-pasting code.
Do you know how bitwise operators work?

I explain | and & here:
http://www.cplusplus.com/forum/beginner/72705/#msg387899

>> and << simply "shift" the bits right or left. So 0xFF00 >> 8 is equal to 0x00FF because all bits have been shifted to the right 8 positions.

1 byte = 8 bits, which is why I'm shifting by 8.

Reading:
 
val = bytes[0] | (bytes[1] << 8); 


We know that bytes[0] is the low byte (0xLL). bytes[1] is the high byte (0xHH)

By doing bytes[1] << 8 we are shifting the high byte left 8 places, effectively making it 0xHH00
We then OR that value with the low byte to "combine" them, giving us:
0xHHLL

Writing:
bytes[0] = (val) & 0xFF; 
bytes[1] = (val >> 8) & 0xFF;


This is the inverse operation. We know that 'val' is our 16-bit value in the form 0xHHLL.

val & 0xFF masks off all but the low 8 bits, effectively changing 0xHHLL to 0x00LL -- which is our desired low byte.

val >> 8 shifts the entire value right 8 positions, changing 0xHHLL to 0x00HH. The extra & 0xFF ensures that we are only taking 8 bits (it probably isn't needed in this case, but I put it there anyway because it doesn't hurt).

The end result is that bytes[0] gets set to 0xLL
and bytes[1] gets set to 0xHH, which is what we want.
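Plugging concrete numbers into those two snippets (my own wrapper functions, same logic as Disch's code):

```cpp
#include <cstdint>

// Split 0xHHLL into its two bytes: bytes[0] gets 0xLL, bytes[1] gets 0xHH.
inline void split_u16(std::uint16_t val, std::uint8_t bytes[2])
{
  bytes[0] =  val       & 0xFF;   // mask off the high byte: 0x1234 -> 0x34
  bytes[1] = (val >> 8) & 0xFF;   // shift the high byte down: 0x1234 -> 0x12
}

// Recombine: shift the high byte up and OR in the low byte.
inline std::uint16_t join_u16(const std::uint8_t bytes[2])
{
  return bytes[0] | (bytes[1] << 8);   // 0x34 | 0x1200 == 0x1234
}
```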


¿Would u16 {Read,Write}U16(istream& file); slow down array serialization?


Possibly. To optimize that you could write functions which buffer arrays in an endian safe manner and then write the entire buffer at once -- but I find this is not really useful because reads/writes are often buffered in the file i/o implementation anyway.
Just before this reply I watched a video where they explain it, so I know that now. But there they did not use 0xFF with the operators but 0xF0, which was a bit (:D) confusing.

Thanks a lot Disch, I will need to go through this a few times to get everything, but now I have all the info I need.