Disch's tutorial to good binary files

===The right way to do binary file I/O===

1) Define your building blocks
Binary files are, at their core, nothing more than a series of bytes. This means that anything larger than a byte (read: nearly everything) needs to be defined in terms of bytes. For most basic types this is simple.

C++ offers a few integral types that are commonly used. There's char, short, int and long (among others).

The problem with these types is that their size is not well defined. int might be 8 bytes on one machine, but only 4 bytes on another. The only one that's consistent is char... which is guaranteed to always be 1 byte.

For your files, you'll need to define your own integral types.
Here are some basics:

u8 = unsigned 8-bit (1 byte) (ie: unsigned char)
u16 = unsigned 16-bit (2 bytes) (ie: unsigned short -- usually)
u32 = unsigned 32-bit (4 bytes) (ie: unsigned int -- usually)
s8, s16, s32 = signed version of the above
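If your compiler supports C++11, one way to get these types without guessing at sizes is to alias the fixed-width types from <cstdint>. This is only a sketch: the names u8/u16/u32/s8/s16/s32 are the ones used in this tutorial, and it assumes the platform provides the exact-width types (nearly all do, but the standard makes them optional).

```cpp
#include <cstdint>

// Fixed-width aliases matching the names used in this tutorial.
typedef std::uint8_t  u8;   // unsigned 8-bit
typedef std::uint16_t u16;  // unsigned 16-bit
typedef std::uint32_t u32;  // unsigned 32-bit
typedef std::int8_t   s8;   // signed 8-bit
typedef std::int16_t  s16;  // signed 16-bit
typedef std::int32_t  s32;  // signed 32-bit

// Compile-time checks: the build fails if a size is not what the file format expects.
static_assert(sizeof(u8)  == 1, "u8 must be 1 byte");
static_assert(sizeof(u16) == 2, "u16 must be 2 bytes");
static_assert(sizeof(u32) == 4, "u32 must be 4 bytes");
```

On an older compiler you could typedef unsigned char / unsigned short / unsigned int instead, but then the sizes are only "usually" right, as noted above.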

u8 and s8 are both 1 byte, so they don't really need to be defined. They can just be stored "as is". But for larger types you need to pick an endianness.

Let's go with little endian for this example, which means a 2-byte variable (u16) is going to be stored low byte first, and high byte second. So the value 0x1122 will be seen in the file as 22 11 when the file is examined in a hex editor.

An example way to safely read/write u16's with iostream:

u16 ReadU16(istream& file)
{
  u16 val;
  u8 bytes[2];

  file.read( (char*)bytes, 2 );  // read 2 bytes from the file
  val = bytes[0] | (bytes[1] << 8);  // construct the 16-bit value from those bytes

  return val;
}

void WriteU16(ostream& file, u16 val)
{
  u8 bytes[2];

  // extract the individual bytes from our value
  bytes[0] = (val) & 0xFF;  // low byte
  bytes[1] = (val >> 8) & 0xFF;  // high byte

  // write those bytes to the file
  file.write( (char*)bytes, 2 );
}

u32 would be the same way, but you would break it down and reconstruct it in 4 bytes rather than 2.

2) Define your complex types
Strings are the main one here, so that's what I'll go over.

There are a few ways to store strings.

1) You can say they are fixed width. IE: your strings will be stored with a width of 128 bytes. If the actual string is shorter, the file will be padded. If the actual string is longer, the data written to the file will be truncated (lost).
- advantages: easiest to implement
- disadvantages: inefficient use of file space if you have lots of small strings, strings have a restrictive maximum length.

2) You can use the c-string 'null terminator' to mark the end of the string
- advantages: strings of any length.
- disadvantages: cannot have null characters embedded in your strings. If your strings contain a null character when written, it will cause the file to be loaded incorrectly. Probably the most difficult to implement.

3) You can write a u32 specifying the length of the string, then write the string data after it.
- advantages: strings of any length, can contain any characters (even nulls).
- disadvantages: 4 extra bytes for each string makes it ever so slightly less space efficient than approach #2 (but not really).

I tend to prefer option #3. Here's an example of how to reliably read/write strings to a binary file:

string ReadString(istream& file)
{
  u32 len = ReadU32(file);

  char* buffer = new char[len];
  file.read(buffer, len);

  string str( buffer, len );
  delete[] buffer;

  return str;
}

void WriteString(ostream& file, string str)
{
  u32 len = str.length();

  WriteU32(file, len);
  file.write( str.c_str(), len );
}

vectors/lists/etc can be handled the same way. You start by reading/writing the size as a u32, then you read/write that many individual elements.

3) Define your file format
This is the meat. Now that you have your terms defined, you can construct how you want your file to look. I break out a text editor and outline it on a page that looks something like this:

char[4]      header     "MyFi" - identifies this file as my kind of file
u32          version    1 for this version of the spec

u32          foo        some data
string       bar        some more data
vector<u16>  baz        some more data
...

This outlines how the file will look/behave. Say for example you look at this file in a hex editor and you see this:

4D 79 46 69 01 00 00 00  06 94 00 00 03 00 00 00
4D 6F 6F 02 00 00 00 EF  BE 0D F0

Since the file format is so clearly defined, just examining this file will tell you exactly what the file contains.

First 4 bytes: 4D 79 46 69 - these are the ascii codes for the string "MyFi", which identifies this file as our kind of file (as opposed to a wav or mp3 file or something, which would have a different header)

Next 4 bytes: 01 00 00 00 - the literal value of 1, indicating this file is 'version 1'. Should you decide to revise this file format later, you can use this version number to support reading of older files.

Next 4 bytes are for our 'foo' data: 06 94 00 00 means that foo==0x9406

After that is a string ('bar'). string starts with 4 bytes to indicate the length: 03 00 00 00 indicating a length of 3. So the next 3 bytes 4D 6F 6F form the ascii data for the string (in this case: "Moo")

After that is our vector ('baz'). Same idea... start with 4 bytes to indicate length: 02 00 00 00, indicating a length of 2
Then there are 2 u16's in the file. The first one is EF BE (0xBEEF), and the second one is 0D F0 (0xF00D)

You'll find that all common binary file formats like .zip, .rar, .mp3, .wav, .bmp, etc, etc are defined this way. It leaves absolutely nothing to chance.

Credits to Disch, who wrote all this, and I just copied it in here because:
(Disch wrote this in the post after the one with the above tutorial)

I really should just make these articles instead of forum posts. Gargle. Anyone want to transcribe this to an article for me? I'm too lazy to do it now.

Well Disch, I transcribed this to an article for you! Hope everyone likes it!