Custom File Extension

points2008 (31)

How do I create a custom file extension that opens with my Java program?
For example: my Java program will compress a .pdf file to some .abc format, and I want to design a program that opens that .abc file with my Java program.
Please suggest how to do this effectively.

chrisname (7395)

You just have to think about what the file is going to contain and what information programs will need to be able to read it. The first thing to think of is usually a magic number, which goes at the start of the file and is used to identify it. Afterwards, I normally put the major version number of the program that created the file so that older versions of the program can determine whether they support the file format or not (if the version of the program is different to the version of the file, the program might not be able to read it). Then for portability, you should put whether the file is saved in big endian or little endian. Once you've decided what information you need, all that's left is to write the file. That's it. Then you need to reverse the process to read it.

As a simple example, if all you're doing is compressing a PDF file, you might have something like this:

#include <cstdint>

struct abc_header {
    // 4 bytes for the magic number.
    std::uint32_t magic;
    // 4 bytes for the version number of the file.
    std::uint32_t version;
    // 1 bit for the endianness; 0 (big-/high-endian) or 1 (little-/low-endian).
    unsigned endianness : 1;
    // 4 bits for the algorithm used (values 0-15, so up to 16 possible algorithms).
    unsigned algorithm : 4;
    // 3 bits for the compression strength (0-7).
    unsigned strength : 3;
    // 33 bits for a CRC-32 value (lets you check that the header is valid).
    unsigned crc : 33;
    // 23 bits of padding (makes the header exactly 16 bytes long).
    unsigned pad : 23;
};

That gives you a pretty flexible file format to use. There's a magic number to identify the file format and a version number so that the program reading the file knows if it supports the file version. You can have up to 16 different compression algorithms (if you want to support more than one), and the compression strength can be varied (lower is faster to compress and decompress but the file will be larger; higher is slower but the file will be smaller). There's also a CRC value that can be computed to check whether you're looking at a valid file or just random data.
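
As a rough illustration of how a reader might use the magic number and version fields once they have been read, here is a minimal sketch. The constants `abc_magic` and `abc_max_version`, the `header_status` enum, and `check_header` are hypothetical names and values invented for this example; nothing in this thread defines them.

```cpp
#include <cstdint>

// Hypothetical values chosen for this sketch only.
// 0x21434241 is the bytes 'A', 'B', 'C', '!' when stored least significant byte first.
constexpr std::uint32_t abc_magic = 0x21434241u;
// Highest file version this (imaginary) build of the program understands.
constexpr std::uint32_t abc_max_version = 1;

enum class header_status { ok, not_abc_file, too_new };

// Decide whether a file can be opened, given the two fields already read from it.
header_status check_header(std::uint32_t magic, std::uint32_t version)
{
    if (magic != abc_magic)
        return header_status::not_abc_file; // wrong file type or random data
    if (version > abc_max_version)
        return header_status::too_new;      // written by a newer program
    return header_status::ok;
}
```

A program would call `check_header` right after reading the first eight bytes and refuse to continue on anything other than `header_status::ok`.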

Disch (13742)

I don't recommend reading/writing full structs at a time, especially those with bitfields. In fact, I don't even like reading more than 1 byte at a time, as it leaves you subject to system endianness.

chrisname's struct has a bit for endianness... but reading/writing that struct would still be subject to the system's endianness, and multi-byte variables would have to be corrected afterwards. With bitfields, this is particularly nasty because the variables do not necessarily land on a byte boundary ~~(and in fact, in this example, 'crc' doesn't land on a byte boundary... so correcting the byte order would be particularly involved).~~ (I miscounted... chrisname's multi-byte vars -- except for 'pad' -- do in fact land on a byte boundary)

I recommend reading/writing each value individually... and abstracting it so that reads/writes can be broken down into individual bytes. I outline the process in this article:

http://www.cplusplus.com/articles/DzywvCM9/


chrisname (7395)

Disch wrote:
in this example, 'crc' doesn't land on a byte boundary

Does it not? The preceding fields add up to exactly 72 bits = 9 bytes. Does that not make CRC aligned to a byte boundary? Or are you talking about the size of CRC being 33 bits? I suppose you could make the pad bigger so that sizeof(crc) + sizeof(pad) == 8 and treat the whole thing as a std::uint64_t where you ignore bits 33..63.

How would you improve the structure?

Disch (13742)

chrisname wrote:
The preceding fields add up to exactly 72 bits = 9 bytes.

Ah you're right, I miscounted. Apologies.

chrisname wrote:
Or are you talking about the size of CRC being 33 bits?

I wasn't referring to this... but this also confused me. Why 33 and not 32?

chrisname wrote:
How would you improve the structure?

I wouldn't use one. Or at least, I wouldn't do raw I/O on one.

chrisname (7395)

Disch wrote:
I wasn't referring to this... but this also confused me. Why 33 and not 32?

Wikipedia lists all of the CRC polynomials as having an extra bit. I don't know what it's for. That code could be changed to:

std::uint32_t crc;
unsigned pad : 24;

Disch wrote:
I wouldn't use one

What would you do instead? I did read the article you linked, but I got the impression that you would use a struct.

I'm not recommending OP do this:

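// Reads the raw bytes of the file straight into the struct's memory layout.
constexpr std::size_t length = sizeof(abc_header);
char* buffer = new char[length];
file.read(buffer, length);
abc_header* header = reinterpret_cast<abc_header*>(buffer);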

which I assume is what you mean by raw I/O.

Disch (13742)

chrisname wrote:
Wikipedia lists all of the CRC polynomials as having an extra bit.

I'm pretty sure that's talking about the polynomial used to compute the hash. Not the generated hash itself.

The actual CRC value you'd use to identify something is only 32 bits.
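
To make the distinction concrete: the CRC-32 generator polynomial has degree 32, so written out in full it has 33 coefficients, but the top coefficient is always 1 and is left implicit in code. The checksum itself fits in 32 bits. The sketch below is a plain bitwise CRC-32 in the reflected form used by zlib and PNG (constant 0xEDB88320). The function name `crc32` is just a label for this example, and a table-driven version would be faster.

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32, reflected form. 0xEDB88320 is the bit-reversed
// representation of the polynomial 0x04C11DB7; the 33rd (x^32) coefficient
// is implicit, which is why both the constant and the result fit in 32 bits.
std::uint32_t crc32(const unsigned char* data, std::size_t length)
{
    std::uint32_t crc = 0xFFFFFFFFu;          // standard initial value
    for (std::size_t i = 0; i < length; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return crc ^ 0xFFFFFFFFu;                 // standard final inversion
}
```

The result can be stored in a plain `std::uint32_t` field with no bitfield needed.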

chrisname wrote:
I'm not recommending OP do this:

Whoops! My mistake again. =)

I had assumed you were recommending that, as I saw no other reason to use bitfields or to make sure the struct was properly padded (both of those do nothing if you are reading/writing individual variables).

Using bitfields unnecessarily will just hurt performance, as dealing with a 3-bit variable is "unnatural" and requires internal bitshifting and bitmasking to manipulate.
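
For illustration, this is roughly the shifting and masking involved when three small fields share one byte, written out by hand. The bit layout (bit 0 for endianness, bits 1-4 for the algorithm, bits 5-7 for the strength) and the names `abc_flags`, `pack_flags`, and `unpack_flags` are assumptions made for this sketch. Unlike a bitfield, whose layout is left to the compiler, this layout is the same on every platform.

```cpp
#include <cstdint>

// In-memory form: ordinary variables, no bitfields.
struct abc_flags {
    bool little_endian;
    std::uint8_t algorithm; // 0-15
    std::uint8_t strength;  // 0-7
};

// Pack the three fields into one byte for writing to the file.
// Out-of-range values are masked down to the field width.
std::uint8_t pack_flags(const abc_flags& f)
{
    return static_cast<std::uint8_t>(
        (f.little_endian ? 1u : 0u) |
        ((f.algorithm & 0x0Fu) << 1) |
        ((f.strength & 0x07u) << 5));
}

// Split a byte read from the file back into the three fields.
abc_flags unpack_flags(std::uint8_t byte)
{
    abc_flags f;
    f.little_endian = (byte & 0x01u) != 0;
    f.algorithm = static_cast<std::uint8_t>((byte >> 1) & 0x0Fu);
    f.strength = static_cast<std::uint8_t>((byte >> 5) & 0x07u);
    return f;
}
```

The packing cost is paid once per read or write of the header, while the rest of the program works with normal variables.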

So yes I would use a class or struct... but I would not read the struct directly, and would rather read individual bytes at a time.

chrisname (7395)

Yeah, using bitfields was probably a mistake to be honest. I was trying to save space, but at the cost of time, it's probably not a worthwhile investment for the sake of about two extra bytes. There's no point, except for the padding, and that's just to give the header a well-defined exact size which is also a multiple of 8 because I'm a pedant. Though I suppose the extra space could also be converted into three one-byte fields or one two-byte field and one single-byte one, maybe as an optional extension, sort of like how PNG has "chunks" that not all programs have to support to be able to read and write PNG files.

Lowest0ne (1536)

Padding also allows you to add more data to the header without changing its size.

Disch (13742)

Its size doesn't matter unless you plan to do read/write directly to the struct.
