Serialize object to file system with char or wchar_t?

I'm in the process of updating a library (DLL) to support Unicode. One of the classes in the library gets serialized to the file system (struct below).

// Represents a single character within a font.
struct Glyph
{
#if defined(_UNICODE)
    wchar_t Character;
#else
    char Character;
#endif
    float CharacterSpacing;
    float OffsetX;
    float OffsetY;
};


We are currently using std::ofstream to write a Glyph object to a file, but now we have preprocessor directives in place to switch between std::ofstream and std::wofstream. The code to write a Glyph object looks similar to what's below.

Glyph glyph;
// fill out glyph data;
#if defined(_UNICODE)
    std::wofstream fOut(L"glyphData.dat", std::ios::binary | std::ios::out);
    fOut.write(reinterpret_cast<wchar_t*>(&glyph), sizeof(Glyph));
#else
    std::ofstream fOut("glyphData.dat", std::ios::binary | std::ios::out);
    fOut.write(reinterpret_cast<char*>(&glyph), sizeof(Glyph));
#endif
    fOut.flush();


Now my question is: should I write out the Character field separately via std::wofstream or std::ofstream and finish off the remaining fields with std::ofstream, or write the whole object out in one write? Does wchar_t really matter with numeric data (i.e. float, int, double)? After running a few tests and writing out glyph data with std::wofstream and reading it back in with std::wifstream, my data seems all messed up, especially the numeric data. Any time I've needed to worry about Unicode it was just with straight text, not with writing an entire object out. Thanks!
This is a weird subject that is fraught with a lot of misunderstanding. Please forgive me for not directly answering your question, but instead giving you a broader picture of the problem, as it appears to me that you do not fully understand it.

People tend to think that if they change all their character types to wchar_t, then their program will magically support Unicode. That's not quite how it works.

'wchar_t' is not Unicode. It's a wide character. This is a common misconception caused by the way WinAPI uses the somewhat poorly named 'UNICODE' macro to switch the TCHAR typedef between wchar_t and char.

So let's start by defining what Unicode actually is:

Unicode is a system which maps each glyph to a unique numerical identifier (aka, a "code point"). That's it. You can think of it as a giant lookup table... where you give it a code point, and it gives you back a glyph -- or vice versa.

Examples:


glyph = Unicode codepoint
--------------------
a =  U+0061
ɻ =  U+027B
স =  U+09B8
𠀱 = U+20031


There is conceptually no limit to the number of code points that can exist, though realistically any legal codepoint fits within a 32-bit integer. However... a 16-bit integer (like a wchar_t on Windows) is too small, as some codepoints are above U+FFFF (even though those are only rarely used).



The next thing you have to understand is the encoding. Just saying "I have unicode text" isn't specific enough... unicode characters can be represented several different ways. Some of the most common are:

UTF-8, where codepoints are represented by 1 or more single-byte characters

UTF-16, where codepoints are represented by 1 or more two-byte characters

UTF-32, where codepoints are represented by exactly 1 four-byte character.

To get even more specific... UTF-16 and UTF-32 need to have their endianness specified... since on disk, a multi-byte value can be represented in either big or little endian. So you could say that the encodings are really:

UTF-8
UTF-16LE (little endian)
UTF-16BE (big endian)
UTF-32LE
UTF-32BE

UTF-8 doesn't need to concern itself with endianness because its values are only 1 byte wide.


Let's take a look at a simple example of how each encoding represents each codepoint. Let's start with an easy one... U+0061 (a):

UTF-8:     61           (0x61)
UTF-16LE:  61 00        (0x0061)
UTF-16BE:  00 61        (0x0061)
UTF-32LE:  61 00 00 00  (0x00000061)
UTF-32BE:  00 00 00 61  (0x00000061)


Pretty simple. This codepoint can be represented in 1 unit. The only difference between the encodings is how many bytes there are per unit, and the order in which those bytes are written.

Now let's look at a more complex one.... U+027B (ɻ)

UTF-8:     C9 BB        (0xC9, 0xBB)
UTF-16LE:  7B 02        (0x027B)
UTF-16BE:  02 7B        (0x027B)
UTF-32LE:  7B 02 00 00  (0x0000027B)
UTF-32BE:  00 00 02 7B  (0x0000027B)


This codepoint is too large to fit in a single byte, so UTF-8 must use 2 units to represent it. UTF-16 and UTF-32, on the other hand, handle it simply.

Now a big bad boy... U+20031 (𠀱)

UTF-8:     F0 A0 80 B1  (0xF0, 0xA0, 0x80, 0xB1)
UTF-16LE:  40 D8 31 DC  (0xD840, 0xDC31)
UTF-16BE:  D8 40 DC 31  (0xD840, 0xDC31)
UTF-32LE:  31 00 02 00  (0x00020031)
UTF-32BE:  00 02 00 31  (0x00020031)


As you can see, this is too big to fit in a single 16-bit unit... so UTF-16 has to spread it out across two of them. Meanwhile, UTF-8 takes 4 units to represent it.
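
If it helps to see the mechanics behind those tables, here is a minimal sketch (my own illustration, not code from anyone's library) of how a single codepoint gets turned into UTF-8 bytes by hand:

#include <cstdint>
#include <string>

// Encode one Unicode codepoint (up to U+10FFFF) as UTF-8 bytes.
std::string EncodeUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80)                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    else if (cp < 0x800)                 // 2 bytes: 110xxxxx 10xxxxxx
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)               // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else                                 // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

For example, EncodeUtf8(0x027B) produces the two bytes C9 BB, and EncodeUtf8(0x20031) produces F0 A0 80 B1, matching the tables above.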



So what does this mean for your problem?

A few things.

#1 - You need to decide on how you want your data encoded. UTF-8 is common because it is compact for English text, but can be clunky to work with if you are doing text editing, since codepoints are variable length. UTF-32 is the opposite: easy to work with because everything is fixed length, but it uses a lot of space.

#2 - You don't need to use wchar_t's to support Unicode if you don't want. You can do it with normal chars. As long as all relevant code treats your string data as if it were UTF-8 encoded, you'll be fine (there's a small sketch of this after point #4).

#3 - The only time (afaik) that you need to use wchar_t's for Unicode text is when communicating with WinAPI... as it will treat wide strings as UTF-16 encoded strings, but will not treat char strings as UTF-8.

#4 - Just because WinAPI treats wchar_t strings as UTF-16 does not mean other libraries (like STL's wofstream) do. In fact... if memory serves, wofstream will actually try to 'narrow' the string you pass it before using it.
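
To illustrate point #2, here's a minimal sketch of UTF-8 text living in an ordinary std::string and going to disk through a plain narrow stream. The file name and the string bytes are just made up for the example:

#include <fstream>
#include <string>

int main()
{
    // "aɻ" stored as UTF-8 bytes in an ordinary std::string:
    // 'a' is one byte (0x61), 'ɻ' is two bytes (0xC9 0xBB).
    std::string text = "a\xC9\xBB";

    // A plain narrow stream writes those bytes out untouched --
    // no wchar_t and no wofstream needed.
    std::ofstream out("text.utf8", std::ios::binary);
    out.write(text.data(), static_cast<std::streamsize>(text.size()));
}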



So if wide characters are not necessarily Unicode... and if wofstream narrows the strings you give it... then what good is wofstream?

Good question. I still don't know. But I know wofstream is weird and problematic enough that I avoid it entirely.


So as for your actual questions:
Now my question is: should I write out the Character field separately via std::wofstream or std::ofstream and finish off the remaining fields with std::ofstream, or write the whole object out in one write?


I would get rid of wofstream entirely. It does not help you at all in this endeavour.

Does wchar_t really matter with numeric data (i.e. float, int, double)?


wchar_t is a character type. When you pass your struct to ofstream::write, you are giving it a pointer to binary data and saying "treat this data as an array of bytes". It'll just blindly and faithfully write those bytes to disk.

When you give it to wofstream::write... you are saying "treat this data as an array of wchar_ts"... which are larger than a byte. Which makes things weirder.


Either way... you are not dealing with string data... so the cast is somewhat erroneous. Though ofstream (the non-wide one) will be more faithful since it won't mess with the data.
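
To put that in code form, here is a minimal, stripped-down sketch of a round trip with the plain narrow stream (assuming the Glyph struct from the original post, and setting aside padding/endian concerns for the moment):

#include <fstream>

Glyph glyph = {};   // the struct from the original post
// ... fill out glyph data ...

// Write: hand the struct to the stream as an opaque array of bytes.
std::ofstream out("glyphData.dat", std::ios::binary);
out.write(reinterpret_cast<const char*>(&glyph), sizeof(Glyph));
out.close();

// Read it back the same way.
Glyph loaded = {};
std::ifstream in("glyphData.dat", std::ios::binary);
in.read(reinterpret_cast<char*>(&loaded), sizeof(Glyph));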



I went on and on about Unicode in this post... but if you want to know more about binary files... I strongly recommend you get a Hex Editor (a good free one is HxD) and actually look at the files you are creating to see if they match what you expect. I've also written some articles on writing binary files. Links below.

http://www.cplusplus.com/articles/DzywvCM9/
http://www.cplusplus.com/articles/oyhv0pDG/
then what good is wofstream?

I found reading/writing UTF-8 files into/from wstrings to be pretty convenient (on Linux, of course, where UTF-8 is supported and wchar_t is 32-bit)
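
For the record, a minimal sketch of that kind of setup (using std::codecvt_utf8 from <codecvt>, which is C++11 and deprecated in C++17 but still available; the file name is just an example):

#include <fstream>
#include <locale>
#include <codecvt>   // std::codecvt_utf8 (C++11, deprecated in C++17)

int main()
{
    std::wofstream out("text.utf8");
    // Imbue a locale whose conversion facet turns each wchar_t
    // into UTF-8 bytes on the way to the file.
    out.imbue(std::locale(out.getloc(), new std::codecvt_utf8<wchar_t>));
    out << L"\u027B\n";   // lands in the file as the bytes C9 BB 0A (on Linux)
}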
Disch,

Thanks a lot for your very detailed response. I can tell you know your stuff ;). Before responding to you, I did a little more digging around and found this post that you commented on.

http://www.cplusplus.com/forum/beginner/10057/

You mentioned that std::string is a joke when it comes to Unicode and that you ended up creating your own string class. May I ask why? I ended up creating my own String class too, but it uses std::string and std::wstring underneath. I've actually always used std::string since it became available in the STL, and everywhere I've read recommends using it. The reason for creating my own String class is to avoid exposing std::string publicly and to use wchar_t or char depending on whether UNICODE is specified.

Above you mentioned that using wchar_t in my case wouldn't matter. If I were to add ɻ to a Glyph and save to a file, I could use just a normal char? And would this work cross-platform on Win, Linux, and OSX? I know Linux and OSX treat std::string differently than Windows does. To my recollection, std::string can handle chars past 255, correct? Whereas Windows doesn't, and it's recommended to use std::wstring.
@Cubbi:

If that were standardized behavior, it would be great. Though as far as I can tell, it's implementation defined.


You mentioned that std::string is a joke when it comes to Unicode and that you ended up creating your own string class. May I ask why?


I was young. =P

Though really, at the time I was working with string data on an individual glyph basis. For which std::string is not suitable on its own. IE, the [] operator will not give you a glyph because a single glyph might consist of multiple chars. The size() and length() functions will not give you the length of the string in glyphs, but rather the length in chars ... which... if you want to know how many glyphs you have... is worthless information.

The string class I wrote back then addressed those two issues.
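
A quick illustration of what I mean (using the UTF-8 bytes for ɻ from earlier in the thread):

#include <iostream>
#include <string>

int main()
{
    // "aɻ" as UTF-8: 'a' is the byte 61, 'ɻ' is the two bytes C9 BB.
    std::string s = "a\xC9\xBB";
    std::cout << s.size() << '\n';   // prints 3 -- the length in chars, not in glyphs
    // s[1] is just the lone byte 0xC9, not the whole second glyph.
}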

I've actually always used std::string since it became available in the STL, and everywhere I've read recommends using it.


std::string is very good at what it does.

What it is... is a container for a group of characters that is blind to the encoding. For what I was doing at the time, I did not want my container to be blind... I wanted it to be aware of Unicode encodings and be able to process them automagically. Hence why I made my own.

Though for most purposes, std::string will do the job wonderfully. And you are correct for using it.

Above you mentioned that using wchar_t in my case wouldn't matter.


I shouldn't say wchar_t doesn't matter... since it is a wider type than a char. The overall message I was trying to illustrate was that wchar_t is not necessarily Unicode.

You can have Unicode-friendly code with or without wchar_t
and you can have Unicode-unfriendly code with or without wchar_t.

If I were to add ɻ to a Glyph and save to a file, I could use just a normal char?


Well again this comes down to encoding... IE, the string of bytes used to digitally represent that glyph.

If you have just a single char... then that char can only contain values 0x00 through 0xFF. So no, that would not be enough information to represent every Unicode codepoint.

Likewise... wchar_t on Windows is 16 bits wide... which means it has a range of 0x00 through 0xFFFF... which again... is not enough to represent every Unicode codepoint.

Your best bet is to have a 32-bit variable to hold the codepoint. At least in that 'Glyph' struct that you are writing to the file.
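
Something along these lines, for example (just a sketch; I'm using std::uint32_t here, but any fixed 32-bit type would do):

#include <cstdint>

// Represents a single character within a font.
struct Glyph
{
    std::uint32_t Codepoint;   // any Unicode codepoint fits in 32 bits
    float CharacterSpacing;
    float OffsetX;
    float OffsetY;
};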

And would this work cross-platform on Win, Linux, and OSX?


Unicode is the same everywhere. So yes it would be cross-platform.

The only thing you really have to be worried about is endian issues. (see: http://www.cplusplus.com/articles/oyhv0pDG/ for explanation)

Although note that wchar_t is not the same everywhere (it's 16-bit on Windows, but 32-bit on other platforms)... so it would not be portable to use it in that way.

I know Linux and OSX treat std::string differently than Windows does.


Technically, they don't. std::string is the same everywhere.

What's different is what the terminal/console/Windowing system does with the string data you give it. *nix tends to treat char strings as UTF-8. Windows does not.

Again remember that std::string is unaware/agnostic of the encoding of the string data it contains. All it knows is it has an array of chars. And characters... at their core... are just numbers.

The I/O mechanism (like the terminal) is what needs to take that string of numbers and turn them into glyphs.

So yes, Linux and OSX's terminal treat string data differently than Windows' terminal does. But the actual std::string class itself is the same on all platforms.

To my recollection, std::string can handle chars past 255, correct? Whereas Windows doesn't, and it's recommended to use std::wstring.


std::string is just an array of char's.
std::wstring is just an array of wchar_t's

On all the platforms mentioned... a char is 1 byte... which means 0xFF (255) is the highest value possible.

The thing to note with encodings like UTF-8 is that multiple chars can be joined together to form 1 glyph... depending on how the string data is interpreted.

So if you have a UTF-8 encoded string.. you can use a std::string to hold that data. Doing so will be portable no matter where you are.

The question is... if you give a UTF-8 string to something Windows-specific... does it recognize it as UTF-8? The answer is usually no... whereas on *nix, the answer is usually yes.


EDIT:

For your 'Glyph' struct that you are writing to a file... the encoding of the glyph does not matter here... since you are not giving it to an I/O device for it to be interpreted. You are merely writing the binary data to a file so you can read it later. Therefore any format you choose is fine and will be completely portable as long as it is consistent... and as long as it is capable of representing any codepoint.

Just, again... be wary of endian issues.
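
One common way to deal with that is to write each field byte-by-byte in a fixed order rather than dumping the whole struct at once. A rough sketch for one 32-bit field, stored little-endian on disk regardless of the host CPU:

#include <cstdint>
#include <ostream>

// Write a 32-bit value as 4 bytes, least significant byte first.
void WriteU32LE(std::ostream& out, std::uint32_t v)
{
    char bytes[4] = {
        static_cast<char>( v        & 0xFF),
        static_cast<char>((v >>  8) & 0xFF),
        static_cast<char>((v >> 16) & 0xFF),
        static_cast<char>((v >> 24) & 0xFF)
    };
    out.write(bytes, 4);
}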
@Cubbi:
If that were standardized behavior, it would be great. Though as far as I can tell, it's implementation defined.

Yes, I find it annoying that C++98 (and, really, C95, whose I/O model it inherited) allowed the OS vendor to decide which character encodings are supported... or that Microsoft decided that it won't support UTF-8 in their OS (except where strictly required by C++11... but then GNU decided that it won't support C++11's Unicode conversions... just use Boost).
Thanks Disch for the detailed explanation.

One question: since the OP is opening the file in binary mode, I would have thought that wofstream would not change the encoding.

To the original poster: if you have to worry about portability, then you need to worry about data alignment as well as endianness. At work, we write nearly all binary data using XDR encoding. It's old but fast, stable and well supported.
dhayden wrote:
I would have thought that wofstream would not change the encoding.

For every wchar_t that a wofstream takes from the programmer it writes one or more chars to the file. How would that function without "changing the encoding"?
I would just like to point out that the things everyone here is referring to as "glyphs" are actually closer to being characters. A glyph is an image used to represent a character visually.
@helios:

Yeah. I know it's not quite the right technical term. I just wanted to avoid the confusion of the association between a "character" and a char, which are two different things (since a character can consist of several chars).
For every wchar_t that a wofstream takes from the programmer it writes one or more chars to the file. How would that function without "changing the encoding"?

With an ofstream, the binary flag basically controls whether it's outputting characters or opaque bytes. I was wondering if wofstream did the same thing. From what you're saying, it sounds like binary just controls end-of-line encoding as it does with ofstream.

So really there's no good reason to use wofstream to write binary data, right?
The Unicode/UTF stuff is such a huge mess in C++, it makes you want to just give up. I'm still reluctant to touch it but I actually have to for one of my projects. I just wish it was better supported by C++ and by compiler/stdlib implementers.