Are wide-character strings not as important on *nix as on Windows?

The Windows API is implemented using Unicode strings internally, and Microsoft recommends always using Unicode strings in Windows applications.

After some brief experience with Linux, it seems that many or most libraries don't even accept wide-character strings in their APIs. I've also noticed some applications crashing when they encounter Unicode characters.

So what's the difference between Windows and Linux/Unix when it comes to Unicode? Should we use a Unicode library in C++ applications on Linux?
Most non-Windows OSes use UTF-8 encoding for Unicode.

The hysterical raisins are simply that MS made a business decision back before the Unicode standard was really standard. It was a good decision -- MS could support BMP Unicode long before anyone else could. The problem is that the rug was really pulled out from under their feet -- Unicode later grew beyond the BMP, so a single 16-bit unit can no longer hold every character.

Prefer using UTF-8 strings for everything and converting to UTF-16 when needed (for example, when interfacing with a WinAPI function).
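
To make the "convert when needed" step concrete, here is a minimal sketch of a UTF-8 to UTF-16 conversion in portable C++. The function name utf8_to_utf16 is made up for this example, and the validation is deliberately incomplete (it does not reject overlong forms or encoded surrogates), so treat it as an illustration rather than production code.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into code points and re-encodes
// them as UTF-16 code units. Sketch only: overlong forms and encoded
// surrogates are not rejected.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra >= in.size() - i)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp < 0x10000) {
            // Inside the BMP: one 16-bit code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else if (cp <= 0x10FFFF) {
            // Outside the BMP: split into a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            throw std::invalid_argument("code point out of range");
        }
    }
    return out;
}
```

On Windows itself you would more likely call MultiByteToWideChar with CP_UTF8 instead of hand-rolling this; the sketch is only meant to show what the conversion does.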

Hope this helps.
Be aware that UTF-8 or "Unicode" and wide-char are two different things.
By definition, UTF-8 uses one byte per character for everything that matches ASCII (code range 0 - 127). For all characters beyond that, UTF-8 uses two to four bytes.

For example, the string "50 °F" in UTF-8 is internally:

0x35 0x30 0x20 0xC2B0 0x46 0x00

Notice the 2 byte value in between the standard ASCII values.
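
If in doubt, the bytes can be checked with a few lines of C++. This is only a sketch: hex_bytes is a made-up helper, and the degree sign is written as explicit escapes (\xC2\xB0) so the result does not depend on how the source file is encoded.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats each byte of s as two uppercase hex digits,
// separated by spaces, in the order the bytes are stored.
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(c));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}

// "50 °F" as UTF-8; the literal is split so that 'F' is not read as part
// of the preceding hex escape.
const std::string fifty_degrees = "50 \xC2\xB0" "F";
```

hex_bytes(fifty_degrees) gives "35 30 20 C2 B0 46" (the terminating 00 is not counted in the string's size). UTF-8 is defined as a sequence of bytes, so C2 comes before B0 on any machine, whatever its endianness.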

Standard wide-char (w/o Unicode) uses ASCII + a code page that defines characters between 128 - 255 depending on your system settings:

0x0035 0x0030 0x0020 0x00B0 0x0046 0x0000

For the string above you'll get this same result if you use wide-char with unicode.
Microsoft calls this UCS-2 which is basically UTF-16.

Using UTF-8 AND wide char should give you (although I'm not entirely sure about that):

0x0035 0x0030 0x0020 0xC2B0 0x0046 0x0000



Someone please correct me if I'm wrong; the Unicode / wide-char thing is quite messy.
0x35 0x30 0x20 0xC2B0 0x46 0x00
More like
35 30 20 B0 C2 46 00

Standard wide-char (w/o Unicode) uses ASCII + a code page that defines characters between 128 - 255 depending on your system settings
What you're referring to here is extended ASCII, not wide characters.

Microsoft calls this UCS-2 which is basically UTF-16.
No, UCS-2 is a fixed-width encoding of the first 2^16 Unicode code points. UTF-16 is a variable-width encoding, just like UTF-8.
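
A small sketch of the difference, using C++11 char16_t literals (which are UTF-16). The function names here are invented for the example. A character outside the first 2^16 code points takes two 16-bit units in UTF-16 (a surrogate pair) and cannot be represented in UCS-2 at all.

```cpp
#include <cstddef>
#include <string>

// U+00B0 (degree sign) is inside the BMP: one 16-bit code unit.
const std::u16string degree = u"\u00B0";
// U+1F600 is outside the BMP: UTF-16 stores it as a surrogate pair.
const std::u16string emoji = u"\U0001F600";

bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper: counts code points rather than 16-bit code units,
// treating a high surrogate followed by a low surrogate as one character.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() &&
            is_low_surrogate(s[i + 1])) {
            ++i;  // skip the second half of the pair
        }
        ++n;
    }
    return n;
}
```

Here degree.size() is 1 while emoji.size() is 2, even though count_code_points returns 1 for both. Code that assumes "one wchar_t equals one character" is effectively assuming UCS-2.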

0x0035 0x0030 0x0020 0xC2B0 0x0046 0x0000
More like (assuming little endian system)
35 00 30 00 20 00 B0 00 C2 00 46 00 00 00
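
For comparison, here is a sketch of how the UTF-16 form of "50 °F" is laid out as little-endian bytes. The helper utf16le_hex is made up for this illustration; it writes the low byte of each code unit first, so the output is the same no matter what machine it runs on.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: serializes UTF-16 code units as little-endian bytes
// (low byte first) and formats them as space-separated hex.
std::string utf16le_hex(const std::u16string& s) {
    std::string out;
    char buf[8];
    for (char16_t u : s) {
        const unsigned lo = static_cast<unsigned>(u) & 0xFF;
        const unsigned hi = (static_cast<unsigned>(u) >> 8) & 0xFF;
        std::snprintf(buf, sizeof buf, "%02X %02X", lo, hi);
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

utf16le_hex(u"50 \u00B0F") gives "35 00 30 00 20 00 B0 00 46 00", plus 00 00 for the terminator in memory. The degree sign is the single code unit 0x00B0; the UTF-8 bytes C2 and B0 never appear in a wide string, because UTF-8 and UTF-16 are two separate encodings of the same code point.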
More like 35 30 20 B0 C2 46 00

Oh right, I forgot about the endianness.


No, UCS-2 is a fixed-width encoding of the first 2^16 Unicode code points. UTF-16 is a variable-width encoding, just like UTF-8.

I know. I thought my post was confusing enough, already.
