Wide strings vs UTF-16 strings

I've been needing to deal with the world of Unicode recently and am confused about a few things. For one, what's the difference between Microsoft's Unicode, UTF-16, C++'s wide character strings, and C++'s UTF-16 character strings?

Which of these string literals should I use for MS Unicode?
L"Wide string literal";
u"UTF-16 string literal";


What's the proper portable way to convert between MS Unicode/Wide character strings, UTF-16, and UTF-8?

The world of Unicode seems like a mess to me when the Windows API is involved, so anything to help explain away the confusion would be nice.
From what I have gathered, for WinAPI the preferable type to use is TCHAR, which can be either wide or narrow depending on your compilation settings. From memory, it is either multibyte (normal character strings, i.e. "string") or wide char (not UTF, i.e. L"string"). You then have string literals defined by the TEXT macro.

So, Microsoft's Unicode == C++'s wide character strings, UTF16 == C++'s UTF16. Using Unicode with the C++ standard library can be done with the 'w' functions (e.g. wcout, wprintf, wstring, etc.), but the other UTF formats require using facilities such as <codecvt> and <locale>.
Yes, that was what I gathered as well, but it's still incomplete to me. What exactly is the difference between a wide character string and a UTF-16 character string, and why? A history lesson would be nice, I suppose, but that may be too much to ask.
For characters in the Basic Multilingual Plane, UTF-16 is identical to UCS-2.

The size of wchar_t is implementation-defined.
GCC and Clang define it as a 32-bit type (UCS-4) on Linux and most Unix-like targets, designed to hold text in UTF-32 encoding.
Microsoft compilers define it as a 16-bit type (UCS-2), which can be used to hold text in UTF-16 encoding.
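To see which camp a given compiler falls into, a minimal sketch along these lines may help. The helper names `bit_width_of` and `wchar_needs_surrogates` are made up for illustration, and the 21-bit threshold is simply the number of bits needed for the highest Unicode code point, U+10FFFF.

```cpp
#include <climits>
#include <cstddef>

// Width of a character type in bits on the current implementation.
// (Hypothetical helper, not part of any library.)
template <typename CharT>
constexpr std::size_t bit_width_of() {
    return sizeof(CharT) * CHAR_BIT;
}

// Unicode code points go up to U+10FFFF, which needs 21 bits.
// If wchar_t is narrower than that, one wchar_t cannot hold every
// code point, so text outside the BMP needs surrogate pairs (UTF-16).
constexpr bool wchar_needs_surrogates() {
    return bit_width_of<wchar_t>() < 21;
}
```

Printing `bit_width_of<wchar_t>()` should show the 16-bit versus 32-bit split described above, and `wchar_needs_surrogates()` can be used in a `static_assert` if code relies on one or the other.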

For portable representation of text, use either char16_t (prefix: u) or char32_t (prefix: U).
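As a rough illustration of the two prefixes, here is a sketch that stores the same character, U+1F600 (outside the Basic Multilingual Plane), both ways. The function names are hypothetical, and `count_code_points` assumes well-formed UTF-16 input; it does no validation.

```cpp
#include <cstddef>
#include <string>

// The same single character, U+1F600, in both portable encodings.
inline std::u16string sample_utf16() { return u"\U0001F600"; }  // surrogate pair
inline std::u32string sample_utf32() { return U"\U0001F600"; }  // one code unit

// Count code points in a UTF-16 string by skipping low surrogates
// (0xDC00..0xDFFF), which are always the second half of a pair.
// Assumes the input is well-formed UTF-16.
std::size_t count_code_points(const std::u16string& text) {
    std::size_t count = 0;
    for (char16_t unit : text) {
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++count;
        }
    }
    return count;
}
```

Here `sample_utf16().size()` is 2 while `sample_utf32().size()` is 1, which is the practical difference between UTF-16 and UCS-2: once you leave the BMP, one character no longer means one 16-bit unit.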


> What's the proper portable way to convert between MS Unicode/Wide character strings, UTF-16, and UTF-8?

http://en.cppreference.com/w/cpp/locale/wstring_convert
http://en.cppreference.com/w/cpp/locale/wbuffer_convert (for streams)

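Note: Portable among conforming implementations (for instance, GCC did not have these yet at the time of writing; libstdc++ added them in GCC 5). Both facilities were later deprecated in C++17.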
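For what it's worth, a minimal sketch of std::wstring_convert for UTF-16 and UTF-8 could look like this. The function names `utf16_to_utf8` and `utf8_to_utf16` are my own, and it assumes a standard library that ships <codecvt> (deprecated since C++17, so a compiler may warn).

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-16 -> UTF-8. to_bytes throws std::range_error on malformed input
// because no error string was passed to the converter's constructor.
std::string utf16_to_utf8(const std::u16string& input) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    return converter.to_bytes(input);
}

// UTF-8 -> UTF-16, with the same error behaviour via from_bytes.
std::u16string utf8_to_utf16(const std::string& input) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    return converter.from_bytes(input);
}
```

Using char16_t rather than wchar_t here keeps the behaviour the same on every platform, since the element type no longer depends on how wide wchar_t happens to be.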


The C and C++ standards require wchar_t to be wide enough to hold any code point of the largest character set the implementation supports (that's 32 bits if you're talking Unicode). Microsoft settled on a 16-bit wchar_t back in 1995-ish, I guess, when Unicode was still 16-bit, and they never widened it.

Today, C++ has four kinds of strings:
std::string (can store ASCII, ISO 8859-x, UTF-8, GB18030, or any other single-byte or multibyte encoding as long as the storage format uses bytes)
std::u16string (can store utf-16, ucs2, or any other 16-bit encoding)
std::u32string (can store utf-32/ucs4 or any other 32-bit encoding if any other one exists)
std::wstring (was *supposed* to store ucs4 or any other 32-bit encoding, and does so on Linux/Unix, but actually stores utf-16 on Windows for backwards compatibility reasons)

I haven't programmed on Windows, but as far as I know, L"Wide string literal"; is what every Windows API expects in so-called "Unicode" mode.

> What's the proper portable way to convert between MS Unicode/Wide character strings, UTF-16, and UTF-8?

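MS wide strings are already UTF-16LE. To go to UTF-8, you can use the Windows API function WideCharToMultiByte (with CP_UTF8) or C++11's std::codecvt_utf8<wchar_t> (very easy to use if wrapped in std::wstring_convert; Visual Studio has supported that since VS 2010). Note that with a 16-bit wchar_t, std::codecvt_utf8 treats the input as UCS-2, so text that may contain surrogate pairs calls for std::codecvt_utf8_utf16<wchar_t> instead.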
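A sketch of the standard-library route for wide strings, written so that the same source compiles whether wchar_t is 16 or 32 bits. The alias `wide_utf8_facet` and the function names are hypothetical. Picking the facet by `sizeof(wchar_t)` is my own assumption about how to keep it portable, not something the standard prescribes.

```cpp
#include <codecvt>
#include <locale>
#include <string>
#include <type_traits>

// Choose the conversion facet from the width of wchar_t:
//  - 16-bit wchar_t (e.g. Windows): treat wide text as UTF-16.
//  - wider wchar_t (e.g. Linux): treat wide text as UCS-4/UTF-32.
using wide_utf8_facet = std::conditional<
    sizeof(wchar_t) == 2,
    std::codecvt_utf8_utf16<wchar_t>,
    std::codecvt_utf8<wchar_t>>::type;

// Wide string -> UTF-8; throws std::range_error on invalid input.
std::string wide_to_utf8(const std::wstring& input) {
    std::wstring_convert<wide_utf8_facet, wchar_t> converter;
    return converter.to_bytes(input);
}

// UTF-8 -> wide string; throws std::range_error on invalid input.
std::wstring utf8_to_wide(const std::string& input) {
    std::wstring_convert<wide_utf8_facet, wchar_t> converter;
    return converter.from_bytes(input);
}
```

On Windows, the result of `utf8_to_wide` can then be passed to the "W" flavour of an API function via `.c_str()`. WideCharToMultiByte remains the native alternative if you'd rather not depend on <codecvt>, which has been deprecated since C++17.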
Thanks guys, that clears things up very nicely.