Unicode strings

Okay, so I need to use Unicode strings and I wanted some advice. Which of these is best?
1. std::wstring.
2. std::basic_string<uint16_t,std::char_traits<uint16_t> >.
3. Third party library.

I would really like to use 2, since it will have the same size on any platform, but the fact that I can't directly assign a string literal to it is somewhat annoying. If someone could give me the prototype for the conversion operator (if such a thing is even possible), that'd be great. I can't figure it out.
As for 3, some suggestions would be useful. The closer the interface is to the standard strings, the better. Keep in mind that the library should be small, so QString is right out.

My original plan was to derive from std::basic_string<uint16_t,std::char_traits<uint16_t> > and add some useful functions, but then I read that the class wasn't designed to be inherited from, and I also ran into a compiler error that didn't make any sense, so I gave up.
Use std::wstring. That's what it's there for. It uses 32-bit characters on every platform I have used.

Converting from string to wstring is a conversion between ASCII (assuming the C locale) and Unicode. You will want libiconv for that.
VC++'s xstring:
typedef basic_string<wchar_t, char_traits<wchar_t>,
	allocator<wchar_t> > wstring;
This particular implementation has 16-bit wchar_ts. My problem is that wchar_t is not guaranteed to be any particular size. An 8-bit wchar_t is useless, and a 32-bit wchar_t is a waste, since I haven't written any encoding conversion functions to handle characters beyond U+FFFF, so the upper 16 bits would never be used.

> Converting from string to wstring is a conversion between ASCII (assuming the C locale) and Unicode. You will want libiconv for that.
You do realize that Unicode is backwards compatible with ASCII, right? ASCII->Unicode conversion is merely a character-by-character copy.
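In code, that copy is just this (a sketch; the ustring typedef and the widen() helper are illustrative names, not anything standard):

#include <stdint.h>
#include <string>

typedef std::basic_string<uint16_t, std::char_traits<uint16_t> > ustring;

// ASCII -> UCS-2: each 7-bit character widens to one 16-bit code unit.
ustring widen(const std::string& in)
{
	ustring out;
	out.reserve(in.size());
	for (std::string::size_type i = 0; i < in.size(); ++i)
		out += static_cast<uint16_t>(static_cast<unsigned char>(in[i]));
	return out;
}

// Usage: ustring s = widen("hello");  // covers string literals, too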
> You do realize that Unicode is backwards compatible with ASCII, right? ASCII->Unicode conversion is merely a character-by-character copy.

Yes, I do know how to convert between ASCII and various Unicode formats. Describing it as "a character-by-character copy" is a dangerous simplification, because those unfamiliar with the topic might be tempted to interpret that as byte for byte.

libiconv will convert between almost any two character encodings, such as between 8-bit formats (e.g. ASCII, ISO 8859-??, UTF-8) and UCS-2 or UCS-4. No need to write your own code conversion routines.
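For reference, driving libiconv looks roughly like this (a sketch for the ASCII -> UCS-2 case; real code should check the return values and errno, and loop on E2BIG for large inputs):

#include <iconv.h>
#include <stdint.h>
#include <string>
#include <vector>

typedef std::basic_string<uint16_t, std::char_traits<uint16_t> > ustring;

ustring ascii_to_ucs2(const std::string& in)
{
	if (in.empty())
		return ustring();

	iconv_t cd = iconv_open("UCS-2LE", "ASCII");	// to-encoding comes first

	std::vector<char> buf(in.size() * 2);		// 2 bytes per ASCII char
	char* src = const_cast<char*>(in.data());	// iconv wants char**
	char* dst = &buf[0];
	size_t src_left = in.size(), dst_left = buf.size();

	iconv(cd, &src, &src_left, &dst, &dst_left);
	iconv_close(cd);

	size_t written = buf.size() - dst_left;		// bytes actually produced
	return ustring(reinterpret_cast<const uint16_t*>(&buf[0]), written / 2);
}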
For the record, I was talking about converting code pages, not encodings. ASCII (code page) to Unicode (code page) really is a character-by-character copy.

My problem wasn't with conversion. I wrote the routines like a year ago. Reinventing the wheel, yes, but interesting, and it saved adding another dependency to the project. My favorite was Shift JIS. You haven't done code page conversion until you've put together two 65536-element conversion tables.
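For the curious, the table-driven part looks something like this (a sketch; sjis_to_unicode is a hypothetical 65536-element table that would be built offline from the published Shift JIS mapping):

#include <stdint.h>
#include <string>

typedef std::basic_string<uint16_t, std::char_traits<uint16_t> > ustring;

extern const uint16_t sjis_to_unicode[65536];	// hypothetical, built offline

ustring sjis_to_uni(const std::string& in)
{
	ustring out;
	for (std::string::size_type i = 0; i < in.size(); ++i) {
		unsigned char b = static_cast<unsigned char>(in[i]);
		uint16_t key = b;
		// Lead bytes 0x81-0x9F and 0xE0-0xEF begin a two-byte sequence.
		if (((b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xEF))
		    && i + 1 < in.size())
			key = static_cast<uint16_t>((key << 8)
			      | static_cast<unsigned char>(in[++i]));
		out += sjis_to_unicode[key];
	}
	return out;
}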

Anyway, I eventually went with the second method. I don't use literals all that much, anyway, and when I do use them I can use UniFromISO88591(std::string("literal")).
...
I think I'll overload that.
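(A sketch of that overload, assuming the existing UniFromISO88591(const std::string&) returns the basic_string<uint16_t> typedef -- ustring here:)

#include <stdint.h>
#include <string>

typedef std::basic_string<uint16_t, std::char_traits<uint16_t> > ustring;

ustring UniFromISO88591(const std::string& s);	// the existing routine

inline ustring UniFromISO88591(const char* s)	// literal-friendly overload
{
	return UniFromISO88591(std::string(s));
}

// Now UniFromISO88591("literal") just works.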
No modern compiler will have wchar_t at less than four bytes -- two at the minimum for older compilers.

Unfortunately, the STL iostreams make no graceful attempt to handle string classes other than those directly compatible with string and wstring. So I would stick with wstring -- just make sure you are compiling with wchar_t defined to the correct size. GCC at least gives you options (-fshort-wchar, for instance, forces a 16-bit wchar_t). I think other compilers do too... (but I could be wrong).

Even so, the STL wide streams don't actually use Unicode -- they just ostream::narrow() everything that comes their way...

Alas, Unicode in C++ is still a bit of a black art.
http://www.cplusplus.com/forum/general/3722/

"
A very good UTF-8 library is ICU http://site.icu-project.org/

There are also some nice little UTF-8 handling libraries that various people do:
http://utfcpp.sourceforge.net/
http://www.codeproject.com/KB/string/utf8cpp.aspx
http://www.gnu.org/software/libidn/

Hope this helps.
"
http://www.cplusplus.com/forum/windows/9797/page1.html#msg45628

Good luck!
> No modern compiler will have wchar_t at less than four bytes -- two at the minimum for older compilers.
Well, I just printed sizeof(wchar_t) with VC++ 2008 and got this:
2

I/O is not a problem for me, since all my files are opened as binary and anything that doesn't fit in 7 bits is automagically converted to UTF-8. Unless requested through the command line, there's no output to the console. Finally, the largest volume of text output, if you can call it that, goes to graphics, so I don't need to concern myself with such details as the size of my characters.
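That "doesn't fit in 7 bits" step is small enough to show (a sketch for code points up to U+FFFF, which need one to three UTF-8 bytes):

#include <stdint.h>
#include <string>

void append_utf8(std::string& out, uint16_t cp)
{
	if (cp < 0x80) {			// 7 bits: stored as-is
		out += static_cast<char>(cp);
	} else if (cp < 0x800) {		// 8-11 bits: two bytes
		out += static_cast<char>(0xC0 | (cp >> 6));
		out += static_cast<char>(0x80 | (cp & 0x3F));
	} else {				// 12-16 bits: three bytes
		out += static_cast<char>(0xE0 | (cp >> 12));
		out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
		out += static_cast<char>(0x80 | (cp & 0x3F));
	}
}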

ICU looks nice, though. I may use it next time I need to do conversion instead of rolling my own.

Yeah, I think I'll rewrite it for std::wstring. I already had a macro that checked the size of wchar_t at compile time to make sure it's at least 16 bits wide, and a little wasted memory is no big deal.
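(Such a check is tiny, by the way -- a sketch of one pre-C++11 way to do it, where a negative array size forces the compile error:)

#include <climits>

// Fails to compile when wchar_t is narrower than 16 bits.
typedef char wchar_t_is_wide_enough[sizeof(wchar_t) * CHAR_BIT >= 16 ? 1 : -1];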
> Well, I just printed sizeof(wchar_t) with VC++ 2008 and got this:
> 2
LOL. Well, most people using Unicode can still survive in 16-bit worlds... I guess.

Glad you got it working.
It's not necessarily 16 bits... Unicode can be 32 bits as well, for some Russian or Japanese character sets.
Unicode is defined on 21 bits -- 0x000000..0x10FFFF -- which translates to needing a 32-bit word in computing terms.

However, the Basic Multilingual Plane only needs 16 bits -- 0x0000..0xFFFF. That is enough for all Russian, BTW. People use the supplementary planes mostly for Eastern language support...
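To make the plane arithmetic concrete, here is how UTF-16 squeezes those upper planes into 16-bit units (a sketch; code points above U+FFFF become a surrogate pair):

#include <stdint.h>
#include <string>

void append_utf16(std::basic_string<uint16_t>& out, uint32_t cp)
{
	if (cp < 0x10000) {		// BMP: a single code unit
		out += static_cast<uint16_t>(cp);
	} else {			// supplementary planes: two units
		cp -= 0x10000;		// 20 bits remain
		out += static_cast<uint16_t>(0xD800 | (cp >> 10));	// high surrogate
		out += static_cast<uint16_t>(0xDC00 | (cp & 0x3FF));	// low surrogate
	}
}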
It's worth noting that Unicode's code space is a subset of the Universal Character Set of ISO/IEC 10646, whose UCS-4 form was originally defined on 31 bits.
IINM, all kanji and kana used by Japanese are inside the BMP, and I'm certain all those used in everyday text (which number around 9000, I think. Memetic reference not intended) are inside the BMP.