Help with UNICODE?

Hello everybody,
I'm developing a simple notepad-like application, using only WinAPI.
I've added some macro checking to compile for Unicode or ASCII without changing the source code. When I load a file in ASCII mode, I don't get any errors, but when I do the same in Unicode mode, I get Chinese-like symbols (the file I load is plain ASCII).

If you have any suggestions, please let me know.

Best regards,
dp1
When I load a file in ASCII mode, I don't have any error, but when I do the same in unicode mode, I get chinese-like symbols


I think you're misunderstanding what "Unicode mode" really means. In Visual Studio, all it really means is that the wide versions of WinAPI functions are used by default. That's it. For you to actually use them properly, you must give them proper UTF-16 strings.

If you're just taking an ASCII string and expecting it to be UTF-16... it isn't. And it will get interpreted strangely.

For example:

// say you have this in an ASCII text file

hello!

// this is 6 bytes long, one byte for each character.  Each byte would be the
//  ASCII code for each character.  It would look like this in a hex editor:

68 65 6C 6C 6F 21

// where 0x68 is the ASCII code for 'h', 0x65 is the code for 'e', etc

// the problem you're having is that you are interpreting the file as UTF-16.  UTF-16
//  has 2 bytes per character (it actually can be up to 4 bytes, but don't worry about that
//  for now -- it's usually only 2)
//
// That same binary data:

68 65 6C 6C 6F 21

// interpreted as UTF-16 (little endian):

6568  6C6C  216F

// is only 3 characters long.  And those 3 characters are:
//   U+6568 ( 敨 )
//   U+6C6C  ( 汬 )
//   U+216F  ( Ⅿ )
//
//  As you can see, this text is nothing at all like "hello" 
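
To make the byte pairing above concrete, here is a small portable sketch (no WinAPI needed). The function name `bytes_as_utf16le` is made up for this example; it just glues every two bytes together the way a UTF-16 little-endian reader would.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pair up raw bytes as little-endian 16-bit code units, the way a
// UTF-16LE reader would. A trailing odd byte is ignored in this sketch.
std::vector<std::uint16_t> bytes_as_utf16le(const std::vector<std::uint8_t>& bytes) {
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        // In little-endian order the low byte comes first, the high byte second.
        units.push_back(static_cast<std::uint16_t>(bytes[i] | (bytes[i + 1] << 8)));
    }
    return units;
}
```

Feeding it the six bytes of "hello!" (68 65 6C 6C 6F 21) gives three code units: 0x6568, 0x6C6C and 0x216F, the same three characters listed above.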



If the text is in ASCII and you want to give it to a function that expects UTF-16... you need to convert the text data. Simply changing the compiler setting does not do this automatically... you have to write code to do it.
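
For plain ASCII the conversion itself is trivial, because every ASCII code is also the Unicode code point for the same character: each byte simply becomes one 16-bit unit. A minimal portable sketch, assuming the input really is 7-bit ASCII (`ascii_to_utf16` is a hypothetical helper name, not a WinAPI function):

```cpp
#include <string>

// Hypothetical helper: widen 7-bit ASCII text to UTF-16 code units.
// Returns false (and leaves `out` empty) if any byte is outside ASCII,
// because such bytes need a real code-page or UTF-8 decoder.
bool ascii_to_utf16(const std::string& in, std::u16string& out) {
    out.clear();
    for (unsigned char c : in) {
        if (c > 0x7F) {
            out.clear();
            return false;
        }
        // An ASCII code and its Unicode code point are the same number.
        out.push_back(static_cast<char16_t>(c));
    }
    return true;
}
```

Anything beyond ASCII (accented letters, UTF-8 multi-byte sequences, other code pages) is where you want a proper conversion function instead of rolling your own.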

WinAPI has conversion functions but I can't for the life of me remember what they are because I hardly ever use them. I'll see if I can look them up...


EDIT: also that Unicode compiler setting is stupid. If you want to use the UTF-16 version of functions, just call them directly. Each function which accepts strings in WinAPI has 3 forms:

normal form: MessageBox (takes TCHAR strings -- a TCHAR is either a char or a wchar_t depending on that stupid Unicode setting). It's worth noting that TCHARs are incredibly stupid and you probably shouldn't be using them, which means you probably should never be calling the normal form of any WinAPI function that takes strings.

ansi form: MessageBoxA (takes char strings)

wide form: MessageBoxW (takes wchar_t strings, accepts UTF-16 text)
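
To show how the "normal" form gets picked, here is a rough portable imitation of what the Windows headers do. Everything with `Sketch` or `CountChars` in its name is invented for this example; the real headers use `TCHAR`, `TEXT()` and the actual function names, but the mechanism (a macro selected by `UNICODE`) is the same idea.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Made-up narrow/wide pair standing in for e.g. MessageBoxA / MessageBoxW.
std::size_t CountCharsA(const char* text)    { return std::strlen(text); }
std::size_t CountCharsW(const wchar_t* text) { return std::wcslen(text); }

// One form is chosen per build, depending on whether UNICODE is defined.
#ifdef UNICODE
typedef wchar_t TCHAR_SKETCH;          // plays the role of TCHAR
#define CountChars CountCharsW         // "normal" name maps to the W form
#define TEXT_SKETCH(s) L##s            // plays the role of TEXT()
#else
typedef char TCHAR_SKETCH;
#define CountChars CountCharsA         // "normal" name maps to the A form
#define TEXT_SKETCH(s) s
#endif
```

So `CountChars(TEXT_SKETCH("hello!"))` compiles either way, but which function actually runs depends on a build setting rather than on anything visible at the call site.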


So if you want Unicode support, you'll probably need to always use the 'W' functions and just deal with wide characters all the time. Either that or set the code page to UTF-8 and use the A forms, giving them UTF-8 strings... but I don't know if that actually works or not.




EDIT 2:

Found the conversion functions:

http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx

You would want to use CP_UTF8 as the code page.

Untested example:

char ascii[100] = "This is a test string.";

wchar_t utf16[100];

int widelength = MultiByteToWideChar( CP_UTF8, 0, ascii, -1, utf16, 100 );

// output the ASCII text
MessageBoxA( NULL, ascii, NULL, MB_OK );

// output the UTF-16 text (should be identical)
MessageBoxW( NULL, utf16, NULL, MB_OK );
closed account (ozUkoG1T)
Well, EDIT 2 is the most effective approach. Also, when developing an application like this, keep in mind that the people using it will probably have a different keyboard layout from yours, so always consider Unicode as well.

Disch wrote:
normal form: MessageBox (takes TCHAR strings -- a TCHAR is either a char or a wchar_t depending on that stupid Unicode setting). It's worth noting that TCHARs are incredibly stupid and you probably shouldn't be using them, which means you probably should never be calling the normal form of any WinAPI function that takes strings.

What's so stupid about it??
Using it as it's intended to be used results in incredibly obfuscated code.

The whole point of TCHAR is that they can be either char or wchar_t depending on that setting, and you should write your code to allow for either case.

The benefit of that is you can compile your code without Unicode support if it isn't needed, but just flip a switch and recompile if you do need Unicode support. However I don't see why it'd be beneficial to ever build without Unicode support (does it make for a smaller binary? I know it doesn't improve performance).

If you're going to take the effort to support Unicode, you might as well just support it and not be wishy-washy about it.

To write proper TCHAR code you have to be constantly aware of the TCHAR's variable size and meaning. Even something as relatively straightforward as expanding a UTF-8 string to UTF-16 becomes extra work as a result:

const char* userstring = "some 8-bit string data that came from a text file or something";

TCHAR buffer[1000];

// strcpy() will only work if TCHAR==char
// MultiByteToWideChar() will only work if TCHAR==wchar_t
//
// so to properly assign 'buffer' you have to #ifdef it
#if defined(UNICODE) || defined(_UNICODE)
MultiByteToWideChar( CP_UTF8, 0, userstring, -1, buffer, 1000 );
#else
strcpy( buffer, userstring );
#endif

SetWindowText( somewnd, buffer );



Of course the smarter thing to do here would be to have some kind of "toTchar" function which does this so you don't have to embed the #ifdefs in your actual code. But the #ifdefs still have to be there.
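
A rough sketch of what such a "toTchar" helper might look like. The `tstring` alias is an assumption made for this example, and the wide branch only builds on Windows because it calls MultiByteToWideChar (first to ask for the required length, then to do the actual conversion). Treat it as an illustration, not tested production code.

```cpp
#include <string>

#if defined(UNICODE) || defined(_UNICODE)
#include <windows.h>
typedef std::wstring tstring;   // hypothetical alias: TCHAR is wchar_t here
#else
typedef std::string tstring;    // hypothetical alias: TCHAR is char here
#endif

// Hypothetical helper: turn 8-bit UTF-8 text into a TCHAR-based string.
tstring toTchar(const std::string& utf8) {
#if defined(UNICODE) || defined(_UNICODE)
    if (utf8.empty()) return tstring();
    // First call: a zero-sized output buffer asks for the required length.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), NULL, 0);
    if (len <= 0) return tstring();   // conversion failed
    tstring out(len, L'\0');
    // Second call: convert into the correctly sized buffer.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &out[0], len);
    return out;
#else
    return utf8;   // narrow build: nothing to convert
#endif
}
```

The calling code then shrinks to something like `SetWindowText( somewnd, toTchar(userstring).c_str() );`, but as noted, the #ifdefs have only moved, not disappeared.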

Compare that to just supporting Unicode unconditionally:

const char* userstring = "some 8-bit string data that came from a text file or something";

wchar_t buffer[1000];
MultiByteToWideChar( CP_UTF8, 0, userstring, -1, buffer, 1000 );

SetWindowTextW( somewnd, buffer );


So much simpler.