When I load a file in ASCII mode, I don't get any errors, but when I do the same in Unicode mode, I get Chinese-like symbols.
I think you're misunderstanding what "Unicode mode" really means. In Visual Studio, all it really means is that the wide versions of WinAPI functions are used by default. That's it. For you to actually use them properly, you must give them proper UTF-16 strings.
If you're just taking an ASCII string and expecting it to be UTF-16... it isn't. And it will get interpreted strangely.
For example:
// say you have this in an ASCII text file
hello!
// this is 6 bytes long, one byte for each character. Each byte would be the
// ASCII code for each character. It would look like this in a hex editor:
68 65 6C 6C 6F 21
// where 0x68 is the ASCII code for 'h', 0x65 is the code for 'e', etc
// the problem you're having is that you are interpreting the file as UTF-16. UTF-16
// has 2 bytes per character (it actually can be up to 4 bytes, but don't worry about that
// for now -- it's usually only 2)
//
// That same binary data:
68 65 6C 6C 6F 21
// interpreted as UTF-16 (little endian):
6568 6C6C 216F
// is only 3 characters long. And those 3 characters are:
// U+6568 ( 敨 )
// U+6C6C ( 汬 )
// U+216F ( Ⅿ )
//
// As you can see, this text is nothing at all like "hello"
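To make the byte pairing above concrete, here is a minimal portable sketch that combines raw bytes into little-endian 16-bit code units. The function name `bytes_as_utf16le` is made up for this illustration and is not part of WinAPI.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: combine consecutive byte pairs into 16-bit code units,
// low byte first (little endian). A trailing odd byte is ignored.
std::vector<std::uint16_t> bytes_as_utf16le(const std::vector<std::uint8_t>& bytes)
{
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2)
    {
        // the second byte of each pair becomes the high 8 bits
        units.push_back(static_cast<std::uint16_t>(bytes[i] | (bytes[i + 1] << 8)));
    }
    return units;
}
```

Feeding it the six bytes of "hello!" (68 65 6C 6C 6F 21) gives the three code units 0x6568, 0x6C6C and 0x216F, which is where the Chinese-looking characters come from.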
If the text is in ASCII and you want to give it to a function that expects UTF-16... you need to convert the text data. Simply changing the compiler setting does not do this automatically... you have to write code to do it.
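For pure ASCII the conversion is simple enough to sketch by hand, because every ASCII code is also the Unicode code point of the same character: each byte just gets widened to a 16-bit unit. The helper name `ascii_to_utf16` is hypothetical, and this only handles 7-bit ASCII; anything else needs a real conversion function.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: widen each ASCII byte to one UTF-16 code unit.
// Throws if the input contains a byte outside the 7-bit ASCII range,
// since such bytes cannot be converted this way.
std::u16string ascii_to_utf16(const std::string& ascii)
{
    std::u16string wide;
    wide.reserve(ascii.size());
    for (unsigned char byte : ascii)
    {
        if (byte > 0x7F)
            throw std::invalid_argument("input is not pure ASCII");
        wide.push_back(static_cast<char16_t>(byte));
    }
    return wide;
}
```

With this, "hello!" becomes six code units (0068 0065 006C 006C 006F 0021) instead of being squeezed into three.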
WinAPI has conversion functions but I can't for the life of me remember what they are because I hardly ever use them. I'll see if I can look them up...
EDIT: also that Unicode compiler setting is stupid. If you want to use the UTF-16 version of functions, just call them directly. Each function which accepts strings in WinAPI has 3 forms:
normal form: MessageBox (takes TCHAR strings -- a TCHAR is either a char or a wchar_t depending on that stupid Unicode setting). It's worth noting that TCHARs are incredibly stupid and you probably shouldn't be using them, which means you probably should never be calling the normal form of any WinAPI function that takes strings.
ansi form: MessageBoxA (takes char strings)
wide form: MessageBoxW (takes wchar_t strings, accepts UTF-16 text)
So if you want Unicode support, you'll probably need to always use the 'W' functions and just deal with wide characters all the time. Either that or set the code page to UTF-8 and use the A forms, giving them UTF-8 strings... but I don't know if that actually works or not.
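To show what that compiler setting is doing, here is a rough sketch of the macro trick, using made-up names so it compiles anywhere. `TextLengthA`, `TextLengthW`, `DEMO_TCHAR`, `DEMO_TEXT` and `DEMO_UNICODE` are all hypothetical stand-ins; the real Windows headers do something along these lines with `MessageBoxA`, `MessageBoxW`, `TCHAR` and the `UNICODE` define.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for an A/W function pair.
std::size_t TextLengthA(const char* text)    { return std::strlen(text); }
std::size_t TextLengthW(const wchar_t* text) { return std::wcslen(text); }

// The "normal form" is only a macro that picks one of the two above,
// depending on whether the (hypothetical) DEMO_UNICODE symbol is defined.
#ifdef DEMO_UNICODE
    typedef wchar_t DEMO_TCHAR;
    #define TextLength TextLengthW
    #define DEMO_TEXT(s) L##s
#else
    typedef char DEMO_TCHAR;
    #define TextLength TextLengthA
    #define DEMO_TEXT(s) s
#endif
```

Nothing here converts any data: the setting only changes which function name and character type get picked at compile time, which is why calling the A or W form directly is clearer.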
EDIT 2:
Found the conversion functions:
http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx
You would want to use CP_UTF8 as the code page.
Untested example:
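char ascii[100] = "This is a test string.";
wchar_t utf16[100];
// -1 means the input is null-terminated; the return value is the number of
// wide characters written (including the terminator), or 0 on failure
int widelength = MultiByteToWideChar( CP_UTF8, 0, ascii, -1, utf16, 100 );
// output the ASCII text
MessageBoxA( NULL, ascii, NULL, MB_OK );
// output the UTF-16 text (should be identical), but only if the conversion worked
if( widelength > 0 )
    MessageBoxW( NULL, utf16, NULL, MB_OK );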