Win32 unicode/multibyte character set compilation issue

Hi, there:

Simple code as such:
#include <windows.h>

int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nShowCmd)
{
	::MessageBox(NULL, "GOOD", "NOTE", MB_OK);
	return 0;
}


will not compile under Visual Studio if we set the "character set" field (Project->property->configuration property) to unicode.

To solve the problem, we either change the "character set" field to multibyte
or
we use a more general _T() method. (example link: http://www.pcreview.co.uk/forums/multi-byte-characters-t3136613.html)


My question is, however, listed below:
If we set the "character set" field in project setting to unicode, the compilation error I got is this:
"error C2664: 'MessageBoxW' : cannot convert parameter 2 from 'const char [5]' to 'LPCWSTR'"

Okay, that means, the second parameter that MessageBox() expects is of type LPCWSTR. Since we are using unicode character set, "GOOD" is represented in unicode. Blam, type mismatch, compilation error ... everything looks reasonable.

But, in fact, LPCWSTR is a "32-bit pointer to a constant string of 16-bit Unicode characters, which MAY be null-terminated."(http://msdn.microsoft.com/en-us/library/cc230352(PROT.10).aspx)


Things start to get logically wrong (here's my confusion):
if MessageBox() is expecting a pointer to unicode characters
AND
"GOOD" is actually represented in unicode,
the compiler should NOT give me an error.

In my opinion, if we set the character set to "multi-byte" and get this error, things make sense. But the reality is the opposite of what I think is reasonable.

Can anyone explain to me why it is and what is wrong about my understanding?

Thanks a lot, and I'm looking forward to an answer.

If you don't #include <tchar.h>, any code that uses the tchar declarations (such as _T) is going to give you compiler errors. I don't know what you mean by "AND GOOD is actually represented in unicode", because it is not represented in Unicode unless you have defined your project as such, and even then it requires TCHARs.

You also have to wrap your string literals in _T(), like this --

MessageBox(NULL, _T("GOOD"), _T("NOTE"), MB_OK);

I don't really understand what you're asking on the logic end of it, but I'll let others who are much more experienced address that.
h9uest wrote:
Things start to get logically wrong (here's my confusion):
if MessageBox() is expecting a pointer to unicode characters
AND
"GOOD" is actually represented in unicode,
the compiler should NOT give me an error.


Stop thinking in terms of Unicode and non-Unicode. The difference here is the character type. When calling WinAPI functions, the type of string you pass must match the type of string it expects.

It's a little confusing, but simple once you understand it:

On Windows:
- char is 8 bits
- wchar_t is 16 bits
- TCHAR is a typedef for either char or wchar_t, depending on your Unicode settings.

That said:
- MessageBox takes TCHAR strings (LPCTSTR)
- MessageBoxA takes char strings (LPCSTR)
- MessageBoxW takes wchar_t strings (LPCWSTR)


Therefore if you're using the MessageBox function, you must give it TCHARs. If you're using MessageBoxA, you must give it chars, etc.

When using string literals:

"GOOD"  // <- this is a char string
L"GOOD"  // <- this is a wchar_t string
_T("GOOD")  // <- this is a TCHAR string 



Therefore:
 
MessageBox(NULL,"GOOD","NOTE",MB_OK);


This fails in a Unicode build because you're passing char strings to a function that takes TCHARs, and TCHAR is wchar_t in that build.


All of the below would work:

MessageBox(NULL,_T("GOOD"),_T("NOTE"),MB_OK); // TCHAR string to TCHAR function - OK
MessageBoxA(NULL,"GOOD","NOTE",MB_OK);  // char string to char function - OK
MessageBoxW(NULL,L"GOOD",L"NOTE",MB_OK); // wchar_t string to wchar_t function - OK 
@Lamblion & Disch:

Thank you two very much for the detailed explanation. They are helpful.

I understand that the _T() method does work because it formats the char string into a tchar string which is expected by MessageBox().

I believe the real confusion lies in another workaround approach to this problem:
i.e. setting the "character set" field in (Project->property->configuration property) of visual studio.

If you set the project setting such that multi-byte character set is used, it seems that char strings are automatically treated as tchar string when necessary. In the meantime, if we use unicode character set in the project, then we lose this "automatic correction" feature.

That is what really confuses me, once again because:

if MessageBox() is expecting a pointer to unicode characters
AND
"GOOD" is actually represented in unicode,
the compiler should NOT give me an error.



Disch, you told me not to think about it in terms of unicode or multi-byte character. But what's wrong with it? Do you imply that the string "GOOD" takes one byte per character (i.e. char) regardless of my project setting? If that's the case, then what is the "unicode/multi-byte" project setting for?

Thank you, and hope to hear from you guys again.

I believe the real confusion lies in another workaround approach to this problem:
i.e. setting the "character set" field in (Project->property->configuration property) of visual studio.


That setting shouldn't matter. Properly written code will compile regardless of what that setting is set to.

The only reason that works is because TCHAR isn't really typesafe, since it's just an alias for char or wchar_t that gets picked by the preprocessor. Ideally, you would still get the error even after trying that "workaround".

If you set the project setting such that multi-byte character set is used, it seems that char strings are automatically treated as tchar string when necessary.


Sort of. Changing that setting just makes TCHAR be defined as char, so char and TCHAR become interchangeable. However you should not rely on that, as it makes your program dependent on that setting. It's best to just use the functions correctly as I outlined in my previous post.

if MessageBox() is expecting a pointer to unicode characters
AND
"GOOD" is actually represented in unicode,
the compiler should NOT give me an error.


This is confusing you because you're thinking about it the wrong way.

Unicode is just a character encoding. char and wchar_t strings can both represent Unicode (in UTF-8 and UTF-16 respectively). So the term "Unicode" is meaningless in this context.

Yes, "GOOD" is a valid Unicode string (UTF-8), but it's a char string and therefore does not work with MessageBox, which is a TCHAR function. Whether or not it's Unicode doesn't really matter... what matters is the character type.

Do you imply that the string "GOOD" takes one byte per character (i.e. char) regardless of my project setting?


Yes.

"GOOD" is a char string and therefore is always 1 byte per character.
L"GOOD" is a wchar_t string and therefore is always 2 bytes per character (on Windows)
_T("GOOD") is a TCHAR string and can be either 1 or 2 bytes per character depending on the settings.


If that's the case, then what is the "unicode/multi-byte" project setting for?


Honestly I don't know what good it's for. I pretty much just always set it to Unicode because I have little reason not to.
@Disch:

Got it.
Again, thanks a lot. I think I can happily mark it as solved now :)