How can I display this: ą

I'm working with chars and I need to write the character "ą" to a Windows edit box in my program. Does anyone know how I can do it? I'm trying to return a string that has this character in it.
Which character set are you using?

If Google got it right, and the "ą" is a lower case "Ą" -- A with ogonek -- then with Unicode,

    wchar_t test[] = L"This is an \'a\' with an ogonek : \x0105";
    int x = 300;
    int y = 300;
    TextOutW(hdc, x, y, test, wcslen(test));


works for me, if I select a suitable font. The character is missing from the default font on my machine, so it appears as a small black rectangle, but when I switch the font to Arial the char displays correctly.

So you might have to use WM_SETFONT to switch the font the Edit control is using to something that has an 'a' with ogonek in it.
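
For instance, something along these lines could create an Arial font and hand it to the Edit control (a minimal sketch, assuming hEdit is the Edit control's HWND; error handling omitted):

    // Create a font that contains U+0105 and give it to the Edit control.
    HFONT hFont = CreateFontW(16, 0, 0, 0, FW_NORMAL, FALSE, FALSE, FALSE,
                              DEFAULT_CHARSET, OUT_DEFAULT_PRECIS, CLIP_DEFAULT_PRECIS,
                              DEFAULT_QUALITY, DEFAULT_PITCH | FF_DONTCARE, L"Arial");
    SendMessageW(hEdit, WM_SETFONT, (WPARAM)hFont, MAKELPARAM(TRUE, 0));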

Andy

PS Unicode value came from:

Ą
http://en.wikipedia.org/wiki/%C4%84
Just enable the Unicode setting of your compiler (Visual Studio has it enabled by default) or add these lines to your header file before including windows.h:

#define UNICODE
#define _UNICODE 


Doing so, you will be working with wide chars (wchar_t) rather than chars; changes will be needed to your existing code.
I am using the multi-byte character set which seems to NOT be the way to go after reading what I could find online and you guys' responses. My program is based on a somewhat dated C++ book and it went with that character set. I saw the option for the Unicode character set in the properties. When I switched to it I had to make a few changes to the code to accommodate this but when I ran the program, it worked but the items I had in my drop down menu were in what I assume to be Chinese rather than the names I gave them in the code.

My program was working great until I found that I needed to display that "a with ogonek" character. Using the multi-byte character set, I was using char arrays and strings to do what needed to be done. If it will help, here is a small background on my project.

I am working on a simple program that will take in text entered into an edit box. This text will be from old language sources that use their own orthography. The dropdown menu will allow the user to select which source it is from and parse the text accordingly. Then the program will output the text into our current orthography (writing system). And we use that "a with ogonek" character so that is why I need to display it.

Having said that, what I need to know is what variables to use. Like I said, I've been using chars and strings but near as I can tell, those need to go out the window for something else. Here is what I have my program doing right now:

I have a char array (input[256]) that holds the input from the user which is taken in via:

GetWindowText(hInput, input, 256);

Then that "input" is taken via:

SetWindowText(hOutput, MainParser(input, stringLength, iSource).c_str());

As you can see, my MainParser function takes in the input, the string length, and the source (from the dropdown menu) and then works accordingly.

The MainParser takes all that in and then feeds it into the appropriate function. For example, here is one:

string Merrill(char* pInput, int iStringLength);

That returns a string after taking in a char* to the input.

Then that goes through a series of if statements to check each character and then swap it out accordingly. Like this:

if (pInput[i] == 'æ')
{
    term += "e";
}


I use the iStringLength to determine how many times the for loop needs to run.

Now, out of all of that, if I want to use Unicode, what variable types/typedefs will I need to use in place of each of those? "wchar_t" was mentioned in place of char, but what about the string? Will I still be able to use those, or will something else be needed?

Thanks for the replies!
Using the multi-byte character set


As far as I know, that setting does absolutely nothing apart from #defining the UNICODE and/or _UNICODE macros for the preprocessor.

And defining UNICODE only changes what TCHARs are. But you're probably not using TCHARs because you're smart, and TCHARs are a bad idea. So I would not worry about this.


Now, out of all of that, if I want to use Unicode, what variable types/typedefs will I need to use in place of each of those? "wchar_t" was mentioned in place of char, but what about the string? Will I still be able to use those, or will something else be needed?


Without getting too technical on what Unicode is vs. encoding formats, I will say that you definitely want to use Unicode. You have two general options:

1) Use UTF-16, where each "character" is 2 bytes (usually). On Windows this means using wchar_t and std::wstring for chars/strings.

2) Use UTF-8, where each "character" is 1 byte minimum, but characters outside the basic ASCII set are represented by multiple bytes. For example, the ą character mentioned before is represented as 2 bytes (chars) in sequence: 0xC4, 0x85.


Each has its pros and cons. The biggest downside to UTF-8 is that it can get confusing if you're going to be working on individual characters like you are. For example, if you want to replace 'a' with 'ą', you will actually increase the size of the string, because 'a' is represented in 1 char, whereas 'ą' needs 2 chars. Some characters may even need 3 chars.
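
A tiny sketch of what that looks like in code, assuming the std::string really holds UTF-8:

    std::string term;
    term += "e";           // 'e' is a single byte in UTF-8
    term += "\xC4\x85";    // 'ą' (U+0105) takes two bytes in UTF-8
    // term.size() is now 3 -- it counts bytes, not characters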

With UTF-16, pretty much everything (aside from very, very seldom-used glyphs) can be represented in a single 16-bit wchar_t. Also, WinAPI functions naturally accept wchar_t strings and interpret them as UTF-16.

So what it sounds like is that for you, UTF-16 (wchar_t, wstring) would be easier to work with.

-----------
If you want to use UTF-16, you must use wchar_t's, wstrings, and "wide" literals, i.e.:
L"foo";  // <- this, the 'L' makes it wide  (wchar_t)
"foo";  // vs. this, which is narrow  (char) 

And you must call the 'W' version of WinAPI functions to indicate you have UTF-16 strings, e.g. SetWindowTextW instead of SetWindowText.

You'll want the 'W' version of any structs, too... such as OPENFILENAMEW, etc.
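
Put together, your earlier GetWindowText/SetWindowText code might end up looking roughly like this -- just a sketch, assuming MainParser is changed to take a wchar_t* and return a std::wstring:

    wchar_t input[256] = {0};
    GetWindowTextW(hInput, input, 256);
    std::wstring output = MainParser(input, (int)wcslen(input), iSource);
    SetWindowTextW(hOutput, output.c_str());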


--------------
If you want to use UTF-8, you'll have some extra work, because I don't think you can give UTF-8 strings to WinAPI. I thought you could if you set the codepage to UTF-8, but after checking MSDN, I don't see any way to do that (except for the console).

This means you'll have to expand UTF-8 to UTF-16 with MultiByteToWideChar ( http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx ), then pass the UTF-16 string to WinAPI as described above. In light of this, using UTF-16 for everything seems more and more like the way to go.
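
A rough sketch of that expansion (error handling omitted):

    std::string utf8 = "\xC4\x85";  // "ą" as two UTF-8 bytes
    // First call asks how many wchar_t's are needed; second call does the conversion.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), (int)utf8.size(), NULL, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), (int)utf8.size(), &wide[0], len);
    // wide now holds L"ą" and can be passed to the W versions of WinAPI calls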


---------
if (pInput[i] == 'æ')


I'm not sure this will work even with wide strings, as it depends greatly on how your text editor saves the .cpp file, and how the compiler decides to interpret that string.

I think there was some Unicode support added in C++11, but I'm not entirely clear on how you'd apply it to this, as I thought it was only for string literals and not individual characters. Also, it would still be subject to how your editor encodes the file.


The only way I know of to make this foolproof would be to use the raw U+ character code. For example, 'Ǽ' is designated U+01FC (take a look in the Windows CharMap program and it'll display all that stuff)... which means that instead of doing this:

if( pInput[i] == 'Ǽ' ) // <- which probably won't work

You could do this:

if( pInput[i] == 0x01FC ) // which will definitely work as long as pInput is UTF-16

Of course that's not as easy to read... or write....
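
Applied to a replacement loop like yours, it might look something like this (a sketch; pInput is assumed to be a wchar_t* holding UTF-16, and the particular replacements are just made-up examples):

    std::wstring term;
    for (int i = 0; i < iStringLength; ++i)
    {
        if (pInput[i] == 0x00E6)        // 'æ' (U+00E6)
            term += L"e";
        else if (pInput[i] == 0x0105)   // 'ą' (U+0105)
            term += L"a";               // hypothetical replacement
        else
            term += pInput[i];
    }

The comments keep it readable even though the comparisons use raw codepoints.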
I got it to work! Thanks guys! I'm still not 100% sure what's going on (this may be one of those "Do it now, understand it later" situations). I'll tell you this... they sure don't go over this stuff in those "hello world" programming tutorials that promise to turn you into a programmer in an unreasonably short amount of time!

I think I see what's going on, though. I may just need some time to let my brain get wrapped around it. I set the properties to the Unicode character set, swapped out all of my chars for wchar_ts and strings for wstrings, and put "L" before all of my string literals (e.g. term += L"e";).

Am I to assume that Unicode attempts to show every character possible and that the 1 byte character system couldn't do it (ran out of binary possibilities) so it uses 2 bytes?

And if I am going to use Unicode then I will now have to use wchar_t and wstring from here on out along with "L"? If so, I might as well get used to them.
Am I to assume that Unicode attempts to show every character possible and that the 1 byte character system couldn't do it (ran out of binary possibilities) so it uses 2 bytes?


Yes and no. Mostly you're right, but technically there's more to it.

Unicode assigns a 'codepoint' to certain glyphs. It's basically just a mapping system that assigns glyphs/characters a unique ID number.

UTF-8 and UTF-16 are different means to represent those characters. UTF-8 is based on 8-bit values and UTF-16 based on 16-bit... but both are variable length, meaning codepoints may need multiple values to represent them.

And if I am going to use Unicode then I will now have to use wchar_t and wstring from here on out along with "L"?


If you want to use wide strings then yes. It just so happens that WinAPI treats wide strings as UTF-16, so that's often the easiest way to handle Unicode text on Windows.

However, it isn't necessary. Remember that UTF-8 is also able to represent Unicode codepoints, so it's perfectly possible (and rather common) to have Unicode in normal/narrow chars and strings.
As Disch has already pointed out, Windows is not very UTF-8 friendly. You have to convert all UTF-8 text to UTF-16 (or even to ANSI, if it can be represented as such) for display purposes, if you want to avoid mojibake.

(A new word to me: Mojibake
http://en.wikipedia.org/wiki/Mojibake

Converting from UTF-8 to ANSI requires two Win32 calls: MultiByteToWideChar with CP_UTF8 followed by WideCharToMultiByte with the required ANSI codepage identifier; a third-party library or custom routine is required for the direct route.)
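
A rough sketch of that two-step conversion, with fixed-size buffers and no error handling:

    const char* utf8 = "\xC4\x85";  // "ą" in UTF-8
    wchar_t wide[16] = {0};
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, 16);
    char ansi[16] = {0};
    WideCharToMultiByte(1250, 0, wide, -1, ansi, 16, NULL, NULL);
    // ansi now holds "\xB9" -- 'a' with ogonek in Windows-1250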

And the MBCS functions (mbslen, mbsinc, ...) cannot handle UTF-8, either. They only handle charsets which have 1 or 2 bytes per character, like code pages 932 (Shift_JIS) and 950 (Big5); that is, not very multi!

But you can display an ą with ANSI if you use a font for the right charset.

I checked this using an Eastern European Arial font:

    const char faceName[] = "Arial";
    const int nFontSize = 14;

    LOGFONTA logFont = {0};
    logFont.lfHeight = -MulDiv(nFontSize, GetDeviceCaps(hdc, LOGPIXELSY), 72);
    logFont.lfCharSet = EASTEUROPE_CHARSET;
    strcpy(logFont.lfFaceName, faceName);
    HFONT hFont = CreateFontIndirectA(&logFont);


and the string:

"This is an \'a\' with an ogonek : \xB9"

Where B9h is the code for an 'a' with an ogonek in the Windows-1250 codepage.
http://en.wikipedia.org/wiki/Windows-1250
... used under Microsoft Windows to represent texts in Central
European and Eastern European languages that use Latin script, such as
Polish, Czech, Slovak, ...
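
To actually draw with the font created above, the selection step might look like this (a sketch; hdc and hFont are from the snippet above, and error handling is omitted):

    const char test[] = "This is an \'a\' with an ogonek : \xB9";
    HFONT hOldFont = (HFONT)SelectObject(hdc, hFont);
    TextOutA(hdc, 300, 300, test, (int)strlen(test));
    SelectObject(hdc, hOldFont);
    DeleteObject(hFont);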

But, if at all possible, you should switch to UTF-16.

Andy

PS See table towards bottom of this page:

TranslateCharsetInfo
http://msdn.microsoft.com/en-us/library/aa915041.aspx

(this page is for Windows CE; the normal version of the page is missing the table.)
Thanks for the help, guys :). I got it working and your information on how Unicode works has gotten me started on learning something that I REALLY need to get into.
:-)

Given your user name Hildólfr, which I now know is Old Norse for "war-wolf" (and also the name of the son of Odin according to the Nafnaþulur list of the Prose Edda's Skáldskaparmál...),

Are you translating Old Norse documents??

Andy

PS Thanks to Wikipedia.org, of course:

Hildólfr
http://en.wikipedia.org/wiki/Hild%C3%B3lfr
No, although that wouldn't be a bad project! I work for a Native American tribe and I am looking for a way to crank out possibilities based on the orthographies in old documents. I can look at these possibilities in the current orthography and maybe see familiar terms, patterns, etc.

I'm not Native but I do have Norwegian ancestry which is where the inspiration for my name comes from. You are almost right about the translation. The term I use is "hljod" which is a sort of Anglicized version of "hljoð" (with the "eth" character which is typically changed to a "d" in English...for example, Oðin became Odin in English). This term is Old Norse for "silent" or, funnily enough, "sound." But I go for the "Silent Wolf" translation. Sounds cooler :).

That name is cited in Richard Cleasby's Old Icelandic dictionary (available for free download via Google Books) as "Hljoðolfr", which it says was a dverger (dwarf) name.

You are thinking of "hildr" (battle, fight, war).

http://www.nordicnames.de/wiki/Hildulfr
Topic archived. No new replies allowed.