UTF-8 in command prompt (console)

Well... it works and it doesn't... depending on your definition of "work".

It works in that when the font is Lucida console, the output is
Текст на кирилица?????????

But those "?" are disturbing... also, there's a square at first... I'm assuming the square is the BOM, but what's with the "?"s?

In addition, it doesn't work with the default command prompt font, which is a big loss.

Last, though not that important, it needs a file... is there a way this could be embedded into the source code, or pretty much any way where it won't need a separate file for that at runtime? Then again... using files for the interface is a good way to make a program work with multiple languages, switchable by the user... but then how can I make it display the same in both command prompt fonts?

I'm curious... what's that _T() function? And what's with the wmain()? I mean... is it needed?
It works in that when the font is Lucida console


It is required. Answer me one question: without Unicode support, or without Unicode fonts, can you write Unicode?? The normal console is not Unicode-enabled by default; its raster fonts cover only the current code page, so when you print, the program outputs the correct thing but the console can't display it, and hence it gives garbage.

Note: let me see if we can change the font of the console from the application. That would be better, what say?? ;)

The square is the BOM (three bytes, EF BB BF, in a UTF-8 file); you can avoid it by skipping it at the start of the input.

I was not getting the ???? that you are getting.. check what you are reading from the file.. are you reading some garbage as well???

Using a file for input is a very good idea, as you don't have to compile your code again and again.. but if you don't want to use external files, then save your .cpp file as Unicode and you can paste text in your local language directly into the code.

_T() and _TEXT() are the same macro: in a Unicode build they put the L prefix on the literal, turning a narrow (char) string literal into a wide (wchar_t) one. When your application is Unicode, all the strings should be wide strings. Now, when you write this:

_tprintf("Hello World");

as the application is Unicode, all the generic-text functions change from char to wchar_t (so _tprintf becomes wprintf). If you pass a narrow literal as above, the compiler will say it can't convert from const char* to const wchar_t*.
When you do this:
_tprintf(_T("Hello World!!"));

this makes the string wide and the compiler stops shouting..
And yes.. in a Unicode build _tmain is just a macro for wmain.. the entry point of the application changes when you make your application Unicode-compliant..
so instead of main() it becomes wmain()/_tmain().

Otherwise the linker gives an "unresolved external symbol wmain" error, as it will look for wmain now and not main().

hope all this is clear now.
change the font of console from the application:


HWND GetConsoleWindow();

BOOL WINAPI SetCurrentConsoleFontEx(
  __in  HANDLE hConsoleOutput,
  __in  BOOL bMaximumWindow,
  __in  PCONSOLE_FONT_INFOEX lpConsoleCurrentFontEx
);



http://msdn.microsoft.com/en-us/library/ms682073(VS.85).aspx


Also, remember that wcout and the like don't actually output unicode. They just use the ostream::narrow() method on the data you send through it, so you have to make sure that the correct conversion is being done there too.

Alas.
oh... is that so.. I didn't know that..
thanks.. :)
@writetonsharma
I've checked desiredOutput.txt (the one which you returned, and which was processed) with the binary editor - like I thought, there are no extra bytes at the end. Assuming each character takes 4 hex digits (with only the two spaces taking 2 hex digits each and the BOM taking the first 6), there's nothing following the final "а".

Maybe I should've stated my priorities and wishes more concretely. Sorry about that.

It is required (i.e. no sacrifices here) that the solution:
- Works with Windows XP and later.
- Allows reading and manipulation of Cyrillic.
- Works regardless of locale settings (on all supported OS-es).
- Works regardless of command prompt settings (on all supported OS-es).

It is a plus, but is not required (i.e. there may be sacrifices here) that the solution:
- Works on Linux and Mac OS, i.e. it's portable.
- Works for arbitrary characters, not just Cyrillic.
- Is a standard library, or an easy to install and use 3rd party library.

The solution of explicitly setting the code page to cp866 and then using AnsiToOem() fulfills all requirements, but none of the pluses. Each solution after that appears to negate some of the requirements.

SetCurrentConsoleFontEx() - works only on Vista. I'm actually not sure if XP supports Lucida Console to begin with. Best solution from this point on is finding how/if XP has Lucida Console, and how/if to set it. If XP doesn't have Lucida Console, then this is a total no-no.

Using wmain() - kills the portability plus and the command prompt settings requirement. No way to go from here really, as even if the command prompt settings requirement was fulfilled, this still wouldn't be better than the initial solution.

Writing UTF-8 in general - fulfills the arbitrary characters plus, but unless you copy a file in binary, there's not even a chance you'll be able to output UTF-8. In addition, unless the font is set to Lucida Console, you see crappy output. Best solution from this point on would be to use a library that converts a UTF-8 file into proper ANSI, dynamically setting the code page as it outputs it, and/or which stores UTF-8 strings in a specialized object. Anyone know of such a library? How to install it and use it in Visual Studio? Examples? I have no idea how to use the libraries Duoas gave, and their documentation isn't very supportive of Visual Studio (not much new here...).

BTW, in any case, thank you all for digging this mess up. I've already learned more about C++ from this topic in a few days than I have for the last few weeks from other sources (the site's tutorial and reference included... they are helpful to get started and as reminders later, but it's hard to find new stuff after them).
I agree the solutions provided thus far have been substandard. I can't believe how difficult a thing it is to print standard Unicode with standard C++ and standard libs. I'm rather appalled.

From the looks of things, the chances of finding one that works and is portable are looking slim (unless you riddle your code with #ifdefs or route all of your I/O through wrapper functions). I'd also be wary that a solution that works might only work when compiled with MSVS and not with another compiler. From the looks of some of this stuff (_tmain? wtf?) it looks very sketchy.

At any rate I wish you success and hope you find a way to meet your needs. I don't think I'll be of any more use in this topic though, unfortunately. =(

Again.. best of luck!
Disch:


I'd also be wary that a solution that works might only work when compiled with MSVS and not with another compiler


Correct.. but I asked at the beginning of the thread, and this was not going to be a cross-platform solution.

From the looks of things finding one that works and is portable is looking slim

If we take wchar_t instead of TCHAR then it is portable. wchar_t works on Linux also. It is a distinct wide character type (16-bit with MSVC, typically 32-bit on Linux). If I had used wchar_t and wprintf instead of TCHAR, it should have compiled on other platforms also.. though I never tested this. Will try it on Linux.. will let you know... :)

hahahaha.. _tmain.. its microsoft specific.. what can i say..billy wanted a new entry point for unicode applications.. he didnt listen to me .. :(
boenrobot

I've checked desiredOutput.txt (the one which you returned, and which was processed) with the binary editor - like I thought, there are no extra bytes at the end.

Debug the application and see, when it reads from the file:
input.read(Buff, size);
what it is reading.. is it reading some garbage??? I ran it on WinXP, and you have Vista.. maybe this is the issue.. try something....

It is required (i.e. no sacrifices here) that the solution:
- Works with Windows XP and later.
- Allows reading and manipulation of Cyrillic.
- Works regardless of locale settings.
- Works regardless of command prompt settings.


these will be fulfilled.. correct..??

It is a plus, but is not required (i.e. there may be sacrifices here) that the solution:
- Works on Linux and Mac OS, i.e. it's portable.
- Works for arbitrary characters, not just Cyrillic.
- Is a standard library, or an easy to install and use 3rd party library.


1. That was the deal, that the solution will work on Windows only.. so no Linux/Mac!!!
2. It will work with any character set. It is fulfilled.. try it for Hindi (India's national language).. it will work..
3. ????


SetCurrentConsoleFontEx() - works only on Vista. I'm actually not sure if XP supports Lucida Console to begin with. Best solution from this point on is finding how/if XP has Lucida Console, and how/if to set it. If XP doesn't have Lucida Console, then this is a total no-no.


i did it on winxp only.

Using wmain() - kills the portability plus and the command prompt settings requirement

you can use #ifdef blocks for windows/linux.. thats not a big problem.. will tell you the solution for this.

It's not Lucida Console only.. you can use any Unicode font.. like Arial Unicode MS.. they all will work.
Second point.. if you are doing Unicode programming you have to set the font to a Unicode-compatible font.. that you have to do.. be it the console or GUI stuff like edit boxes, list boxes.. etc. etc.
There is a little complexity in handling Unicode..
@writetonsharma
Wha...? You're saying SetCurrentConsoleFontEx() has worked for you on XP?!? It was able to actually compile and switch your console font from raster font to Lucida console? If so, then why does its documentation say it requires Windows Vista or Windows Server 2008 (look at the bottom of it)?

I realize that another unicode aware font would work as well, but in Vista at least, you can only set the command prompt to those two by default. There is no GUI for the user to select a third font, so I'm assuming applications aren't allowed to choose a third font either.

OK, I'm sold. If this works, it would indeed be better than the original in that it would work for all characters, even if it isn't portable. But if the console font can't be set on XP to a unicode aware font by the application, this kills the 1st and 4th requirements.

(I'll try to debug the application now...)
I haven't used it, but what's the problem..
What you have to do is this:

#define _WIN32_WINNT 0x0500
and it will pick the declaration of the function from the header..

regarding unicode font.. even if we are successful in setting the font to lucida console..that would do.. correct.. we just want that the string should be displayed..be it any font.

i will let you know how to use SetCurrentConsoleFontEx() !!!
for windows xp or later use this:

GetCurrentConsoleFont()
O..K... and how do I set the current console font on XP? It appears there's no SetCurrentConsoleFont(), only SetCurrentConsoleFontEx().

I can't get anything meaningful out of debugging. I get
Текст на кирилица췍﷽﷽ꮫꮫꮫꮫﻮ

As the buffer contents, and the value of size is 32.
The text in characters is 16 (starting from 0), and if each Unicode character is 2 bytes, then that should be OK. Spaces I think are still one byte though, so we could be having two "??" from there. Still, that doesn't explain the rest of the "?"s, and I'm not sure how to proceed from here.

BTW, I removed the BOM, so this 32 doesn't include it. The actual output was pretty much the same with it anyway (the difference was in the up front square, which is now gone).
Oh sorry, I saw GetCurrentConsoleFont().. my fault..

Every character in a wide (wchar_t) string takes 2 bytes or more on Windows.. it is only in the UTF-8 file that a space takes one byte..

On my system the program is working fine with the file you sent.. don't know what's happening at your end.. it's difficult to tell.. :(
It must be some compatibility issue between XP and Vista..

See, the size I am getting on my system is 35: 17 characters and one BOM.
Try setting the size to 35 (hard-code it) and read the file. See if you get the correct output.. this will show whether we are calculating the size of the file correctly. Try this first.. like:

size = 35;
Buff = new wchar_t[size + 1];


This should for sure output what you want and not garbage..
i was trying this program on linux.. with no success. i am not able to read even one byte from the file.. :(
I haven't really been keeping up with this thread as much as I should have. I'm currently puzzled by a few things... maybe you guys can answer them for me:

1) why are you using wchar_t arrays instead of wstring? Buffer overflow and trailing garbage characters are a nonissue if you use a container class (and read the file properly)

2) why are you reading from a file in the first place? This seems like an unnecessary step. If the desired end result is that he wants to be able to output text in his program (without just echoing an external file), isn't that where we should be focusing?

3) are wchar_t's UTF-16 encoded (2 bytes wide, handle surrogate pairs)? I know this is the case for some WinAPI calls, but is this true as well for console output? Or is it just a 16-bit fixed-width character (can't represent codepoints above U+FFFF). I suppose this is a minor thing.

4) didn't we already try this? I swear I tried SetConsoleOutputCP to set the encoding to UTF-8 and it didn't work. And I didn't see a UTF-16 option for SetConsoleOutputCP (at least none that didn't require .net) -- so what exactly is going on here?

5) does outputting with cout have a different end result than outputting with printf() (or whatever wide version of printf you're using)?
Like I said, I removed the BOM from my copy, so if yours is 35 bytes with BOM, then it should all be fine.

Nevertheless, I tried setting the buffer to 35 (as well as 40 and higher), and increasing the buffer makes it worse. With 32, the output is:
Текст на кирилица????????

and with 35, it's
Текст на кирилица????????????

(which is actually four extra "?")

It's interesting to note that with a smaller buffer (in the example below: 30), the output is:
Текст на кирилиц??????????

which is one less character from the file, and two more "?"... WTF?

Here's my copy of desiredOutput.txt in its binary form, in case there are still any doubts about its contents:
D0 A2 D0 B5 D0 BA D1 81 D1 82 20 D0 BD D0 B0 20 D0 BA D0 B8 D1 80 D0 B8 D0 BB D0 B8 D1 86 D0 B0
(exactly 32 bytes and no BOM)
1. Yes, we can use std::wstring, but for that we need to find a function which will fill a wstring with data.. like an equivalent of getline().
What is the definition of reading properly?? He must be reading the file till EOF, but still he's getting some garbage.. I also don't understand..

2. You are correct.. he can do that.. who is stopping him.. but I suggested he use a file so that he doesn't have to change his code every time a new string comes. Secondly, .cpp files are saved in the ANSI code page by default, and if you paste Unicode text into them, characters outside that code page are lost and become ????. So he needs to first create Unicode .cpp files and then start coding.. now the funny part.. when you open those Unicode .cpp files in VC++ it shows the hex encoding.. hahahaha.. and not the text.. don't know the workaround.. :P

3. wchar_t is a 16-bit type with MSVC if I remember correctly (it is typically 32-bit on Linux), so it is 2 bytes here. But a Unicode character can take more than one unit: in UTF-16 the characters above U+FFFF take two units (4 bytes), and in UTF-8 a character takes 1 to 4 bytes. A little story from before Unicode.. back then there was DBCS (double-byte character set). In this type of string a character can take 1 byte or 2 bytes.. in the same string..!!!!! Confused?? Yes, if you didn't already know that.. Now what the programmer has to do is check each byte to see whether it starts a one-byte or a two-byte character; if two-byte, combine it with the next byte and print; if one-byte, print it as it is.. it was just a nightmare, you can imagine. Unicode is better.. hehehe. :D

4. Don't know.. or didn't understand..

5. Don't know, but as Duoas said we can't use wcout. I never tried.

I was trying the program on Linux but it's not reading anything.. I am tired now.. came from the office.. started it straight away.. but no success.. you people can give it a try.. going for dinner now.. :(
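//unicode_test.cpp
#include <clocale>
#include <cstdio>
#include <cwchar>
#include <fstream>
#include <iostream>

using namespace std;

int main()
{
	if (!setlocale(LC_ALL, ""))
	{
		fprintf(stderr, "Failed to set the specified locale\n");
		return 1;
	}

	//try one..
	// Note: setlocale() only affects the C library. The stream below still
	// uses the global C++ locale, which is the classic "C" locale here.
	wifstream input("desiredOutput.txt", ios::binary);
	if (input.fail())
		return 1;

	// tellg() reports the size in bytes, which is only an upper bound on
	// the number of characters the stream can deliver.
	input.seekg(0, ios::end);
	long size = static_cast<long>(input.tellg());
	input.seekg(0, ios::beg);

	wchar_t *Buff = new wchar_t[size + 1];

	input.read(Buff, size);
	Buff[input.gcount()] = L'\0';	// terminate after what was actually read

	wprintf(L"%ls\n", Buff);	// %ls is the portable wide-string specifier
//	wcout << input.rdbuf();

	delete [] Buff;
	input.close();

//try two
/**
	// Only meaningful if the file stores raw wchar_t values.
	FILE *fp = fopen("uni_test.txt", "rb");
	if (fp == NULL)
		return 1;

	// Test the result of fread() instead of feof(), so the last
	// character is not printed twice.
	wchar_t c;
	while (fread(&c, sizeof(wchar_t), 1, fp) == 1)
	{
		wprintf(L"%lc", static_cast<wint_t>(c));
	}

	fclose(fp);
*/
	return 0;
}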


#uni_test.txt
Текст на кирилица
wat???

It's 32 in size.. on my machine it's 35????!!!!
I am confused now.. funny things happening...

OK, do one thing.. as the file size is 32 and in your program also it's giving 32, put a '\0' manually in the string..
that should do.. if not, god knows then... :(

Buff = new wchar_t[size + 1];
input.read(Buff, size);
*(Buff + size) = L'\0';
wprintf(L"%s\n",Buff); //now it should only print 32 characters only.. this should work.. my last try.. 


good night guys.. see you in the morning.. :)
