Reading Japanese text from a UTF-8 text file

Hi, I've been struggling with this problem.
I searched Google for a long time but still can't figure it out.
I have a UTF-8 text file with a Japanese phrase in it.
I tried to use wifstream to read the file into a wstring, but the string holds garbage instead of Japanese. Does anyone know how to do this?

I am using the Win32 API.

"test.txt" contains: <Japanese>これは日本語の文です</Japanese>
wstring wreadinput(wifstream &file) // read all data from file (wchar_t)
{
	wstring str;
	wstring strtemp;
	wchar_t bos[3]; // byte order mark

	file.read(bos, 3); // take bos out
	while(!file.eof())
	{
		getline(file, strtemp);
		str.append(strtemp);
	}
	return str;
}


 ...(in some other function)...
   // read the text file
   wifstream file;
   file.open("test.txt"); 
   if(!file)
   {
      return FALSE;
   }
   wstring str_test;
   str_test = wreadinput(file);


str_test read was: <Japanese>これは日本語の文です</Japanese>

Does this have something to do with the locale?
Unfortunately, the standard libs are ignorant of Unicode, so this isn't as easy as it should be.

You'll have to manually decode the UTF-8 (or find a lib that does it). Details of UTF-8 are here: http://en.wikipedia.org/wiki/UTF-8#Description

Note that UTF-8 is made of 8-bit units, not wide characters, and a single character can span several bytes. So reading it straight into a wstring isn't going to work (though it might if the text file were UTF-16). Also, if you're reading a text file, you probably shouldn't be opening it as binary.
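
To make the point about 8-bit units concrete, here is a minimal sketch of how the first byte of a UTF-8 sequence tells you how many bytes the character occupies. The names Utf8SequenceLength and CountCodePoints are made up for this illustration and do not come from any library, and CountCodePoints assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns how many bytes the UTF-8 sequence starting
// with `lead` occupies, or 0 if `lead` cannot start a sequence.
std::size_t Utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;   // 0xxxxxxx: plain ASCII
    if (lead < 0xC0) return 0;   // 10xxxxxx: continuation byte, not a start
    if (lead < 0xE0) return 2;   // 110xxxxx
    if (lead < 0xF0) return 3;   // 1110xxxx
    if (lead < 0xF8) return 4;   // 11110xxx
    return 0;                    // F8-FF: never valid in UTF-8
}

// Hypothetical helper: counts characters (code points) by counting every
// byte that is not a continuation byte. Assumes valid UTF-8 input.
std::size_t CountCodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char c : utf8)
        if ((c & 0xC0) != 0x80)
            ++count;
    return count;
}
```

Kana and kanji like the ones in test.txt take three bytes each, so a stream that widens every byte into its own wchar_t produces three meaningless characters per Japanese character.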


Lastly, even if you get it working, printing the string to the user isn't easy on some platforms (Windows console). wcout is effectively useless there: you can't just feed it Unicode strings like you might expect.

I don't know if you're outputting to the Windows console, but in the event you are, you might want to read this thread: http://www.cplusplus.com/forum/windows/9797/page3.html#msg46844
Thanks for the reply.
Yes, I shouldn't read as binary; I forgot to take that out.
Anyway, I did notice that output to the console is difficult.
So I'm actually using the Win32 API.
I shall update my first post.

I read the UTF-8 article but I'm still not sure how to do the decoding in C++.
Is there an example?
I was bored. Here you go.

Note it's a little long/complicated, but it also validates the string to make sure it's valid UTF-8, and it accounts for all edge cases I could think of. Nothing should trip it up -- should be very sturdy.

#include <limits>   // std::numeric_limits
#include <string>   // std::wstring

std::wstring FromUTF8(const char* str)
{
    const unsigned char* s = reinterpret_cast<const unsigned char*>(str);

    static const wchar_t badchar = '?';

    std::wstring ret;

    unsigned i = 0;
    while(s[i])
    {
        try
        {
            if(s[i] < 0x80)         // 00-7F: 1 byte codepoint
            {
                ret += s[i];
                ++i;
            }
            else if(s[i] < 0xC0)    // 80-BF: invalid for midstream
                throw 0;
            else if(s[i] < 0xE0)    // C0-DF: 2 byte codepoint
            {
                if((s[i+1] & 0xC0) != 0x80)		throw 1;

                ret +=  ((s[i  ] & 0x1F) << 6) |
                        ((s[i+1] & 0x3F));
                i += 2;
            }
            else if(s[i] < 0xF0)    // E0-EF: 3 byte codepoint
            {
                if((s[i+1] & 0xC0) != 0x80)		throw 1;
                if((s[i+2] & 0xC0) != 0x80)		throw 2;

                wchar_t ch = 
                        ((s[i  ] & 0x0F) << 12) |
                        ((s[i+1] & 0x3F) <<  6) |
                        ((s[i+2] & 0x3F));
                i += 3;

                // make sure it isn't a surrogate code point (D800-DFFF)
                if((ch & 0xF800) == 0xD800)
                    ch = badchar;

                ret += ch;
            }
            else if(s[i] < 0xF8)    // F0-F7: 4 byte codepoint
            {
                if((s[i+1] & 0xC0) != 0x80)		throw 1;
                if((s[i+2] & 0xC0) != 0x80)		throw 2;
                if((s[i+3] & 0xC0) != 0x80)		throw 3;

                unsigned long ch = 
                        ((s[i  ] & 0x07) << 18) |
                        ((s[i+1] & 0x3F) << 12) |
                        ((s[i+2] & 0x3F) <<  6) |
                        ((s[i+3] & 0x3F));
                i += 4;

                // make sure it isn't a surrogate code point and
                //  doesn't lie beyond the Unicode range (max 10FFFF)
                if((ch & 0xFFF800) == 0xD800 || ch > 0x10FFFF)
                    ch = badchar;

                if(ch < 0x10000)	// overlong encoding -- but technically possible
                    ret += static_cast<wchar_t>(ch);
                else if(std::numeric_limits<wchar_t>::max() < 0x110000)
                {
                    // wchar_t is too small for 4 byte code point
                    //  encode as UTF-16 surrogate pair

                    ch -= 0x10000;
                    ret += static_cast<wchar_t>( (ch >> 10   ) | 0xD800 );
                    ret += static_cast<wchar_t>( (ch & 0x03FF) | 0xDC00 );
                }
                else
                    ret += static_cast<wchar_t>(ch);
            }
            else                    // F8-FF: invalid
                throw 0;
        }
        catch(int skip)
        {
            if(!skip)
            {
                do
                {
                    ++i;
                }while((s[i] & 0xC0) == 0x80);
            }
            else
                i += skip;
        }
    }

    return ret;
}



Usage:

string utf8;  // note it's a string, not a wstring

utf8 = ReadUTF8FromFile( yourfile );

wstring unicodestring = FromUTF8( utf8.c_str() );

// give unicodestring to WinAPI 



EDIT: removed tabs. correcting casting error

EDIT 2: forgot about values F8 and up
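
The usage snippet calls ReadUTF8FromFile without defining it. One possible version is sketched below. It is only an assumption about what that helper might look like: the parameter (a file path) and the behaviour (return an empty string on failure, drop a leading byte order mark) are choices made for this example, not something stated in the thread.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: the thread names ReadUTF8FromFile but never defines it.
// Reads the whole file into a narrow string without interpreting the bytes
// as characters, then drops a leading UTF-8 byte order mark (EF BB BF).
// Returns an empty string if the file cannot be opened.
std::string ReadUTF8FromFile(const char* path)
{
    std::ifstream file(path);   // narrow stream, opened in text mode
    if (!file)
        return std::string();

    // Pull every byte from the stream buffer into the string.
    std::string bytes((std::istreambuf_iterator<char>(file)),
                      std::istreambuf_iterator<char>());

    // Remove the BOM if the file starts with one.
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0)
        bytes.erase(0, 3);

    return bytes;
}
```

The result is still raw UTF-8, so it has to go through FromUTF8 before it is handed to any wide-character WinAPI function.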
Oh my god! It works!
I don't understand half of that code, but the conversion was perfect.
It even works with Chinese.
Thank you Disch, you saved this guy in distress.
I've been searching on the Internet for a whole day and wasn't able to find a solution as awesome as this.

Once again thanks.
Hello bluewind,

this library, http://utfcpp.sourceforge.net/, looks good to me for your purpose.
closed account (EzwRko23)
Oh, the almighty Boost doesn't have UTF-8 support? What a pity.
"Oh, the almighty Boost doesn't have UTF-8 support? What a pity."


Maybe it does. I honestly didn't check.
There might be a Boost.Unicode coming in the near future; I think it was being considered earlier this year.
@coder777
Yeah, I did check out that library, but wasn't sure how to use it.
It only has UTF-8 to UTF-16 and UTF-32 conversion.
"It only has UTF-8 to UTF-16 and UTF-32 conversion."


FWIW, all my function does is convert UTF-8 to UTF-16 (if wchar_t is 16 bits wide) or UTF-32 (if wchar_t is larger).
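
The UTF-16 case is the only one that needs extra arithmetic: a code point above FFFF does not fit in one 16-bit unit, so it is split into a surrogate pair. Here is a small sketch of that step on its own. EncodeUtf16 is a name invented for this example, it uses char16_t so the result does not depend on the size of wchar_t, and it assumes the input is a valid code point (no range or surrogate checks).

```cpp
#include <string>

// Hypothetical helper: encodes one code point as UTF-16.
// Assumes `cp` is a valid Unicode code point (not checked here).
std::u16string EncodeUtf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000)
    {
        // Fits in a single 16-bit unit.
        out += static_cast<char16_t>(cp);
    }
    else
    {
        // Subtract 0x10000, then split the remaining 20 bits in half:
        // the top 10 bits go into the high surrogate (D800-DBFF),
        // the bottom 10 bits into the low surrogate (DC00-DFFF).
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 | (cp >> 10));
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
    }
    return out;
}
```

For example, U+65E5 (日) stays a single unit, while U+1F600 becomes the pair D83D DE00.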
Ha ha, I see.
I don't have experience with doing Unicode conversion.
Still, your code works fine, so I would like to stick with it.
