Reading partially correct, but stops halfway - imbued wifstream into wstring with ConsoleCP changed to 65001 UTF8

Hello!

This is my code:
#include <vector>
#include <iostream>
#include <fstream>
#include <locale>
#include <conio.h>
#include <Windows.h>

int main() {
	SetConsoleOutputCP(CP_UTF8);
	SetConsoleCP(CP_UTF8); // checked in cmd's properties after executing and the codepage is being correctly changed into UTF8 (65001)
	std::wifstream wif;
	wif.open("utf8.txt");
	std::wstring wstr;
	std::locale loc(""); // Polish
	wif.imbue(loc);
	wif >> wstr;
	std::wcout << wstr << " ";
	wif >> wstr;
	std::wcout << wstr << " ";
	wif >> wstr;
	std::wcin.imbue(loc);
	std::wcout.imbue(loc);
	std::wcout << wstr << " ";
	getline(std::wcin, wstr);
	std::wcout << wstr << " ";
	std::wcin >> wstr;
}


However, the program behaves strangely (the content of the file is also shown in the screenshot):
http://www.bankfotek.pl/image/2093166.jpeg

After typing in the same content through wcin, this is what happens:
http://www.bankfotek.pl/image/2093167.jpeg

Could someone explain why the read from the file stops after "zażó", and then mysteriously "eats" the first letter after using wcout?
By extension, how do I properly use UTF-8 when reading from a file?

----------------------------EDIT------------
I tried changing this portion:
wif >> wstr;
std::wcout << wstr << " ";
wif >> wstr;
std::wcout << wstr << " ";
wif >> wstr;


into getline(wif, wstr); and it did read the entire line correctly. However, the problem with the first letter disappearing after using wcin persists.

Moreover, the problem with wcin, even when imbued with the proper locale, is that all characters with diacritics in the string are converted to blank spaces. So, if I type
zażółć gęślą jaźń and then use wcout to show the input back, it reads:
za g l ja

Another interesting find is that deleting the wif.imbue(loc) call results in the same "halfway-stopped" behaviour. I thought that the console codepage should suffice; why do I need to imbue? Am I doing something wrong?

----------------------------EDIT------------

Again, I tried to investigate the issue. I built this loop:
wchar_t ch;
unsigned i = 0;
while (wif.eof()==false) {
	wif.get(ch);
	std::wcout << ch;
	++i;
}
i;


After the loop, the variable i had the value 30, not the expected 18 (the number of letters and whitespace characters, plus \0 at the end). What's wrong here? I suppose wchar_t is 2 bytes long and UTF-8 is 4 on this machine, and thus it treats 2 code points as separate chars? This doesn't explain the discrepancy though: 18*2 = 36.
And yes, although the program read new chars 30 times, it stopped showing them on the console after "zażó".
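
One possible accounting for the 30, offered as a sketch and not confirmed anywhere in this thread: the phrase is 17 code points, but 9 of them carry diacritics and take 2 bytes each in UTF-8, which gives 26 bytes. If the file was saved with a 3-byte UTF-8 BOM (an assumption) and every byte came out of the stream as one wchar_t (as a single-byte locale codepage would do), that is 29 reads, and a loop that tests eof() before reading runs its body one extra time. The helper names below (code_point_count, eof_loop_iterations, get_loop_iterations, kPhrase, kBom) are hypothetical and only illustrate the arithmetic on a plain byte stream.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Number of code points in a UTF-8 string: count every byte that is not a
// continuation byte (continuation bytes have the bit pattern 10xxxxxx).
std::size_t code_point_count(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Iterations of a loop shaped like the one posted: eof() only becomes true
// after a read has already failed, so the body runs one extra time.
std::size_t eof_loop_iterations(const std::string& bytes) {
    std::istringstream in(bytes);
    std::size_t i = 0;
    char ch;
    while (in.eof() == false) {
        in.get(ch);
        ++i;
    }
    return i;
}

// Iterations when the read itself is the loop condition.
std::size_t get_loop_iterations(const std::string& bytes) {
    std::istringstream in(bytes);
    std::size_t i = 0;
    char ch;
    while (in.get(ch)) ++i;
    return i;
}

// "zażółć gęślą jaźń" spelled out as UTF-8 bytes, so the encoding of this
// source file does not matter.
const std::string kPhrase =
    "za\xC5\xBC\xC3\xB3\xC5\x82\xC4\x87 g\xC4\x99\xC5\x9Bl\xC4\x85 ja\xC5\xBA\xC5\x84";

// Assumption: the file starts with a UTF-8 byte order mark.
const std::string kBom = "\xEF\xBB\xBF";
```

Under those assumptions, kPhrase.size() is 26, code_point_count(kPhrase) is 17, and eof_loop_iterations(kBom + kPhrase) is 30, while get_loop_iterations(kBom + kPhrase) is 29. Writing the loop as while (wif.get(ch)) avoids the extra pass regardless of the encoding question.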

----------------------------EDIT------------

With this code
SetConsoleOutputCP(CP_UTF8);
SetConsoleCP(CP_UTF8);
std::locale loc("");
std::wifstream wif;
wif.imbue(loc); std::wcout.imbue(loc); std::wcin.imbue(loc);
wif.open("utf8.txt"); // assumed: the open call is missing from the snippet as posted
wchar_t ch;
while (wif.eof()==false) {
	wif.get(ch);
	std::wcout << ch;
}
while (std::wcin >> ch) {
	std::wcout << ch;
}
std::wcout << std::endl;
std::wcin.ignore(); // pause


I was able to properly read from a utf-8 encoded file, but somewhere between processing user console input in wcin and showing it in wcout the program fails to deal with diacritics again.
So "zażółć gęślą jaźń" becomes "za g l ja".

I'm also hesitant to use things like getline because I don't know where to imbue them: if I imbue std::wcin and then call getline(std::wcin, var); will the result be properly encoded?
Sincerely.
I'm also hesitant to use things like getline because I don't know where to imbue them: if I imbue std::wcin and then call getline(std::wcin, var); will the result be properly encoded?

Yes, imbue() imbues the stream itself, in this case std::wcin, std::wcout, and wif.


Yes, imbue() imbues the stream itself, in this case std::wcin, std::wcout, and wif.


I imbued std::wcin, but it still leaves blank spaces instead of diacritics, no matter whether I use getline(std::wcin, wstr), std::wcin >> wstr, or a std::wcin.get(ch) loop.
What happens if you write the data to a file (that has been imbued()) instead of wcout?

But what should I imbue that stream with?
On Linux, from what I've gathered on the Internet, one can use "C.UTF-8" or "en_US.UTF-8" but these codes do not exist on Windows.

I tried outputting into an un-imbued file, but that obviously wasn't successful (blank spaces instead of diacritics). The file was also encoded in ANSI, even though I used wofstream and wstring.

EDIT: I also tried getting the current locale of the machine like this:
std::locale loc(setlocale(LC_ALL, "")); // direct initialization: std::locale's const char* constructor is explicit
wof.imbue(loc);
but nothing changed.

Sincerely.
On Linux, from what I've gathered on the Internet, one can use "C.UTF-8" or "en_US.UTF-8" but these codes do not exist on Windows.

You don't even need that on Linux: it supports Unicode as designed, and there is nothing to do as long as you remember that a character can occupy more than one char:

#include <fstream>
#include <iostream>
#include <string>

int main() {
    // on Linux, std::string uses UTF-8, std::wstring uses UTF-32 (aka Unicode)
    std::ifstream f("utf8.txt"); // not "wifstream" unless you need UTF-8 to UTF-32 conversion
    std::string s;
    std::getline(f, s);
    std::cout << s << '\n';

    std::getline(std::cin, s);
    std::cout << s << '\n';
}

live demo: http://coliru.stacked-crooked.com/a/6202daaa7f9f0e02

Windows doesn't support Unicode very well, as you're learning.
I am able to get your program to print the entire file just fine if I don't attempt to convert UTF-8 to UCS2 (the thing Windows puts in std::wstring):

#include <fstream>
#include <iostream>
#include <string>
#include <windows.h>

int main() {
    SetConsoleOutputCP(CP_UTF8);
    std::ifstream f("utf8.txt"); 
    std::string s;
    std::getline(f, s);
    std::cout << s << '\n';
}

screenshot: http://tiny.cc/hdgbny

And yes, input from a UTF-8 console seems to swallow diacritics for me too. I wonder if there's a known workaround.

PS: the old _setmode hack still seems to work, I can input "zażółć gęślą jaźń" from console and get it printed back at me with
#include <stdio.h>
#include <io.h>
#include <fcntl.h>

int main() {
	_setmode(_fileno(stdin), _O_U16TEXT);
	_setmode(_fileno(stdout), _O_U16TEXT);
	wchar_t buf[100];
	fgetws(buf, 100, stdin);
	fputws(buf, stdout);
}

("old" because it predates their new course towards partial UTF-8 support)
I changed the code in accordance with your advice:
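#include <cstdio>
#include <cwchar>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>
#include <io.h>
#include <fcntl.h>

int main() {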

	_setmode(_fileno(stdin), _O_U16TEXT);
	_setmode(_fileno(stdout), _O_U16TEXT);

	wchar_t buf[1000];
	fgetws(buf, 1000, stdin);
	// trailing \n removal
	buf[wcscspn(buf, L"\n")] = 0;

	std::ios_base::sync_with_stdio(false);
	std::wifstream wif; wif.open("utf8.txt");
	std::wstring str; std::wstring str2;
	getline(wif, str);
	std::wcout << "UTF8 File: " << str << std::endl;
	std::wcout << "FGETWS Input: " << buf << std::endl;

	std::locale loc("");
	std::wofstream wof; 
	wof.imbue(loc);
	wof.open("utf8-out.txt");
	wof << str2;
	std::wcin.ignore();
}


The output is this:
zażółć gęślą jaźń
UTF8 File: zażółć gęślą jaźń
FGETWS Input: zażółć gęślą jaźń


Firstly, I read elsewhere on the Internet that one shouldn't use both _setmode and SetConsoleOutputCP/SetConsoleCP in the same program, so I deleted that part. However, whether I delete it or not, reading from a file does not succeed.
I don't understand what has happened.

Notwithstanding the above problem I thank you for your _setmode suggestion :)

EDIT: it suddenly occurred to me that if one uses _setmode on stdin, then, to the best of my knowledge, this "mode" is inherited by cin and wcin. I tried just that, and UTF-16 works not only with fgetws but also with wcin.

P.S. I couldn't find an appropriate answer on the Internet, so a quick question: is there a way to avoid declaring the length of the stdin buffer as a constant? In an analogous situation, with char input[1000]; std::cin >> input; everyone would discourage me from doing such a thing, so I wonder if a more "dynamic" approach is available.

Sincerely.
reading from a file does not succeed.

Read from the file using ifstream, not wifstream. Windows cannot make wifstream work.
Once you have your data in std::string, you can do as you please: either print it directly (through SetConsoleOutputCP, as in my example above) or convert it to UTF-16 in a std::wstring using std::codecvt_utf8 or MultiByteToWideChar and print through _O_U16TEXT.
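
A rough sketch of the std::codecvt_utf8 route, assuming std::wstring_convert from <codecvt> is available (both are deprecated since C++17, so a compiler may warn). The function names utf8_to_wide and wide_to_utf8 are hypothetical. With a 16-bit wchar_t, std::codecvt_utf8<wchar_t> yields UCS-2, which covers the Polish letters in this thread but not characters outside the Basic Multilingual Plane.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 bytes (e.g. a line read with std::ifstream) -> wide string.
// from_bytes throws std::range_error if the input is not valid UTF-8.
std::wstring utf8_to_wide(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(utf8);
}

// Wide string -> UTF-8 bytes, e.g. for writing with a plain std::ofstream.
std::string wide_to_utf8(const std::wstring& wide) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(wide);
}
```

The intended use is std::getline(f, s) on a std::ifstream, then std::wcout << utf8_to_wide(s) once stdout is in _O_U16TEXT mode. The reverse function is one way to produce a UTF-8 output file from a std::wstring without relying on wofstream.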