Special characters using fstream

I'm having some problems with special Norwegian letters. The following code works:

#include <iostream>
#include <locale.h>
#include <fstream>
#include <string>

using namespace std;

int main(){
    setlocale(LC_ALL, "norwegian");
    cout << "æøå" << endl;
    return 0;
}


but when I try to read from a file using fstream, 'Ø' turns into 'Ã˜', 'å' turns into 'Ã¥', and so on.

How do I fix this?
If in has type std::ifstream then try to use


in.imbue( std::locale() );

before reading all other data.

Thank you for the response, but it did not work.
Use

in.imbue( std::locale( "norwegian" ) );
That did not work either. Does it work for you? If so, can you give me some example code?
What o/s are you using? And what compiler?

Also, I assume your file is UTF-8 encoded...

For Windows and Visual Studio 2010, this code reads and displays a UTF-8 encoded file. If you're using the MinGW version of GCC, you might have a problem, as I don't think it fully implements locales (unlike the Linux version).

Andy

#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <codecvt>

#include <cstdio>  // for _fileno
#include <io.h>    // for _setmode
#include <fcntl.h> // for _O_U16TEXT

using namespace std;

void dump_file(const wstring& filePath) {
	// A Windows console will only display Unicode special characters if
	// the translation mode is set to UTF-16
	int oldMode = _setmode(_fileno(stdout), _O_U16TEXT);

	// open the file as Unicode, so we can read into wstrings
	wifstream ifs(filePath);

	// imbue the file with a codecvt_utf8 facet which knows how to
	// convert from UTF-8 to UCS2 (the 2-byte part of UTF-16)
	// Note this is available in Visual C++ 2010 and later
	locale utf8_locale(locale(), new codecvt_utf8<wchar_t>);
	ifs.imbue(utf8_locale); 

	// Skip the BOM (this gets translated from the UTF-8 to the
	// UTF-16 version so will be a single character.)
	wchar_t bom = L'\0';
	ifs.get(bom);

	// Read the file contents and write to wcout
	wstring line;
	while(getline(ifs, line)) {
		wcout << line << endl;
	}

	// put the translation mode back to normal
	_setmode(_fileno(stdout), oldMode);

	cout << endl;
}

int main() {
	wstring filePath = L"limerick.txt";
	dump_file(filePath);
	return 0;
}


Where limerick.txt is a UTF-8 text file containing

En limerick skal være på fem linjer, hvor første,
andre og femte linje har samme enderim og består
av tre verseføtter. Tredje og fjerde er kortere
med to verseføtter, og de deler enderim.

(which is also displayed correctly by the console.)
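
One caveat about the code above: it always discards the first character as the BOM, so a UTF-8 file saved without a BOM would lose its first letter. A minimal sketch of a conditional skip follows, assuming a plain byte stream rather than the imbued wifstream used above; `skip_utf8_bom` and `read_after_bom` are hypothetical helper names, not part of any library.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Consume the three UTF-8 BOM bytes (EF BB BF) if, and only if, they are
// the next bytes in the stream. Returns true when a BOM was skipped.
bool skip_utf8_bom(std::istream& in) {
    const std::istream::pos_type start = in.tellg();
    char buf[3] = {};
    in.read(buf, 3);
    const bool has_bom =
        in.gcount() == 3 &&
        static_cast<unsigned char>(buf[0]) == 0xEF &&
        static_cast<unsigned char>(buf[1]) == 0xBB &&
        static_cast<unsigned char>(buf[2]) == 0xBF;
    if (!has_bom) {
        // Not a BOM (or fewer than 3 bytes available): clear any
        // eof/fail state and rewind to where we started.
        in.clear();
        in.seekg(start);
    }
    return has_bom;
}

// Demonstration helper (hypothetical): returns everything that follows
// the optional BOM in an in-memory byte string.
std::string read_after_bom(const std::string& bytes) {
    std::istringstream in(bytes);
    skip_utf8_bom(in);
    std::ostringstream rest;
    rest << in.rdbuf();
    return rest.str();
}
```

With the wide stream used above, the BOM arrives as the single character U+FEFF after conversion, so the equivalent check there would be to compare the result of `peek()` with `L'\xFEFF'` before calling `get`.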
Nice! It works perfectly fine reading from the file now. My only problem now is writing this to a new file :P When I try that, it stops writing to the file as soon as it hits the first letter like 'æ', 'ø', 'å', etc.
How are you trying to write to the file?

(Posting minimal but complete code would prob be most helpful here.)

Andy
Never mind, I forgot to write
input.get(bom);
in the second place in my code. Silly me :)

Thanks for all help!
:-)

I take it your o/p file ended up UTF-8, as you hoped??

Andy
Yes, but now I've discovered a new bug. It won't read signs like "–", like it did before. What happened?

This is my code so far

#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <codecvt>

#include <cstdio>
#include <io.h>
#include <fcntl.h>

using namespace std;

int main(){
	int oldMode = _setmode(_fileno(stdout), _O_U16TEXT);

	wifstream input(L"LaTeXHeader.txt");
	wofstream output(L"messagesConverted.txt");

	locale utf8_locale(locale(), new codecvt_utf8<wchar_t>);
	input.imbue(utf8_locale); 

	wchar_t bom = L'\0';
	input.get(bom);

	wstring line;
	
	while (!input.eof()){
		getline(input, line);
		output << line << endl;
	}
	input.close();	
	_setmode(_fileno(stdout), oldMode);
	input.open(L"messages.txt");

	input.get(bom);
	
	while (!input.eof()){
		getline(input, line);
		output << line << endl;
	}
	input.close();
	output.close();
	_setmode(_fileno(stdout), oldMode);
	
	return 0;
}
Seems like I found a solution, without knowing what it was :P
I'll have a look a bit later...

But what are you hoping to do? Read a UTF-8 file in and then write it out as UTF-16 ??

Or what?

Andy
No, I think I want to read a UTF-16 file and then write it out as UTF-16. I want the program to be able to handle all signs in the document, including signs like ❤.

Not sure what type of document I have, how do I figure it out?
If my previous code worked, your file is UTF-8. The 'Ã˜' and 'Ã¥' are how the two-byte UTF-8 encodings of 'Ø' and 'å' appear when each byte is shown as a separate character.
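
To make the byte-level picture concrete, here is a small sketch that encodes a code point as UTF-8 and prints the bytes in hex. The function names `encode_utf8` and `to_hex` are made up for this illustration, and the encoder does no validation (it assumes a valid code point up to U+10FFFF).

```cpp
#include <string>

// Encode one Unicode code point as UTF-8.
// Illustrative only: no checks for surrogates or out-of-range values.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Format bytes as space-separated upper-case hex, e.g. "C3 98".
std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : bytes) {
        if (!out.empty()) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

`to_hex(encode_utf8(0x00D8))` (the code point of 'Ø') gives `C3 98`, and in the Windows-1252 code page byte C3 is 'Ã' and byte 98 is '˜'. Likewise 'å' (U+00E5) gives `C3 A5`, which displays as 'Ã¥'. The same function encodes ❤ (U+2764) as the three bytes `E2 9D A4`.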

If you're using Windows, which I presume you are, open the text file with Notepad and then do "Save As". The encoding the file is using will be displayed in the combobox at the bottom of the dialog.

Alternatively, open the text file with a hex viewer:
- a normal Windows text file (extended ASCII) will use one byte per character, including 'Ø' and 'å'
- a UTF-8 file will use one byte per normal character but two for (e.g.) 'Ø' and 'å', and should begin with the Byte Order Mark (in hex) EF BB BF
- a little-endian UTF-16 file will use two bytes per character and should begin with the Byte Order Mark (in hex) FF FE
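
The hex-viewer check can also be done in code. The following is a rough sketch that looks only at the Byte Order Marks listed above (plus the big-endian UTF-16 one, FE FF); the `Bom` enum and both function names are invented for this example. A result of `Bom::None` does not rule out UTF-8, because many UTF-8 files are saved without a BOM.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Classify the first bytes of a file by their Byte Order Mark.
Bom detect_bom(const std::string& head) {
    auto byte = [&head](std::size_t i) {
        return static_cast<unsigned char>(head[i]);
    };
    if (head.size() >= 3 && byte(0) == 0xEF && byte(1) == 0xBB && byte(2) == 0xBF)
        return Bom::Utf8;
    if (head.size() >= 2 && byte(0) == 0xFF && byte(1) == 0xFE)
        return Bom::Utf16LE;
    if (head.size() >= 2 && byte(0) == 0xFE && byte(1) == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;
}

// Read up to three raw bytes from the file and classify them.
// A file that cannot be opened yields Bom::None.
Bom detect_file_bom(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    char buf[3] = {};
    in.read(buf, sizeof buf);
    return detect_bom(std::string(buf, static_cast<std::size_t>(in.gcount())));
}
```

For example, `detect_file_bom("messages.txt")` would return `Bom::Utf8` for a file that Notepad saved as UTF-8 with a BOM.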

Andy
Right, my file is UTF-8. Is it then impossible to get signs like "❤" then?
Is it then impossible to get signs like "❤" then?

UTF-16 can deal with them, too.

Andy
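
As a side note on how UTF-16 handles such signs, here is a minimal sketch of the encoding rule. `encode_utf16` is a hypothetical helper written for this example, and it assumes a valid code point. ❤ is U+2764, which fits in a single 16-bit unit; characters above U+FFFF are stored as a surrogate pair of two units.

```cpp
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Illustrative only: assumes cp is a valid code point up to U+10FFFF.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit.
        out += static_cast<char16_t>(cp);
    } else {
        // Supplementary planes: split into a high and a low surrogate.
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 + (cp >> 10));
        out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
    }
    return out;
}
```

So `encode_utf16(0x2764)` yields one unit, 0x2764, while `encode_utf16(0x1F600)` yields the pair 0xD83D 0xDE00. A UTF-8 file can hold both characters as well; the difference is only in how many bytes each one takes.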
Repaired version #1 -- which reads and writes UTF-8

The repair was to imbue the output, as well as the input, and write the BOM to the file.

Andy

#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <codecvt>

#include <cstdio>
#include <io.h>
#include <fcntl.h>

using namespace std;

int main(){
	int oldMode = _setmode(_fileno(stdout), _O_U16TEXT);

	wifstream input(L"LaTeXHeader.txt");
	wofstream output(L"messagesConverted.txt");

	locale utf8_locale(locale(), new codecvt_utf8<wchar_t>);
	input.imbue(utf8_locale); 
	output.imbue(utf8_locale); // Also imbue output

	wchar_t bom = L'\0';
	input.get(bom);

	output << L'\xFEFF'; // write BOM

	wstring line;
	
	while (!input.eof()){
		getline(input, line);
		output << line << endl;
	}
	input.close();

	// don't reset mode till later
	//_setmode(_fileno(stdout), oldMode);

	input.open(L"messages.txt");
	input.get(bom);
	
	while (!input.eof()){
		getline(input, line);
		output << line << endl;
	}

	input.close();
	output.close();

	_setmode(_fileno(stdout), oldMode);

	return 0;
}
And this version reads and writes Unicode (little-endian)

Andy

#include <iostream>
#include <fstream>
#include <string>
//#include <locale>
//#include <codecvt>

#include <cstdio>
#include <io.h>
#include <fcntl.h>

using namespace std;

int main(){
	int oldMode = _setmode(_fileno(stdout), _O_U16TEXT);

	FILE* fp_in  = _wfopen(L"LaTeXHeader.txt", L"r,ccs=UNICODE");
	FILE* fp_out = _wfopen(L"messagesConverted.txt", L"w,ccs=UNICODE");

	wifstream input_1(fp_in);
	wofstream output(fp_out);

	// For UTF-16, don't imbue
	//locale utf8_locale(locale(), new codecvt_utf8<wchar_t>);
	//input.imbue(utf8_locale); 
	//output.imbue(utf8_locale); // Also imbue output

	// BOM handled automatically
	//wchar_t bom = L'\0';
	//input_1.get(bom);

	// BOM handled automatically
	//output << L'\xFEFF'; // write BOM

	wstring line;
	
	while (!input_1.eof()){
		getline(input_1, line);
		output << line << endl;
	}
	//input.close();
	fclose(fp_in);

	// don't reset mode till later
	//_setmode(_fileno(stdout), oldMode);

	//input.open(L"messages.txt");
	fp_in  = _wfopen(L"messages.txt", L"r,ccs=UNICODE");
	wifstream input_2(fp_in);
	// BOM handled automatically
	//input_2.get(bom);
	
	while (!input_2.eof()){
		getline(input_2, line);
		output << line << endl;
	}

	//input.close();
	fclose(fp_in);
	output.close();

	_setmode(_fileno(stdout), oldMode);

	return 0;
}
PS This

	wstring line;

	while (!input.eof()){
		getline(input, line);
		output << line << endl;
	}


is better written as

	wstring line;
	
	while (getline(input, line)) {
		output << line << endl;
	}


Andy
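
The reason the second form is preferable can be shown with a short sketch. The two counting functions below are hypothetical and use an in-memory `std::istringstream` instead of a file. The eof flag is only set once a read runs into the end of the input, so with text that ends in a newline the `!eof()` loop goes round one extra time with a failed `getline` and an empty `line`, which in the programs above would write a spurious blank line to the output.

```cpp
#include <sstream>
#include <string>

// Counts loop iterations using the eof-tested form.
int lines_with_eof_loop(const std::string& text) {
    std::istringstream in(text);
    std::string line;
    int n = 0;
    while (!in.eof()) {
        std::getline(in, line);  // may fail, but the body still runs
        ++n;
    }
    return n;
}

// Counts loop iterations using getline itself as the loop condition.
int lines_with_getline_loop(const std::string& text) {
    std::istringstream in(text);
    std::string line;
    int n = 0;
    while (std::getline(in, line)) {  // body runs only after a successful read
        ++n;
    }
    return n;
}
```

For the input `"a\nb\n"` the first function counts 3 iterations and the second counts 2. When the text has no trailing newline the two forms agree.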