Reading and printing accented characters

Hey guys!

I am writing a program that reads a text file with some accented characters sprinkled across it. I was able to read and display the file successfully; however, all the accented characters came out wrong (some Greek/mathematical symbols were displayed instead)! Shown below is the program, followed by the contents of the text file:


#include <iostream>
#include <fstream>
#include <queue>
#include <string>   // std::string, getline
#include <cstdlib>  // system()
using namespace std;

void loadFile(ifstream& inp, queue<string>& qList);

int main()
{
    queue<string> queueList; 
    ifstream infile;
    infile.open("accentedChar.txt");
    
    
    
    loadFile(infile, queueList);
    while (!queueList.empty())
    {
        cout<<queueList.front()<<endl;
        queueList.pop();
    }
    
    cout<<endl<<endl;
    // wait until user is ready before terminating program
    system("PAUSE");
    return 0;
}


void loadFile(ifstream& inp, queue<string>& qList)
{
    string str;
    
    // getline in the loop condition stops cleanly at end of file,
    // so no spurious empty line is pushed after the last read fails
    while (getline(inp, str))
    {
        qList.push(str);
    }
}



el tipo sabe cómo dar una bienvenida
todo lo demás
sin daño, no hay culpa
tal engreído y fatuo


How do I go about reading and displaying these accented characters?
The standard advice to display Unicode on a Windows console is: don't do it. Too many things need to go right to do it properly, and not all of them can be controlled from your program. It's so unreliable that it's practically never worth the effort.
Trust me, I speak from experience. Just forget it.

EDIT: Wow. My thought patterns seem to be really consistent. Here I am giving the same advice with almost the same phrasing one year earlier:
http://www.cplusplus.com/forum/beginner/168354/
Everything has to do with encoding, which is to say, the "character set" in use.

I presume you are on Windows.
(Life is easier on *nixen; stick to UTF-8.)

First, your Windows console must be configured to use a Unicode font. Unfortunately, there isn't a single answer that works everywhere, but for what you want "Lucida Console" should suffice.

Next, your console must be configured for the appropriate "code page". The default in English-speaking countries is typically either 437 (OEM US) or 1252 (ANSI Latin 1--Western European). UTF-8 is 65001. A list of code pages can be found on MS's site:
http://www.google.com/search?btnI=1&q=msdn+Code+Page+Identifiers

You can change it with the chcp shell command (or by using the SetConsoleOutputCP() function):
chcp 65001
SetConsoleOutputCP( 65001 );
This is not guaranteed to work, however.

Once you set the output code page, you must write using that character set. This may or may not match the character set in your input file, so you must be careful to attend to that.

Good luck!
Thanks for your replies, guys! There is a discernible theme of pessimism and resignation in your informed replies; it seems achieving what I intended would be an uphill task, if possible at all. Nevertheless, I want to address some of the issues Duoas raised.

Yes, I'm using Windows: Windows 7 to be exact.

You wrote:
... but for what you want "Lucida Console" should suffice.
What is Lucida Console?
Is it related in any way to the general family of Lucida typefaces, or does it have some special hardware significance?

Is it the Command Prompt window one uses for the chcp shell command?

How do I know from my computer what code page it is using currently? And if I decide to change it, I'm guessing the process is reversible, right?
How do I go about reading and displaying these accented characters?

Unlike Linux, which supports Unicode natively, Windows will make you work for it.

This is the smallest code that works for me right now (my console happens to use the Lucida Console font by default, and the only other option, Consolas, also works). I use Visual Studio 2015, but the codecvt facilities have been there since Visual Studio 2010.

If the file is stored in UTF-16le with a BOM (choose "Unicode" in the Save As dialog in Notepad, even though I take issue with calling that "Unicode")

#include <iostream>
#include <fstream>
#include <codecvt>
#include <io.h>
#include <fcntl.h>

int main() {
	std::wifstream f("D:\\test.txt", std::ios::binary);
	f.imbue(std::locale(f.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
	std::wstring str(std::istreambuf_iterator<wchar_t>(f), {});
	_setmode(_fileno(stdout), _O_WTEXT);
	std::wcout << str << '\n';
}


If the file is stored as UTF-8 with a BOM (choose the UTF-8 option in the Save As dialog in Notepad), the only difference is that I'm using codecvt_utf8 instead of codecvt_utf16:

#include <iostream>
#include <fstream>
#include <codecvt>
#include <io.h>
#include <fcntl.h>

int main() {
	std::wifstream f("D:\\test.txt", std::ios::binary);
	f.imbue(std::locale(f.getloc(), new std::codecvt_utf8<wchar_t, 0x10ffff, std::consume_header>));
	std::wstring str(std::istreambuf_iterator<wchar_t>(f), {});
	_setmode(_fileno(stdout), _O_WTEXT);
	std::wcout << str << '\n';
}


If you're up to it, you can find how to tell UTF-16 apart from UTF-8 from within your program (check BOM before conversion or use a library that can do that for you). And steer clear of what Windows nonsensically calls "ANSI".
It is a pain, especially proper input, but it is doable.
How you do it depends entirely on your compiler.

On Windows, use MSVC.

I'm moving right now, so I don't have time to give you anything substantial, but you can start with the code found here: https://alfps.wordpress.com/2011/12/08/unicode-part-2-utf-8-stream-mode/

Be aware that his code has a couple of subtle flaws, but will work as-is for most of what you want to do. If you want to fix the Ctrl-Z handling, you can, but that might be over your head... What he has works for most use cases.

(I also dislike his boiler-plate suit-approved coding style, but that's me.)

EDIT: also, he's wrong about UTF-8. Use UTF-8 internally. The only time you need to convert is when reading and writing the text. To get you started:
  std::wistream& operator >> ( std::wistream& ins, std::string& s )
  {
    std::wstring ws;
    ins >> ws;
    s = std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> ().to_bytes( ws );
    return ins;
  }
  
  std::wostream& operator << ( std::wostream& outs, const std::string& s )
  {
    return outs << std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> ().from_bytes( s );
  }
  
  std::wistream& getline( std::wistream& ins, std::string& s, wchar_t delimiter )
  {
    std::wstring ws;
    std::getline( ins, ws, delimiter );
    s = std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> ().to_bytes( ws );
    return ins;
  }


Good luck!
Thanks Cubbi and Duoas; I'll look into the "solutions" you offered. And, very likely, hit you with more questions later. LOL. Good luck with the move, Duoas!
What language is the file actually?
A while ago I had problems displaying German text properly. The fix was simply to change the locale.
@Thomas1965,

The language of the file I'm trying to read is Spanish.

Did the German texts you displayed have accented characters?
Did the German texts you displayed have accented characters?

They had some German umlauts - ä, ö, etc.

Try this:
#include <locale.h>
setlocale(LC_ALL, "spanish");
or
setlocale(LC_ALL, "esp");

Locales have absolutely nothing to do with how the console displays text.
Locales modulate the way your program handles text.

On Windows, either use WriteConsoleW() or use wcout after _setmode( 1, _O_U8TEXT ); (the magic number 0x40000 is _O_U8TEXT). Use MSVC.

// ← This file must have a UTF-8 BOM!
// (VS2015 Update 2 does not need it, but earlier versions do.)

#include <iostream>

#ifndef NOMINMAX
#define NOMINMAX
#endif
#include <windows.h>
#include <io.h>
#include <fcntl.h>

int main()
{
  _setmode( 1, _O_U8TEXT );
  _setmode( 2, _O_U8TEXT );

  std::wcout << L"Y ¿qué has visto en nosotros, de repente?\n";
}  

Again, getting Unicode (BMP) input from the console requires special attention as per the article I linked for you.
Locales have absolutely nothing to do with how the console displays text.

Why then does it display the German umlauts correctly, when before it did not?

#include <iostream>
#include <string>
#include <locale.h>

using namespace std;
 
int main()
{
  string text = "Sie können ihren ausweis dort verlängern.";
  cout << "Trying to output German text...\n";
  cout << text;
  cout << "\nSetting locale now to german";
  setlocale(LC_ALL, "german");
  cout << "\nTrying to output German text...\n";
  cout << text;
  system("pause");
  return 0;
}


OUTPUT
Trying to output German text...
Sie k÷nnen ihren ausweis dort verlõngern.
Setting locale now to german
Trying to output German text...
Sie können ihren ausweis dort verlängern.Press any key to continue . . .

The Windows function that converts multi-byte encodings to UTF-16 before passing the result to the output function decides which encoding its input is in based on the current locale.

If you output UTF-16 from the beginning, like Duoas says, you bypass this problem.
For some reason, Thomas1965's suggestion has resolved my problem! As a matter of fact, just one line from the code snippet he provided did the trick:

setlocale(LC_ALL, "spanish");

I intentionally did not even add the header file #include <clocale> and it still worked! Why?

Anyway, the manner of this problem resolution has sort of filled me with some understandable mixed feelings, especially against the backdrop of the dense code blocks offered by Douas and Cubbi. The crucial question I have is: How could this be so simply resolved when programming gurus provided what, with all due respect, looked like 'overkill'? Or is there something I am missing here?

By the way, I am using Dev-C++ and, Duoas, I checked the links you helpfully provided.
Your program will not work if compiled on a computer with a different locale than yours, for one. It will also not work if, further down the line, you need to use characters from two different and incompatible locales.
@ helios:
Are you implying that theirs (Duoas' and Cubbi's) would work for all/multiple locales?
Cubbi's code doesn't have any special characters, it just reads data from a file, so the compiler is irrelevant.
Duoas' code will work as long as the source is encoded in UTF-8.
Topic archived. No new replies allowed.