C++ with Boost, Windows 7, read UTF-8 file with äöü special characters

Hello,

I have the following problem:

Let's pretend we have two text files:
list_utf8.txt (UTF-8 encoded, 30 bytes in size)
list_ansi.txt (ANSI encoded, 24 bytes in size)


Both contain just one line:
C:\Testfile with äöü.dat

This file really does exist at C:\Testfile with äöü.dat

Now I read both files with Boost, print the input, and check whether the file exists. This is the result:
(Picture)
http://home.arcor.de/gabbafrog/ccc.png

As you can see, it doesn't print the special characters correctly (this isn't that important), but the biggest problem is that it doesn't find the file when the path is read from the UTF-8 file!

What do I have to change in the code so that it finds the existing file?

I guess I have to change something about "string"? Or do I have to tell Boost somehow that the file being read is in UTF-8 format? The reason I use Boost is that I want the program to be portable (Windows and Linux). Please help me, I am a beginner in C++.

Thank you!

#include <iostream>
#include <boost/filesystem.hpp>
#include <boost/filesystem/fstream.hpp>
#include <boost/algorithm/string/predicate.hpp>
 
using namespace std;
using namespace boost::filesystem;
 
int main(int argc, char* argv[])
{
    path p_list_utf ("C:\\list_utf8.txt");
    boost::filesystem::ifstream inFileUTF(p_list_utf);
    while (inFileUTF)
    {
        string s;
        getline(inFileUTF,s);
        if (inFileUTF)
        {
            cout << p_list_utf << endl << s << endl;
            if ( exists(s) )
            {
                cout << "File exists!" << endl;
            }
            else
            {
                cout << "File DOES NOT exist!" << endl;
            }
        }
    }
 
    path p_list_ansi ("C:\\list_ansi.txt");
    boost::filesystem::ifstream inFileANSI(p_list_ansi);
    while (inFileANSI)
    {
        string s;
        getline(inFileANSI,s);
        if (inFileANSI)
        {
            cout << p_list_ansi << endl << s << endl;
            if ( exists(s) )
            {
                cout << "File exists!" << endl;
            }
            else
            {
                cout << "File DOES NOT exist!" << endl;
            }
        }
    }
 
    return 0;
}

The filesystem library is only there to provide a unified interface to filesystem operations (directory iteration, file properties, etc.); it doesn't perform character encoding conversions for you. If you give it a wide string, though, it will do its best.

Unfortunately, Windows handles character encodings differently from other OSes, so if you want to be portable, you will end up with quite a few #ifdefs (although Boost.Locale can help here).

Here's a Linux version using plain C++ for the conversions.

#include <iostream>
#include <locale>
#include <string>   // std::wstring and std::getline
#include <boost/filesystem.hpp>
#include <boost/filesystem/fstream.hpp>

namespace fs = boost::filesystem;

int main()
{
    std::locale::global(std::locale("en_US.utf8"));
    std::wcout.imbue(std::locale());

    fs::path p_list_utf ("list_utf8.txt");
    fs::wifstream inFileUTF(p_list_utf);
    inFileUTF.imbue(std::locale("en_US.utf8"));
    for(std::wstring s; getline(inFileUTF,s); )
    {
        std::wcout << p_list_utf << '\n' << s << '\n';
        if (fs::exists(s))
            std::wcout << "File exists!\n";
        else
            std::wcout << "File DOES NOT exist!\n";

    }

    fs::path p_list_ansi ("list_ansi.txt");
    fs::wifstream inFileANSI(p_list_ansi);
    inFileANSI.imbue(std::locale("en_US.iso88591"));
    for(std::wstring s; getline(inFileANSI,s); )
    {
        std::wcout << p_list_ansi << '\n' << s << '\n';
        if ( fs::exists(s) )
            std::wcout << "File exists!\n";
        else
            std::wcout << "File DOES NOT exist!\n";
    }
}

test:
$ hexdump -C list_utf8.txt 
00000000  54 65 73 74 66 69 6c 65  20 77 69 74 68 20 c3 a4  |Testfile with ..|
00000010  c3 b6 c3 bc 2e 64 61 74  0a                       |.....dat.|
00000019
$ hexdump -C list_ansi.txt 
00000000  54 65 73 74 66 69 6c 65  20 77 69 74 68 20 e4 f6  |Testfile with ..|
00000010  fc 2e 64 61 74 0a                                 |..dat.|
00000016
$ ls -l Testfile\ with\ äöü.dat 
-rw-r--r-- 1 cubbi cubbi 0 Feb  5 22:03 Testfile with äöü.dat
$ ./test
"list_utf8.txt"
Testfile with äöü.dat
File exists!
"list_ansi.txt"
Testfile with äöü.dat
File exists!


I'll post a Windows version if I can put one together in a bit.
Windows version:

#include <iostream>
#include <locale>
#include <fstream>
#include <codecvt>
#include <string>
#include <boost/filesystem.hpp>
#include <boost/filesystem/fstream.hpp>
#include <fcntl.h>
#include <io.h>
namespace fs = boost::filesystem;
int main()
{
    _setmode(_fileno(stdout), _O_WTEXT); // for output

    fs::path p_list_utf("list_utf8.txt");
    // there is no Unicode locale in Windows, but there is C++11 locale-independent Unicode
    fs::ifstream inFileUTF(p_list_utf);
    std::wbuffer_convert<std::codecvt_utf8<wchar_t>> inFilebufConverted(inFileUTF.rdbuf());
    std::wistream inFileConverted(&inFilebufConverted);
    for(std::wstring s; getline(inFileConverted, s); )
    {
        std::wcout << p_list_utf.c_str() << '\n' << s << '\n';
        if (fs::exists(s))
            std::wcout << "File exists!\n";
        else
            std::wcout << "File DOES NOT exist!\n";
    }
	
    fs::path p_list_ansi ("list_ansi.txt");
    fs::wifstream inFileANSI(p_list_ansi);
    for(std::wstring s; getline(inFileANSI,s); )
    {
        std::wcout << p_list_ansi.c_str() << '\n' << s << '\n';
        if ( fs::exists(s) )
            std::wcout << "File exists!\n";
        else
            std::wcout << "File DOES NOT exist!\n";
    }
}

02/05/2013  10:18 PM                23 list_ansi.txt
02/05/2013  10:16 PM                26 list_utf8.txt
02/05/2013  10:19 PM                 0 Testfile with äöü.dat
\Debug>ConsoleApplication5.exe
list_utf8.txt
Testfile with äöü.dat
File exists!
list_ansi.txt
Testfile with äöü.dat
File exists!


PS: Do look into Boost.Locale: http://www.boost.org/doc/libs/release/libs/locale/doc/html/index.html
WOW!
This REALLY helped me a lot!

Thank you so much for the hard work of giving me those two examples for Windows and Linux! I will adapt my main program to your solution :)

This works really well! But I have two more questions:

If the program can't find a file, I have to enter the correct path at the command prompt.

I have tried it the following way:

std::wstring newpath;
std::getline(std::wcin, newpath);
std::wcout << "Your file: " << newpath << std::endl;

But this doesn't work; it again prints bad characters for "äöü". Can you tell me how I can convert the "newpath" variable so that it contains the correct characters?

I tried, but this whole code page business is very confusing for me. Here
http://www.boost.org/doc/libs/1_53_0/libs/locale/doc/html/collate_8cpp-example.html
I found an example where the input of special characters works, but I didn't manage to make that code work within my program.

Another question: I read that C++0x has native Unicode support. Does that mean I wouldn't have these problems with C++0x?

Thank you again!
For wide character input from cin on Windows, try _setmode(_fileno(stdin), _O_WTEXT); (I won't be able to test for a while, so it's only a suggestion)

"I read that C++0x does have native Unicode support"

C++11 has native Unicode string literals and locale-independent Unicode conversions, one of which I used in the Windows example above (from UTF-8 to the native wide character format). I'm afraid it won't help you interpret keyboard input on Windows without a Windows-specific call such as _setmode.
_setmode(_fileno(stdin), _O_WTEXT); doesn't work here :(
Every special char (äöü) becomes a "???"
Works for me, Visual Studio 2012:
#include <iostream>
#include <string>
#include <fcntl.h>
#include <io.h>

int main()
{ 
    std::wstring newpath;

    _setmode(_fileno(stdin), _O_WTEXT); // for input
    std::getline(std::wcin, newpath);

    _setmode(_fileno(stdout), _O_WTEXT); // for output
    std::wcout << "Your file: " << newpath << '\n';
}


test with äöü
Your file: test with äöü
Press any key to continue . . .