Help on text conversion to UTF-8

Hi mates,
May I get some help or direction on writing a function to convert
a text string to a UTF-8 encoded one?

I adapted some code for the reverse process
by gathering ideas from here,
but frankly speaking without a complete understanding of the code used.


> function to convert text string to utf-8 encoded one?

No conversion is required; in a sequence of bytes (char), whether a. each byte represents a distinct character, or b. sub-sequences of one or more bytes represent a single character, is merely a matter of interpretation. In standard C++, this interpretation is usually done by the codecvt facet of the locale in effect.

The type of a plain string literal "hello\U000031F3" and of a UTF-8 encoded string literal u8"helloㇳ"
is in both cases array of const char - const char[].

#include <iostream>
#include <string>
#include <locale>
#include <fstream>

int main()
{
    // no conversion is required, just use std::string to hold the bytes in a multi-byte utf-8 string     
    const std::string str = "a \U00000062" // one byte (octet) each
                            " \U000000BE \u011C" // two bytes each (one byte each for space)
                            " \u20AC \U000031F3"  ; // three bytes each (one byte for space)
                        
    
    // input and output work as expected, if we set the stream's locale to a utf-8 locale
    std::cout.imbue( std::locale( "C.UTF-8" ) ) ; // set the stream's locale to UTF-8
    std::cout << str << '\n' ;
    
    std::ifstream this_file( __FILE__ ) ;
    this_file.imbue( std::locale( "C.UTF-8" ) ) ; // set the stream's locale to UTF-8
    std::string line ; 
    for( int i = 0 ; i<5 ; ++i ) if( std::getline( this_file, line ) ) std::cout << i << ". " << line << '\n' ;
    
    std::locale::global( std::locale( "C.UTF-8" ) ) ; // set the default (global) locale if we want utf-8 for all new streams
    std::ofstream( "test_utf8.txt" ) << "file test_utf8.txt: " << str << '\n' ; // the newly constructed stream is imbued with the global locale
    
    // however, string operations size(), [], substr() etc. operate on bytes and not utf-8 characters
    // and string iterators iterate over each byte, not each utf-8 character.
    std::cout << "size in bytes: " << str.size() << '\n' ; // size in bytes, not characters
    unsigned char c = str[6] ; std::cout << "byte at str[6]: " << std::hex << std::showbase << int(c) << '\n' ; // byte, not character
    for( unsigned char  byte : str ) std::cout << int(byte) << ' ' ; // iterates over bytes, not characters (note: byte-order)
    std::cout << '\n' ;
}

http://coliru.stacked-crooked.com/a/e372bfdbb070fe15
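
The point made in the comments of the program above, that size() counts bytes and not characters, can be checked by hand. The following is only a minimal sketch: utf8_length is a hypothetical helper name, not part of the standard library, and it assumes the string holds well-formed UTF-8 (nothing is validated).

```cpp
#include <cstddef>
#include <string>

// Counts UTF-8 code points by counting every byte that is NOT a
// continuation byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumption: the input is well-formed UTF-8; no validation is done.
std::size_t utf8_length( const std::string& str )
{
    std::size_t count = 0 ;
    for( unsigned char byte : str )
        if( ( byte & 0xC0 ) != 0x80 ) ++count ; // lead byte or plain ASCII byte
    return count ;
}
```

For the string used in the program above (assuming it is stored as UTF-8), str.size() is 17 while utf8_length(str) is 11.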

C++11 does not have convenient mechanisms to access the individual utf-8 characters in a sequence of char, or to take care of byte-ordering and BOM markers seamlessly. There are many libraries floating around that make this possible; a library that uses idiomatic C++ constructs would make things easy.

For instance, with UTF8-CPP http://utfcpp.sourceforge.net/
#include <iostream>
#include <string>
#include "utf8.h"
#include <iomanip>

utf8::iterator< std::string::const_iterator > utf8_begin( const std::string& str )
{ return utf8::iterator<std::string::const_iterator>( str.begin(), str.begin(), str.end() ) ; }

utf8::iterator< std::string::const_iterator > utf8_end( const std::string& str )
{ return utf8::iterator<std::string::const_iterator>( str.end(), str.begin(), str.end() ) ; }

int main()
{
    // no conversion is required, just use std::string to hold the bytes in a multi-byte utf-8 string
    const std::string str = "a\U000000A1" // one byte for 'a', two bytes for U+00A1
                            "\U000000BE\u011C" // two bytes each
                            "\u20AC\U000031F3 "  ; // three bytes each (one byte for space)

    for( auto utf8_iter = utf8_begin(str) ; utf8_iter != utf8_end(str) ; ++utf8_iter )
    {
        const char32_t code_point = *utf8_iter ;
        const auto str_iter = utf8_iter.base() ;
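        // cast to an integer type: since C++20, operator<< for char32_t on a char stream is deleted
        std::cout << std::hex << "\\U" << std::setw(4) << std::setfill('0') << static_cast<unsigned long>(code_point)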
                  << " starting at byte offset " << std::dec << std::setw(2) << std::setfill(' ')
                  << str_iter - str.begin() << '\n' ;
    }
}

\U0061 starting at byte offset  0
\U00a1 starting at byte offset  1
\U00be starting at byte offset  3
\U011c starting at byte offset  5
\U20ac starting at byte offset  7
\U31f3 starting at byte offset 10
\U0020 starting at byte offset 13


Edit: boost::locale::boundary in the heavyweight Boost.Locale http://www.boost.org/doc/libs/1_58_0/libs/locale/doc/html/boundary_analysys.html
has more functionality. Unlike UTF8-CPP which is header-only, Boost.Locale must be built.
Thanks to all of you guys for the attention and time given to my still-true-beginner-at-C++ topic.

I have some understanding of C, and some hobby practice - that's all.
C++ is new to me in all aspects, and I can't expect to run fast after reading one book on it and the tutorials on this site. Operator overloads and templates of functions and classes might be easy to swallow in general terms, but when I look at their real implementation in libraries and references it is quite difficult at this point.

You see now that I can hardly take good advantage of the solid and experienced advice you have given so generously. So maybe it is better to just explain what I am doing, so as not to confuse you further and waste your time, and after that get back to the basics of the libraries and references here; I really can't learn that fast.

I got interested in building a program that helps me compose tracks in Google Earth and similar apps (Oruxmap on Android). My resource is a folder of kml files: simple UTF-8 encoded tracks built manually in the Google Earth environment or recorded automatically while in motion with a handheld smartphone with GPS (the app mentioned before does that quite well). All those tracks form a spider web if opened simultaneously in Google Earth. The goal was to automate the process of composing a new track connecting any two nodes on my track net, under the simple criterion of minimizing the distance traveled.

So ... I registered here, downloaded MS Community 2013 and got coding, mostly the C way, without nested classes, just functions operating on a statically reserved database. I did it with some recursion and was glad of the result. It worked fine with the test database. The problem was importing real data, and exporting it after the manipulation.

I copy-pasted some code to help me read the directory with the files,
then again some to decode their UTF-8 to text (big thanks to Duoas here!), and finally I needed some code to encode the solution track back to UTF-8 again (the reason for this topic).
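
For the encoding step mentioned above, here is a minimal sketch, assuming the decoded text is held as Unicode code points in a std::u32string. The names append_utf8 and to_utf8 are hypothetical helpers, not taken from this thread or from any library. Replacing invalid code points with U+FFFD is just one possible policy.

```cpp
#include <string>

// Appends the UTF-8 encoding of one Unicode code point to 'out'.
// Returns false (and appends nothing) for surrogates and for values above U+10FFFF.
bool append_utf8( std::string& out, char32_t cp )
{
    if( cp < 0x80 )                       // 1 byte: 0xxxxxxx
        out += static_cast<char>( cp ) ;
    else if( cp < 0x800 )                 // 2 bytes: 110xxxxx 10xxxxxx
    {
        out += static_cast<char>( 0xC0 | ( cp >> 6 ) ) ;
        out += static_cast<char>( 0x80 | ( cp & 0x3F ) ) ;
    }
    else if( cp < 0x10000 )               // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    {
        if( cp >= 0xD800 && cp <= 0xDFFF ) return false ; // surrogates are not characters
        out += static_cast<char>( 0xE0 | ( cp >> 12 ) ) ;
        out += static_cast<char>( 0x80 | ( ( cp >> 6 ) & 0x3F ) ) ;
        out += static_cast<char>( 0x80 | ( cp & 0x3F ) ) ;
    }
    else if( cp <= 0x10FFFF )             // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    {
        out += static_cast<char>( 0xF0 | ( cp >> 18 ) ) ;
        out += static_cast<char>( 0x80 | ( ( cp >> 12 ) & 0x3F ) ) ;
        out += static_cast<char>( 0x80 | ( ( cp >> 6 ) & 0x3F ) ) ;
        out += static_cast<char>( 0x80 | ( cp & 0x3F ) ) ;
    }
    else return false ;                   // beyond the Unicode range
    return true ;
}

// Encodes a whole sequence of code points; invalid ones become U+FFFD (an assumed policy).
std::string to_utf8( const std::u32string& text )
{
    std::string result ;
    for( char32_t cp : text )
        if( !append_utf8( result, cp ) ) append_utf8( result, 0xFFFD ) ;
    return result ;
}
```

The resulting std::string can be written to a file as is, since UTF-8 is just a sequence of bytes.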

It turned out that on my machine sizeof gives 1 for char, unsigned char and signed char, 2 for wchar_t and 4 for unsigned, and I spent some time on type/file conversion tactics. But the really good news was that the simple kml files I use consist of standard ASCII characters after all, each a single UTF-8 byte with no need to be encoded/decoded - my program works quite well by my hobby standards, exporting the text solution directly into a kml file. About two weeks for 500-600 lines of code :)
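
The observation that plain ASCII text needs no encoding can be checked in code. This is only a sketch, and is_ascii is a hypothetical helper name: every byte below 0x80 is a complete UTF-8 character on its own, so a string made only of such bytes is already valid UTF-8.

```cpp
#include <algorithm>
#include <string>

// True if every byte is below 0x80, i.e. the text is plain ASCII.
// Such text is already valid UTF-8, byte for byte.
bool is_ascii( const std::string& text )
{
    return std::all_of( text.begin(), text.end(),
                        []( unsigned char byte ) { return byte < 0x80 ; } ) ;
}
```

If is_ascii returns false for a file's contents, the bytes would have to be interpreted as multi-byte UTF-8 sequences instead.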

Finally, I will rely on my technical curiosity (not being a pro programmer) to explore your advice and methods for character set conversion and string manipulation; I might need them for something else. Hard thinking is pleasurable sometimes, you know :)

Many thanks again!
Best Regards, K.