But how? I use Code::Blocks and GCC. wstring doesn't have a getline function to take a line from a file. And there is no strtok for wide chars.
With string and getline I can take lines. The file is already UTF-8. Then what?
//To get a wide line
wfstream fs8;
wstring line;
getline( fs8, line);
//To store words in a vector from wide string
vector<wstring> words;
wstring::size_type pos;
while (true)
{
    pos = line.find(L' ');
    if ( pos != wstring::npos )
    {
        words.push_back(line.substr(0,pos));
        line.erase(0,pos+1); //notice that this will modify your starting string
    }
    else
    {
        words.push_back(line);
        break;
    }
}
Last I checked, wstring doesn't do UTF-8. While STL streams are specifically designed to handle such things, the prescribed ones don't.
You need to convert the UTF-8 to standard wchar_t strings. It isn't actually too difficult, but if all you want is a quick answer, I refer you to the GNU iconv library (libiconv): http://www.gnu.org/software/libiconv/
Once your UTF-8 string data is converted to a wstring, you can then use all the usual find() methods and string functions like getline() over wstringstreams.
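For instance, once a line has been decoded into a wstring, getline() over a wistringstream can do the word splitting without modifying the original string. This is only a minimal sketch; split_wide is a made-up helper name, not something from the standard library.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split an already-decoded wide string on a delimiter.
// The input string is left untouched.
std::vector<std::wstring> split_wide(const std::wstring& line, wchar_t delim = L' ')
{
    std::vector<std::wstring> words;
    std::wistringstream in(line);
    std::wstring word;
    // getline() with a delimiter reads up to (and discards) each delimiter
    while (std::getline(in, word, delim))
        words.push_back(word);
    return words;
}
```

Two delimiters in a row produce an empty word, so filter those out if you don't want them.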
Hope this helps.
[edit]
Hey, here's something that may be more useful:
I have an ancient Greek text file in Notepad on Windows XP. This file can be saved as UTF-8 or Unicode. Actually, opening it in Notepad++ makes converting between these very easy. Here is some code from user helios for converting.
But with wfstream fs8; wstring line; getline( fs8, line ); the file fs8 is already UTF-8. The line isn't?
Is there something practical for doing this?
#define BOM8A 0xEF
#define BOM8B 0xBB
#define BOM8C 0xBF
typedef unsigned char uchar;
//Returns a new[]-allocated, zero-terminated string; the caller must delete[] it.
wchar_t *UTF8_to_WChar(const char *string){
    long b=0,
         c=0;
    if ((uchar)string[0]==BOM8A && (uchar)string[1]==BOM8B && (uchar)string[2]==BOM8C)
        string+=3;
    //Count the characters: every byte that is not a continuation byte starts one.
    for (const char *a=string;*a;a++)
        if (((uchar)*a)<128 || (*a&192)==192)
            c++;
    wchar_t *res=new wchar_t[c+1];
    res[c]=0;
    for (uchar *a=(uchar*)string;*a;a++){
        if (!(*a&128))
            //Byte represents an ASCII character. Direct copy will do.
            res[b]=*a;
        else if ((*a&192)==128)
            //Byte is the middle of an encoded character. Ignore.
            continue;
        else if ((*a&224)==192)
            //Byte represents the start of an encoded character in the range
            //U+0080 to U+07FF
            res[b]=((*a&31)<<6)|(a[1]&63);
        else if ((*a&240)==224)
            //Byte represents the start of an encoded character in the range
            //U+0800 to U+FFFF
            res[b]=((*a&15)<<12)|((a[1]&63)<<6)|(a[2]&63);
        else if ((*a&248)==240){
            //Byte represents the start of an encoded character beyond the
            //U+FFFF limit of 16-bit integers
            res[b]=L'?';
        }
        b++;
    }
    return res;
}
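The function above writes '?' for anything beyond U+FFFF. If you need those characters and your wchar_t is only 16 bits, the usual way out is a UTF-16 surrogate pair. Here is a rough sketch of that arithmetic; high_surrogate, low_surrogate and append_code_point are names I made up for illustration, and the sketch assumes the code point is already a valid value up to U+10FFFF.

```cpp
#include <string>

// Hypothetical helpers: split a code point above U+FFFF into its two
// UTF-16 surrogates. Subtracting 0x10000 leaves 20 bits: the top 10 go
// into the high surrogate, the bottom 10 into the low surrogate.
inline unsigned high_surrogate(unsigned long cp)
{
    return 0xD800 + static_cast<unsigned>((cp - 0x10000) >> 10);
}
inline unsigned low_surrogate(unsigned long cp)
{
    return 0xDC00 + static_cast<unsigned>((cp - 0x10000) & 0x3FF);
}

// Hypothetical helper: append one code point to a wide string, using a
// surrogate pair only when wchar_t is too small to hold it directly.
void append_code_point(std::wstring& out, unsigned long cp)
{
    if (sizeof(wchar_t) == 2 && cp > 0xFFFF)
    {
        out += static_cast<wchar_t>(high_surrogate(cp));
        out += static_cast<wchar_t>(low_surrogate(cp));
    }
    else
        out += static_cast<wchar_t>(cp);
}
```

With a 32-bit wchar_t the code point is stored as a single element, so the same call works on both kinds of compiler.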
C++ streams have no concept of encoding characteristics: each element is considered an independent entity.
Hence, when you use any of the STL iostreams to read a UTF-8 sequence, it is not decoded into the proper characters. (Even the stinkin' wstream objects can't do that.)
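To see what "not decoded" means in practice, compare the number of bytes a std::string holds with the number of characters those bytes encode. A small sketch (utf8_length is a hypothetical helper, and it assumes the input is well-formed UTF-8):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count the characters (code points) in a UTF-8 byte
// string. Continuation bytes all look like 10xxxxxx, so every byte that
// does NOT match that pattern starts a new character.
std::size_t utf8_length(const std::string& bytes)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < bytes.size(); ++i)
        if ((static_cast<unsigned char>(bytes[i]) & 0xC0) != 0x80)
            ++n;
    return n;
}
```

For the UTF-8 bytes of "¿Qué pasa?", size() reports 12 while utf8_length() reports 10, because ¿ and é each take two bytes.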
For example, if you save the following, using Notepad (or Notepad++, presumably) with "UTF-8" in the encoding combobox of the Save As dialogue, you will get a little UTF-8 file, including the obnoxious BOM that Windows programs add to UTF-8 files.
Hello world! What's up?
¡Hola mundo! ¿Qué pasa?
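The BOM in such a file is the three bytes EF BB BF at the very start. If you already have the raw bytes in a std::string, dropping it can look something like this (strip_utf8_bom is a made-up name; this is just a sketch):

```cpp
#include <string>

// Hypothetical helper: return the input without a leading UTF-8 BOM
// (the bytes EF BB BF). Anything else is returned unchanged.
std::string strip_utf8_bom(const std::string& bytes)
{
    if (bytes.size() >= 3 &&
        static_cast<unsigned char>(bytes[0]) == 0xEF &&
        static_cast<unsigned char>(bytes[1]) == 0xBB &&
        static_cast<unsigned char>(bytes[2]) == 0xBF)
        return bytes.substr(3);
    return bytes;
}
```

Checking all three bytes matters: 0xEF on its own is also the first byte of ordinary three-byte characters in the range U+F000 to U+FFFF.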
Here is an example of how to use C++ to convert such a file into a wchar_t stream (string or file).
// utf8-to-wchar_t.cpp
//
// This program is an example of how to read a UTF-8 encoded file into a
// wchar_t sequence (be it a string or, as in this case, another file).
//
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
using namespace std;
//----------------------------------------------------------------------------
// Here's a little consumer-transformer following the STL design philosophy.
// Notice how, since UTF-8 is bound to specific bit-patterns, our types are
// only generic in what the input and output containers are.
//
// For more on the UTF-8 layout, see
//
// http://en.wikipedia.org/wiki/Utf8
//
// Specifically,
//   0xxxxxxx                             --> 00000000 00000000 0xxxxxxx
//   110yyyyy 10xxxxxx                    --> 00000000 00000yyy yyxxxxxx
//   1110zzzz 10yyyyyy 10xxxxxx           --> 00000000 zzzzyyyy yyxxxxxx
//   11110www 10zzzzzz 10yyyyyy 10xxxxxx  --> 000wwwzz zzzzyyyy yyxxxxxx
//
// Notice how the first form is identical to ASCII.
//
// This algorithm does NOT consider whether or not your wchar_t is large
// enough to hold a 21-bit character. (UTF-8 is specified over U+0000 to
// U+10FFFF. GCC on Linux uses a 32-bit wchar_t, but Windows compilers,
// MinGW included, use a 16-bit wchar_t, truncating the range to U+0000
// to U+FFFF.)
//
template <
  typename InputIterator,
  typename OutputIterator
  >
OutputIterator utf8_to_wchar_t(
  InputIterator  begin,
  InputIterator  end,
  OutputIterator result
  ) {
  for (; begin != end; ++begin)
  {
    int      count = 0;  // the number of bytes in the UTF-8 sequence
    unsigned c     = (unsigned char)*begin;
    unsigned i     = 0x80;
    // Resynchronize after errors (which shouldn't happen):
    // a stray continuation byte is simply skipped
    if ((c & 0xC0) == 0x80)
      continue;
    // Now we count the number of bytes in the sequence...
    for (; c & i; i >>= 1) ++count;
    // ...and strip the high-code-bits from the character value
    c &= i - 1;
    // Now we build the resulting wchar_t by
    // appending all the character bits together
    for (; count > 1; --count)
    {
      c <<= 6;
      c |= (*++begin) & 0x3F;
    }
    // Skip the UTF-8 BOM that Windows programs add
    //
    // The BOM is U+FEFF, encoded as EF BB BF. We test the decoded value
    // rather than the first byte, because 0xEF is also the lead byte of
    // every character from U+F000 to U+FFFF. (Note that this drops
    // U+FEFF wherever it appears, not only at the start.)
    //
    if (c == 0xFEFF)
      continue;
    // And we store the result in the output container
    *result = c;
    ++result;
  }
  // The usual generic stuff
  return result;
}
//----------------------------------------------------------------------------
int complain( const char* filename, const char* method )
{
  cerr
    << "I could not open the file \""
    << filename
    << "\" for "
    << method
    << endl;
  return 1;
}
//----------------------------------------------------------------------------
// This little type is to help with actual wide streams (since the STL doesn't
// have any -- see widen() and narrow() for all the disappointing details).
//
struct widechar
{
  typedef enum { big_endian, little_endian } endianness_t;
  unsigned value;
  widechar( unsigned value = 0 ): value( value ) { }
  static endianness_t endianness() { return e; }
  static void endianness( endianness_t endianness ) { e = endianness; }
private:
  static endianness_t e;
};
widechar::endianness_t widechar::e = widechar::big_endian;
//............................................................................
ostream& operator << ( ostream& outs, widechar wc )
{
  if (wc.endianness() == widechar::little_endian)
    for (int i = 0; i < 4; ++i)
    {
      outs << (char)(wc.value & 0xFF);
      wc.value >>= 8;
    }
  else
    for (int i = 24; i >= 0; i -= 8)
    {
      outs << (char)((wc.value >> i) & 0xFF);
    }
  return outs;
}
//----------------------------------------------------------------------------
int main( int argc, char** argv )
{
  // If necessary, give the user instructions
  if (argc < 3)
  {
    cout <<
      "Convert a UTF-8 file to a wchar file.\n"
      "usage:\n  " << argv[ 0 ] << " UTF8-FILENAME WCHAR-FILENAME\n";
    return 1;
  }
  // Otherwise, convert the named UTF-8 input file to the named wchar_t output
  ifstream inf( argv[ 1 ], ios::binary );
  ofstream outf( argv[ 2 ], ios::binary );
  if (!inf) return complain( argv[ 1 ], "reading" );
  if (!outf) return complain( argv[ 2 ], "writing" );
  inf >> noskipws; // We want all data (including spaces, newlines, etc).
  // This will help on Win32; the command prompt will display a little-endian
  // stream correctly, but it will display a big-endian stream with some garbage.
  widechar::endianness( widechar::little_endian );
  outf << (widechar)0x0000FEFF; // byte order mark
  // Here I use an iostream iterator directly, but any appropriate sequence
  // container will do. You can convert std::strings or whatever you like
  // in the usual way.
  //
  utf8_to_wchar_t(
    istream_iterator <char> (inf),
    istream_iterator <char> (),
    ostream_iterator <widechar> (outf)
    );
  outf.close();
  inf.close();
  //..........................................................................
  // Here's an example using a wstring sequence
  //
  // Again, iostream_iterators play havoc with streams, so we just reopen
  // the file to play safe.
  inf.clear(); // before C++11, open() does not reset the EOF state for us
  inf.open( argv[ 1 ], ios::binary );
  inf >> noskipws;
  // For each line of text...
  string line;
  unsigned line_number = 1;
  while (getline( inf, line ))
  {
    // ...First convert it to a wstring
    wstring wline;
    utf8_to_wchar_t(
      line.begin(),
      line.end(),
      back_insert_iterator <wstring> (wline)
      );
    // Then see if it has the Spanish leading-question mark (¿) in it
    wstring::size_type index = wline.find( (wchar_t)0xBF );
    cout << "line " << line_number << ": ";
    if (index == wstring::npos)
      cout << "the upside-down question-mark does not appear in this line.\n";
    else
      cout << "the upside-down question-mark is at index " << (index + 1) << "\n";
    ++line_number;
  }
  inf.close();
  return 0;
}
// end utf8-to-wchar_t.cpp
This code just converts UTF-8 to wchar_t; it does not go the other way.
If you want to convert wchar_t to UTF-8, it is very much the same process (though a bit easier, since the input stream is not encoded).
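As a rough idea of what the reverse direction could look like, here is a sketch that turns a wstring back into UTF-8 bytes. wchar_t_to_utf8 is a hypothetical name, and the sketch assumes each wchar_t holds one whole code point (so no surrogate pairs on a 16-bit wchar_t).

```cpp
#include <string>

// Hypothetical helper: encode a wide string as UTF-8.
// Assumption: every element of 'in' is a complete code point.
std::string wchar_t_to_utf8(const std::wstring& in)
{
    std::string out;
    for (std::wstring::size_type k = 0; k < in.size(); ++k)
    {
        unsigned long c = static_cast<unsigned long>(in[k]);
        if (c < 0x80)
        {
            // 0xxxxxxx: plain ASCII
            out += static_cast<char>(c);
        }
        else if (c < 0x800)
        {
            // 110yyyyy 10xxxxxx
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
        else if (c < 0x10000)
        {
            // 1110zzzz 10yyyyyy 10xxxxxx
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
        else
        {
            // 11110www 10zzzzzz 10yyyyyy 10xxxxxx
            out += static_cast<char>(0xF0 | ((c >> 18) & 0x07));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

The branches mirror the four bit layouts of UTF-8: pick the sequence length from the size of the value, then deal the bits out six at a time.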