wchar_t and reading unicode strings


Hello. I am a newbie in C++. I have a text file in UTF-8 format containing ancient characters, and I need to read all of its words into strings that preserve the Unicode characters so that I can work with them.
I have been trying for a few days and can't get it to work, using many pieces of code from around the web. All of them open the file, but I need to put the words into strings, and all of those words are in the ancient script.
I am using the Code::Blocks IDE and the GNU GCC compiler. Any ideas?
Thanks,
Jim

#include <iostream>
#include <string>
#include <fstream>
#include <vector>
using namespace std;
struct dataLine {
string gNames;
string sNames;
};
int main (int argc, char * const argv[]) {
vector<dataLine*> dataList;
ifstream infile("input.txt");
if (!infile) {
std::cout<<"\nThe file was not successfully opened. Please check it exists." << endl;
return 1;
}
string line = "";
while (getline(infile, line)) {
int spaceLocation = line.find_first_of(' ');
dataLine *newLine = new dataLine(); // Allocate memory for new line
newLine->gNames = line.substr(0, spaceLocation);
newLine->sNames = line.substr(spaceLocation+1, line.length()-spaceLocation);

dataList.push_back(newLine); // Add the new line to our vector
cout << "Read: " << newLine->gNames << " - " << newLine->sNames << endl;
}
infile.close();
// Free the memory to prevent a leak
for (unsigned i = 0; i < dataList.size(); ++i) delete dataList[i];
return 0;//
}
I also tried this:

// Copy a file
#define _UNICODE // Tell C we're using Unicode, notice the _
#include <tchar.h> // Include Unicode support functions
#include <fstream> // standard header; the old <fstream.h> is not available in current GCC
#include <cstdio>  // printf
#include <stdlib.h>

using namespace std;
/* The file test.txt is a Unicode text file.
*/
unsigned char* ASCIItoUNICODE (unsigned char ch);
unsigned int* ConvertString (char *string);
unsigned int* ConvertString (unsigned char *string);
void UnicodePrint(unsigned int* Message);

int main () {
//unsigned char *text ;
//unsigned int *UnMess;
printf("Content-type: text/plain\n\n\n");
char * buffer;
long size;
wchar_t * wcstr;
ifstream infile ("test.txt",ifstream::binary);
ofstream outfile ("new.txt",ofstream::binary);
// get size of file
infile.seekg(0,ifstream::end);
size=infile.tellg();
infile.seekg(0);
// allocate memory for file content
buffer = new char [size];
// read content of infile
infile.read (buffer,size);
// write to outfile
// outfile.write (buffer,size);
//for(long i=0;i<size;i++)outfile << (char)((int)buffer[i]); // here the file is recreated exactly the same, with a double conversion
//for(long i=0;i<size;i++)outfile << buffer[i];// here the file is recreated exactly the same, without conversion
for(long i=0;i<size;i++)outfile << (int)buffer[i];// here the ASCII codes of the characters are written to an ANSI file
//for(long i=0;i<size;i++)outfile << buffer[i]<< "="<< (int)buffer[i]<< "\n"; // here we have a combination; the file is ANSI
// release dynamically-allocated memory
//wchar_t mystring[] = _TEXT("buffer");size= sizeof(mystring);for(long i=0;i<size;i++)outfile << mystring[i]<<"\n";
// outfile.write (mystring,size);
//UnMess = ConvertString(buffer);
//for(long i=0;i<size;i++)outfile << UnMess[i]<<"\n";// here the file is recreated exactly the same, without conversion
outfile.close();
infile.close();
delete[] buffer;
//size_t n=mbstowcs (wcstr, buffer, size);
return 0;}

unsigned char* ASCIItoUNICODE (unsigned char ch)
{static unsigned char Val[2]; // static, so the returned pointer stays valid after the call
if ((ch < 192)&&(ch != 168)&&(ch != 184)) {Val[0] = 0; Val[1] = ch; return Val;}
if (ch == 168) {Val[0] = 208; Val[1] = 129; return Val;}
if (ch == 184) {Val[0] = 209; Val[1] = 145; return Val;}
if (ch < 240) {Val[0] = 208; Val[1] = ch-48; return Val;}
if (ch < 249) {Val[0] = 209; Val[1] = ch-112; return Val;}
Val[0] = 0; Val[1] = '?'; return Val;} // values 249..255 were left unmapped; '?' is a placeholder

unsigned int* ConvertString (unsigned char *string)
{unsigned int size=0, *NewString;
unsigned char* Uni;
while (string[size++]!=0);
NewString = (unsigned int*)malloc(sizeof(unsigned int)*2*size-1);
NewString[0]=2*size-1;
size=0;
while (string[size]!=0)
{Uni = ASCIItoUNICODE(string[size]);
NewString[2*size+1]=Uni[0];
NewString[2*size+2]=Uni[1];
size++;}return NewString;}

unsigned int* ConvertString (char *string)
{unsigned int size=0, *NewString;
unsigned char* Uni;
while (string[size++]!=0);
NewString = (unsigned int*)malloc(sizeof(unsigned int)*2*size-1);
NewString[0]=2*size-1;
size=0;
while (string[size]!=0)
{Uni = ASCIItoUNICODE(string[size]);
NewString[2*size+1]=Uni[0];
NewString[2*size+2]=Uni[1];
size++;}return NewString;}

kbw (9492)

1. UTF-8 is a multi-byte character set. As a reminder, the first 128 code points (0 to 127) are one-byte characters and match ASCII, and the remainder take two or more bytes. So when reading from a file, you need to read a byte, check whether it starts a multi-byte sequence, and if it does, read the rest of the sequence; then you're ready to process a character.

2. You'd help yourself if you produced a UTF-8 string class and confined all the UTF-8 handling there, freeing the rest of the code from that burden. You could test it in isolation and convert to/from Unicode code points and to/from a byte stream in a 'clean' environment.
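
As an illustration of the "read a byte, then check whether more bytes follow" step, here is a minimal sketch. The function names are hypothetical (they are not from this thread or from any library), and the counting function assumes well-formed input rather than validating it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with `lead`, or 0 if `lead` cannot start a sequence.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80)           return 1; // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                            // continuation byte or invalid lead
}

// Hypothetical helper: counts characters (code points) by jumping from
// lead byte to lead byte. Stops early at a malformed or truncated sequence.
std::size_t utf8_count_code_points(const std::string& s) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t n = utf8_sequence_length(static_cast<unsigned char>(s[i]));
        if (n == 0 || i + n > s.size())
            break;
        i += n;
        ++count;
    }
    return count;
}
```

For example, the string "\xCE\xB1\xCE\xB2\xCE\xB3" (the Greek letters alpha, beta, gamma written as explicit UTF-8 bytes) has a size() of 6 but contains 3 code points. A std::string holds UTF-8 bytes without any problem; only code that counts or indexes characters needs this kind of logic.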

dkaip (196)

Thanks, I'll try it and post the results.

Disch (13742)

UTF-8 characters can actually be up to 4 bytes long. kbw is correct about the first 128 code points being ASCII-compatible.

I found that Wikipedia's article on UTF-8 is informative on this subject and easy to understand:

http://en.wikipedia.org/wiki/Utf-8#Description

wchar_t is effectively UTF-16 which is 2 bytes wide, but can be 4 bytes (2 characters) if the code is above U+FFFF. To support these codes (which are extremely rare in my experience) you need to use surrogate pairs. Again I'll defer to wikipedia's explanation, as it is pretty well explained:

http://en.wikipedia.org/wiki/UTF-16/UCS-2#Encoding_of_characters_outside_the_BMP

In my experience, lib functions that take wchar_t* strings handle surrogate pairs correctly


helios (17607)

UTF-8 characters can actually be up to 4 bytes long.

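Strictly speaking, the original UTF-8 design could encode values up to 31 bits wide, using as many as 6 bytes.
Unicode, however, is a subset of UCS and covers only code points U+0000 to U+10FFFF, and UTF-8 as now standardized (RFC 3629) encodes every one of them in at most 4 bytes; U+10FFFF itself is F4 8F BF BF.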

wchar_t is effectively UTF-16 which is 2 bytes wide

The size of wchar_t is implementation-dependent, and is only guaranteed to be no smaller than char. wchar_t could even be an 8-bit integer, by this definition. In practice it is 16 bits on Windows and 32 bits on most Unix-like systems.

In my experience, lib functions that take wchar_t* strings handle surrogate pairs correctly

The wide versions of the str* functions -- such as wcslen(), wcscmp(), etc. -- assume the string to be encoded flatly, so to speak. I.e. two wide characters {0xD800,0xDC00} (the UTF-16 representation of U+10000) are interpreted as the individual code points U+D800 and U+DC00, regardless of whether they are valid or not.
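
To make the surrogate-pair arithmetic concrete, here is a small sketch. The helper names are hypothetical, and the counting function assumes that wchar_t holds UTF-16 code units, which, as noted above, is platform-specific.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers (not part of any library mentioned in this thread).
inline bool is_high_surrogate(std::uint32_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(std::uint32_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combines a valid high/low surrogate pair into a single code point.
inline std::uint32_t combine_surrogates(std::uint32_t high, std::uint32_t low) {
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

// Counts code points in a nul-terminated string of UTF-16 code units.
// A high surrogate followed by a low surrogate counts as one code point;
// every other unit (including an unpaired surrogate) counts as one.
std::size_t utf16_count_code_points(const wchar_t* s) {
    std::size_t count = 0;
    while (*s) {
        std::uint32_t first = static_cast<std::uint32_t>(s[0]);
        std::uint32_t second = static_cast<std::uint32_t>(s[1]);
        if (is_high_surrogate(first) && is_low_surrogate(second))
            s += 2; // skip both halves of the pair
        else
            s += 1;
        ++count;
    }
    return count;
}
```

With the pair {0xD800, 0xDC00} from the example above, combine_surrogates yields 0x10000, and a string made of that pair followed by one ordinary character counts as 2 code points here, whereas wcslen() would report 3 units.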

dkaip (196)

I am trying to understand all of this. The subject is new to me. While looking for a UTF-8 decoder in C++ I found http://utfcpp.sourceforge.net/. When I find a little time I will study it. Thanks a lot.
Jim

helios (17607)

My own design:

/*
* Copyright (c) 2009, Helios (helios.vmg@gmail.com)
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*     * Redistributions of source code must retain the above copyright notice,
*       this list of conditions and the following disclaimer.
*     * Redistributions in binary form must reproduce the above copyright
*       notice, this list of conditions and the following disclaimer in the
*       documentation and/or other materials provided with the distribution.
*
* THIS SOFTWARE IS PROVIDED BY HELIOS "AS IS" AND ANY EXPRESS OR IMPLIED
* WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
* MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
* EVENT SHALL HELIOS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
* WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
* OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
* OF THE POSSIBILITY OF SUCH DAMAGE.
*/

typedef unsigned char uchar;

/*
string: a UTF-8-encoded C string (nul terminated)
Return value: a wchar_t C string.

The function handles memory allocation on its own.

Limitations: Only handles the range [U+0000;U+FFFF], higher code points are
changed to '?'.

Assumptions: sizeof(wchar_t)>=2
*/
wchar_t *UTF8_to_WChar(const char *string){
	long b=0,
		c=0;
	if ((uchar)string[0]==BOM8A && (uchar)string[1]==BOM8B && (uchar)string[2]==BOM8C)
		string+=3;
	for (const char *a=string;*a;a++)
		if (((uchar)*a)<128 || (*a&192)==192)
			c++;
	wchar_t *res=new wchar_t[c+1];
	res[c]=0;
	for (uchar *a=(uchar*)string;*a;a++){
		if (!(*a&128))
			//Byte represents an ASCII character. Direct copy will do.
			res[b]=*a;
		else if ((*a&192)==128)
			//Byte is the middle of an encoded character. Ignore.
			continue;
		else if ((*a&224)==192)
			//Byte represents the start of an encoded character in the range
			//U+0080 to U+07FF
			res[b]=((*a&31)<<6)|a[1]&63;
		else if ((*a&240)==224)
			//Byte represents the start of an encoded character in the range
			//U+0800 to U+FFFF
			res[b]=((*a&15)<<12)|((a[1]&63)<<6)|a[2]&63;
		else if ((*a&248)==240){
			//Byte represents the start of an encoded character beyond the
			//U+FFFF limit of 16-bit integers
			res[b]='?';
		}
		b++;
	}
	return res;
}

//Do not call me.
long getUTF8size(const wchar_t *string){
	if (!string)
		return 0;
	long res=0;
	for (;*string;string++){
		if (*string<0x80)
			res++;
		else if (*string<0x800)
			res+=2;
		else
			res+=3;
	}
	return res;
}

/*
string: a wchar_t C string (nul terminated)
Return value: a UTF-8-encoded C string.

The function handles memory allocation on its own.

Limitations: Only handles the range [U+0000;U+FFFF], higher code points are
changed to '?'.

Assumptions: sizeof(wchar_t)>=2
*/
char *WChar_to_UTF8(const wchar_t *string){
	long fSize=getUTF8size(string);
	char *res=new char[fSize+1];
	res[fSize]=0;
	if (!string)
		return res;
	long b=0;
	for (;*string;string++,b++){
		if (*string<0x80)
			res[b]=(char)*string;
		else if (*string<0x800){
			res[b++]=(*string>>6)|192;
			res[b]=*string&63|128;
		}else{
			res[b++]=(*string>>12)|224;
			res[b++]=((*string&4095)>>6)|128;
			res[b]=*string&63|128;
		}
	}
	return res;
}
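
The functions above stop at U+FFFF by design. For comparison, here is a hedged sketch of an encoder that also covers the 4-byte range up to U+10FFFF. It is not helios's code: the name append_utf8 is hypothetical, it works on std::string rather than raw buffers, and it borrows the convention above of substituting '?' for values it will not encode.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of code point `cp` to `out`.
// Surrogates (U+D800..U+DFFF) and values above U+10FFFF are not encodable
// code points, so they are replaced with '?'.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 1 byte
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp >= 0xD800 && cp <= 0xDFFF) {
        out += '?';                                            // lone surrogate
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += '?';                                            // out of range
    }
}
```

Calling append_utf8 with 0x3B1 (Greek small alpha) appends the two bytes CE B1, and calling it with 0x10FFFF appends F4 8F BF BF.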

anders43 (125)

This is the link I usually give people who don't know Unicode that well:

http://www.joelonsoftware.com/articles/Unicode.html

dkaip (196)

Thank you all for your help. Today I got it working with the utfcpp project from sf.net.
This is the code. The test.txt file is copied to new.txt, and the two files are the same.
But I will also try Mr. helios's program, and thanks for the help. It is very important to have some code to work from ...
Good day.

#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include <iterator> // back_inserter
#include "utf8.h"
using namespace std;

int main(int argc, char* argv[])
{
    // The input file name is hard-coded, so argc and argv are not used.
    const char* test_file_path = "test.txt";
    ofstream outfile ("new.txt"/*,ofstream::binary*/);
    // Open the test file (must be UTF-8 encoded)
    ifstream fs8(test_file_path);
    if (!fs8.is_open()) {
    cout << "Could not open " << test_file_path << endl;
    return 0;
    }
    // Read the first line of the file
    unsigned line_count = 1;
    string line;
    if (!getline(fs8, line))
        return 0;
    // Look for utf-8 byte-order mark at the beginning
    if (line.size() > 2) {
        if (utf8::is_bom(line.c_str()))
            cout << "There is a byte order mark at the beginning of the file\n";
    }
    // Play with all the lines in the file
    do {
       // check for invalid utf-8 (for a simple yes/no check, there is also utf8::is_valid function)
        string::iterator end_it = utf8::find_invalid(line.begin(), line.end());
        if (end_it != line.end()) {
            cout << "Invalid UTF-8 encoding detected at line " << line_count << "\n";
            cout << "This part is fine: " << string(line.begin(), end_it) << "\n";
        }
        // Get the line length (at least for the valid part)
        int length = utf8::distance(line.begin(), end_it);
        cout << "Length of line " << line_count << " is " << length <<  "\n";
        // Convert it to utf-16
        vector<unsigned short> utf16line;
        utf8::utf8to16(line.begin(), end_it, back_inserter(utf16line));
        // And back to utf-8
        string utf8line;
        utf8::utf16to8(utf16line.begin(), utf16line.end(), back_inserter(utf8line));
        // Confirm that the conversion went OK:
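        if (utf8line != string(line.begin(), end_it))
            cout << "Error in UTF-16 conversion at line: " << line_count << "\n";
        // Copy the line to the output file
        outfile << line << endl;
        line_count++;
    } while (getline(fs8, line)); // stop when no further line can be read, so a last line without a newline is not lost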
    return 0;
}


dkaip (196)

Because I am a newbie there are some things I don't know. The compiler reports an error: BOM8A, BOM8B and BOM8C must be declared first.
Thanks,
Jim

helios (17607)

Sorry. I had originally commented out those lines but later decided against it. Add this at the top:

#define BOM8A 0xEF
#define BOM8B 0xBB
#define BOM8C 0xBF
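
Those three bytes (EF BB BF) are the UTF-8 byte-order mark that some editors write at the start of a file. If the lines are read into std::string rather than a C string, the same check could be sketched as follows; strip_utf8_bom is a hypothetical name, not part of the code above or of utfcpp.

```cpp
#include <string>

// Hypothetical helper: removes a leading UTF-8 byte-order mark (EF BB BF)
// from `s` if one is present. Returns true when a BOM was removed.
bool strip_utf8_bom(std::string& s) {
    if (s.size() >= 3 &&
        static_cast<unsigned char>(s[0]) == 0xEF &&
        static_cast<unsigned char>(s[1]) == 0xBB &&
        static_cast<unsigned char>(s[2]) == 0xBF) {
        s.erase(0, 3);
        return true;
    }
    return false;
}
```

It only makes sense to call this on the first line of a file, since the mark appears at most once, at the very beginning.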

dkaip (196)

It works just fine.
I tried strtok but could not find it in Code::Blocks, and wstrtok does not exist either. I don't know how to get the words out of a line.
Is there code for doing this?
When I read a line from a UTF-8 file with wfstream fs8; wstring line; getline(fs8, line); the file fs8 is already UTF-8, I think, so the line must be UTF-8. Then I must convert it to wchar_t characters for editing. Is there something practical?

//To get a wide line
wfstream fs8;
wstring line;
getline( fs8, line);

//To store words in a vector from wide string
vector<wstring>words;
wstring::size_type pos;
while (true)
{
	pos = line.find(L' ');
	if ( pos != wstring::npos )
	{
		words.push_back(line.substr(0,pos));               
		line.erase(0,pos+1);//notice that this will modify your starting string                
	}else
	{
		words.push_back(line);
		break;
	}
}
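
If the words only need to be separated and stored, the conversion to wchar_t may not be necessary at all: in UTF-8, bytes below 0x80 never occur inside a multi-byte sequence, so a UTF-8 std::string can be split on the ASCII space directly. A minimal sketch under that assumption follows; split_words is a hypothetical name.

```cpp
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits a UTF-8 encoded line into words on ASCII
// whitespace. The bytes of multi-byte characters are copied untouched.
std::vector<std::string> split_words(const std::string& line) {
    std::vector<std::string> words;
    std::istringstream in(line);
    // Use the classic "C" locale so that no byte >= 0x80 is treated as space.
    in.imbue(std::locale::classic());
    std::string word;
    while (in >> word)
        words.push_back(word);
    return words;
}
```

Unlike the find/erase loop above, this leaves the input string unmodified and skips runs of consecutive spaces instead of producing empty words. Each word is still UTF-8 and can be converted later if character-level editing is needed.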
