unicode character

In Mongolian Cyrillic, each character takes 2 bytes:
'а' - 2 bytes,
'б' - 2 bytes,
'в' - 2 bytes,
'г' - 2 bytes,
'д' - 2 bytes, and so on.
In the traditional Mongolian script, each character takes 3 bytes:
'ᠠ' - 3 bytes
'ᠳ' - 3 bytes
'ᠭ' - 3 bytes
'ᠳ' - 3 bytes, and so on.
English letters, digits, and special characters take 1 byte each.
So the size of the following string is 8 bytes:
'й' - 2 bytes, 'а' - 2 bytes, 'ᠭ' - 3 bytes, '1' - 1 byte.
 
string s="йаᠭ1";

I need to split this word letter by letter, like "йаᠭ1" => 'й', 'а', 'ᠭ', '1'.
So if I cut the first 2 bytes of this word, I get the letter "й".
string s1=s.substr(0,2);//й
string s2=s.substr(2,2);//а
string s3=s.substr(4,3);//ᠭ
string s4=s.substr(7,1);//1

The problem is how to split any word like this one, made up of characters from several scripts, letter by letter. How do I know whether I should cut 1, 2, or 3 bytes next?


You could convert each multibyte character to wide on its own (C-style functions mbtowc and mbrtowc can do that), and store the substrings individually:

#include <iostream>
#include <vector>
#include <string>
#include <cwchar>
#include <clocale>
int main()
{
    std::string s="йаᠭ1";
    std::vector<std::string> letters;

    std::setlocale(LC_ALL, ""); // or "en_US.utf8", or any other .utf8
    std::mbstate_t state = std::mbstate_t(); // initial state
    const char* ptr = s.c_str();
    const char* end = s.c_str() + s.size();
    int len;
    wchar_t wc;
    while((len = std::mbrtowc(&wc, ptr, end-ptr, &state)) > 0) {
        letters.push_back(std::string(ptr, ptr+len));
        ptr += len;
    }

    for(size_t n = 0; n < letters.size(); ++n)
        std::cout << "The size of the letter " << letters[n] << " in UTF-8 is " << letters[n].size() << '\n';
}

online demo: http://ideone.com/sa2gmJ

But normally you never need to know this; portable programming doesn't care how your characters are encoded.
If I understood you correctly, you are talking about MBCS, where the characters in a string can have different byte lengths: for example, a two-byte character, then a one-byte character, and then a three-byte character. This type of string is not used today, and a string is always uniform.

For these kinds of strings, APIs such as CharNext(), CharPrev(), and IsDBCSLeadByte() were used. But these are not used anymore.
This type of string is not used today
What? UTF-8 is not used anymore?
MiiNiPaa - you didn't notice what I am saying. I am not talking about UTF-8. I am talking about DBCS (kind of) and this is what I have understood from the OP's problem statement.
http://cplusplus.com/forum/general/99880/
Previous OP thread. It should help with his problem.
Looks like he wants to parse a UTF-8 string. The problem is that UTF-8 characters do not have a fixed length, and the subscript operator will return only parts of symbols.
@Jijgee - if you are using unicode characters, you should be using constant byte length for characters. Variable byte length will not work for you and will be complex.
@writeonsharma, But utf8 doesn't support fixed byte length for characters. What do I need to do?
@Cubbi, I tried your code. But result is not same as yours.
My result is:
The size of the letter � in UTF-8 is 1
The size of the letter � in UTF-8 is 1
The size of the letter � in UTF-8 is 1
The size of the letter � in UTF-8 is 1
The size of the letter � in UTF-8 is 1
The size of the letter � in UTF-8 is 1
The size of the letter � in UTF-8 is 1
The size of the letter 1 in UTF-8 is 1
My compiler is GCC. Do you think it is because of the compiler?
Did you save the file as UTF-8?
@Jigee: it was tested with gcc, and the link I posted uses gcc as well. You'll have to post more details: at least the OS and the user locale settings.
@Cubbi: The OS is windows 7 and the user locale is "Mongolian (Cyrillic)_Mongolia.1251".
setlocale(LC_ALL, "en_US.utf8");
printf("Locale is %s\n", setlocale(LC_ALL, NULL));

The output is
C
So setlocale did not work as expected. What is wrong?
That makes it more difficult: your OS doesn't provide Unicode conversion facets. If you could switch to Visual Studio (2010 or newer), you could use the locale-independent C++11 Unicode conversion facets, but GCC didn't implement those (typical GCC users are on Linux, where the OS actually provides all Unicode needs).

You will have to use a library; I would suggest ICU or boost.locale. For simple things (like what this thread is about), you could parse UTF-8 yourself; it's really simple (as Peter87 implied by his Wikipedia link earlier).
I solved the problem on windows.
#include <iostream>
#include <stdio.h>
#include <Windows.h>
using namespace std;
int main()
{
    LPCWSTR s=L"йаᠭ1";
    char str[8];
    int sizeRequired; // WideCharToMultiByte returns an int
    for(int n = 0; n < 4; ++n)
    {
      ZeroMemory(str, 8);
      // Convert one UTF-16 code unit to UTF-8; the return value is the number of bytes written.
      // This works here because every letter in s fits in a single UTF-16 code unit.
      sizeRequired = WideCharToMultiByte( CP_UTF8, 0, &s[n], 1, str, 4,  NULL, NULL);
      std::cout << "The size of the letter " << str << " in UTF-8 is " << sizeRequired << endl;
    }
    return 0;
}

Thank you everybody.