Why is the program giving me odd characters after Chinese characters?

closed account (42TXGNh0)
Here's the problem: I wrote this program to reverse text as an exercise while learning C++. It worked nicely with English characters. Then I started to think: what if I add Chinese characters to the program? It gives me something pretty strange. Anyway, here's the code:

#include <iostream>
#include <cstring>   // strlen
#include <windows.h> // Sleep

using namespace std;

int main()
{
	char p[]="Hello. 您好。";
	cout << "Text: " << p << endl;
	cout << "Reversed text: ";
	for(int i=strlen(p)-1;i>=0;i--)
	{
		if ((int)p[i]<0) // check whether this byte is part of a Chinese character
		{
			cout << p[i] << p[i+1];
			i--;
		}
		else {cout << p[i];}
		Sleep( 100 );
	}
	cout << endl;
	system("pause >nul");
	return 0;
}


The strange thing is that the reversed text always starts with an extra character that should not be there. For example:
---------- ---------- ---------- ---------- ----------
Text: Hello. 您好。
Reversed text: C。好您.olleH

---------- ---------- ---------- ---------- ----------
Text: Hello. 您好嗎?
Reversed text: H?嗎好您.olleH

---------- ---------- ---------- ---------- ----------
It just doesn't seem to make sense. I tried to prevent it by changing for(int i=strlen(p)-1;i>=0;i--) to for(int i=strlen(p)-2;i>=0;i--), but then the last character disappears if the text ends with an English character. Can anyone help me fix this?
You are scanning the string backwards and determining if you have found a Chinese character by seeing if the byte value is >127 (which you aren't doing quite correctly since you are assuming that char is signed). But by then, you have already gone past, and printed, the other bytes that make up the character, and you've printed them in the wrong order. You want to reverse the characters, not the individual bytes that make up a multi-byte character.
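
The signedness point can be made concrete. Whether plain char is signed is implementation-defined, so a test like (int)p[i] < 0 only works where char happens to be signed. A minimal sketch of a portable check follows; it casts to unsigned char first, assumes the text is UTF-8, and uses the made-up helper names is_non_ascii_byte and is_continuation_byte:

```cpp
// Hypothetical helpers, assuming the text is UTF-8 encoded.

// True for any byte that is part of a multi-byte sequence (value 128..255).
// Casting to unsigned char makes the test independent of char's signedness.
bool is_non_ascii_byte(char ch) {
    return static_cast<unsigned char>(ch) >= 0x80;
}

// True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
bool is_continuation_byte(char ch) {
    return (static_cast<unsigned char>(ch) & 0xC0) == 0x80;
}
```

Telling lead bytes from continuation bytes is what lets a scanner find where each character starts.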

So to handle the Chinese characters (or UTF8 generally) you need to scan the string forwards. One possibility is to make an array of the start indices of the characters and then go through that in reverse to print the characters from the string. (see code below)

Alternatively, you could scan the string backwards but always keep up to three bytes in a buffer. That way, when you come across the lead byte of a multi-byte character, you have not already printed its other bytes in the wrong order.
#include <iostream>
#include <cstring>
using namespace std;

int nbytes(char ch) {
    int n = (unsigned char)ch;
    if      (n >= 240) return 4;
    else if (n >= 224) return 3;
    else if (n >= 192) return 2;
    return 1;
}

int main() {
    char p[]="Hello. 您好。";
    cout << "Text: " << p << endl;

    int len = strlen(p);
    //cout << "len: " << len << "\n\n";

    int a[100] = {0};
    int nchars = 0;
    for (int skip = 0, i = 0; i < len; i++) {
        if (skip > 0)
            --skip;
        else {
            a[nchars++] = i;
            skip = nbytes(p[i]) - 1;
        }
    }

    // for (int i = 0; i < nchars; i++) cout << a[i] << ' '; cout << '\n';    

    for (int i = nchars; i-- > 0; ) {
        cout << p[a[i]];
        int n = nbytes(p[a[i]]);
        for (int j = 1; j < n; j++)
            cout << p[a[i]+j];
    }
    cout << '\n';

    return 0;
}
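
The backward-scanning alternative mentioned above can be sketched too. This is only an illustration: it assumes valid UTF-8 input, and reverse_utf8 is a made-up name. Walking from the end, it holds continuation bytes in a small buffer until it reaches the lead byte, then emits the whole character in its original byte order.

```cpp
#include <cstddef>
#include <string>

// Sketch only: assumes `text` is valid UTF-8. `reverse_utf8` is a hypothetical name.
std::string reverse_utf8(const std::string& text) {
    std::string result;
    std::string pending;  // trailing bytes of the current character, in forward order

    for (std::size_t i = text.size(); i-- > 0; ) {
        unsigned char byte = static_cast<unsigned char>(text[i]);
        if ((byte & 0xC0) == 0x80) {
            // Continuation byte (10xxxxxx): hold it until the lead byte turns up.
            pending.insert(pending.begin(), text[i]);
        } else {
            // ASCII byte or lead byte: emit it, followed by any held bytes.
            result += text[i];
            result += pending;
            pending.clear();
        }
    }
    return result;
}
```

Printing reverse_utf8("Hello. 您好。") should give 。好您 .olleH, provided the source file and the terminal both use UTF-8.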

closed account (42TXGNh0)
Thanks for your reply, but sadly the code you provided still doesn't work for me.
It gives the same kind of result as the version without UTF-8 handling:
Text: Hello. 您好。
C》屹?.olleH

It works in http://cpp.sh but not in Dev-C++...
It works for me. You must have done something wrong (in Dev-C++, at least). You probably ran your old code somehow.
[output]
$ ./reverse
Text: Hello. 您好。
。好您 .olleH
[/output]
closed account (42TXGNh0)
I know that, but I created a completely new source file to test the code, so I definitely didn't run my old code. (T.T)
Never mind, I'll check the settings.
closed account (42TXGNh0)
Ugh... No idea what I have done wrong...
I overcomplicated the code. Here's a better version.
#include <iostream>
#include <cstring>
using namespace std;

int nbytes(char ch) {
    int n = (unsigned char)ch;
    if      (n >= 240) return  4;
    else if (n >= 224) return  3;
    else if (n >= 192) return  2;
    else if (n >= 128) return -1;
    return 1;
}

int main() {
    char p[]="Hello. 您好。";
    cout << "Text: " << p << endl;

    int len = strlen(p);

    for (int i = len; i-- > 0; ) {
        int n = nbytes(p[i]);
        if (n != -1)
            cout << p[i];
        for (int j = 1; j < n; j++)
            cout << p[i+j];
    }
    cout << '\n';

    return 0;
}

Do the non-reversed Chinese characters appear correctly?
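
One way to answer that question is to look at the raw bytes the compiler stored for the literal. If the source file was saved in a legacy encoding rather than UTF-8 (an assumption about the Dev-C++ setup, not something confirmed in this thread), the UTF-8 byte-count logic will not match the data. A small sketch, with hex_bytes as a made-up helper name:

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats each byte of `text` as two lowercase hex digits,
// separated by single spaces.
std::string hex_bytes(const std::string& text) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < text.size(); ++i) {
        unsigned value = static_cast<unsigned char>(text[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (i > 0) out += ' ';
        out += buf;
    }
    return out;
}
```

For a UTF-8 encoded 您, hex_bytes gives e6 82 a8. A two-byte result for the same character would point to a different source encoding.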