Convert chars to utf-8 hex strings

Hi, mates!

I am trying to convert some chars to UTF-8 strings...

Example:

std::string gethex(char c)
{
    /* EXAMPLE
       if (c == 'é')
           return "%c3%a9";

       etc...

       I need a function that converts chars like "á, é, í, ã" to UTF-8 hexadecimal strings...
    */
}

std::string encode(std::string str)
{
    static std::string unreserved = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_.~";
    std::string r;

    for (std::size_t i = 0; i < str.length(); i++)
    {
        char c = str.at(i);
        if (unreserved.find(c) != std::string::npos)
            r += c;
        else
            r += gethex(c);
    }

    return r;
}


http://www.url-encode-decode.com does it. Choose UTF-8, type some character and click 'Url Encode'.
Do you want URL encoding or UTF-8 encoding? They're very different.
I need a function that converts chars like "á, é, í, ã" to UTF-8 hexadecimal strings...


The thing is... characters in a string are already encoded as something. They have to be. So you have to ask yourself whether or not the string is already UTF-8 encoded.

If it isn't... you'll have to find out what encoding it's in, and convert that to UTF-8.

Once you have a UTF-8 string, it's just a matter of looking at (and printing) the values as integers rather than as chars:

char example = 'a';

cout << hex << static_cast<int>(example);  // prints '61' 
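
To make that concrete for a multi-byte character, here is a minimal sketch that walks a std::string byte by byte and formats each byte as two hex digits. The name hex_dump is made up for this illustration, and the input is assumed to already be UTF-8.

```cpp
#include <string>

// Hypothetical helper (not from the thread): returns the bytes of `s`
// as space-separated, two-digit lowercase hex values.
std::string hex_dump(const std::string& s)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;

    for (unsigned char byte : s)      // unsigned char: values 0..255, no sign extension
    {
        if (!out.empty()) out += ' ';
        out += digits[byte >> 4];     // high nibble
        out += digits[byte & 0x0F];   // low nibble
    }
    return out;
}
```

With this, hex_dump("a") gives "61", and hex_dump("\xC3\xA9") (the two UTF-8 bytes of é) gives "c3 a9".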
I want something like below:

if (c == 'é')
    return "%c3%a9";

if (c == 'á')
   return "%c3%a1";

etc
Yes, but 'c' in this case is just going to be an integer. All characters are represented by the computer as an integer.

The char data type is an integer type like int, only smaller in size. What it contains is really the integer code of a character.

So this:

char c = 'a';

if (c == 0x61) { /* ... */ }  // <- this will be true, because 'a' == 0x61 in ASCII


So if all you want is to print the character as an integer... then that is the code I already posted:

char example = 'a';

cout << hex << static_cast<int>(example);  // prints '61' 


But the real question here is: how is your 'c' encoded? Is it UTF-8 or is it some other encoding?

There is no way to solve this problem unless you know what kind of characters you're dealing with. In the end you just have a bunch of numbers, and in order to do this properly you need to know what those numbers represent.


So where are you getting 'c' from? A file? The user?
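
If the source of the bytes is unknown, one thing that can be checked is whether they are at least well-formed UTF-8. Below is a rough sketch of such a check. The name looks_like_utf8 is invented here, and the test is only a heuristic: text in another encoding can pass by coincidence, so a "true" result does not prove the text is UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): true if `s` is well-formed UTF-8.
bool looks_like_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;

    while (i < n)
    {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;      // continuation bytes expected after the lead byte
        unsigned long min_cp = 0;   // smallest code point allowed for this length
        unsigned long cp = 0;       // decoded code point

        if (lead < 0x80)                { ++i; continue; }   // plain ASCII
        else if ((lead & 0xE0) == 0xC0) { extra = 1; min_cp = 0x80;    cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; min_cp = 0x800;   cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; min_cp = 0x10000; cp = lead & 0x07; }
        else return false;              // stray continuation byte or invalid lead byte

        if (n - i <= extra) return false;                 // sequence is cut off

        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char cont = static_cast<unsigned char>(s[i + k]);
            if ((cont & 0xC0) != 0x80) return false;      // not a continuation byte
            cp = (cp << 6) | (cont & 0x3F);
        }

        if (cp < min_cp) return false;                    // overlong encoding
        if (cp > 0x10FFFF) return false;                  // beyond the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate

        i += extra + 1;
    }
    return true;
}
```

For example, the bytes C3 A1 (á in UTF-8) pass, while a lone E1 byte followed by E9 (á and é in Latin-1) is rejected.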
It is UTF-8 (hex).

In Javascript, it would be: http://pastebin.com/PaRgqfej

Here is a table: http://www.utf8-chartable.de/

Here is a sample: http://www.url-encode-decode.com/

Thanks in advance.
Something like this?

#include <string>
#include <sstream>
#include <iostream>

std::string hex( unsigned int c )
{
    std::ostringstream stm ;
    stm << '%' << std::hex << std::uppercase << c ;
    return stm.str() ;
}

std::string url_encode( const std::string& str )
{
    static const std::string unreserved = "0123456789"
                                            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                                            "abcdefghijklmnopqrstuvwxyz"
                                            "-_.~" ;
    std::string result ;

    for( unsigned char c : str )
    {
        if( unreserved.find(c) != std::string::npos ) result += c ;
        else result += hex(c) ;
    }

    return result ;
}

int main()
{
    std::string test = u8"Hello World! á, é, í, ã" ;
    std::cout << test << '\n'
               << url_encode(test) << '\n' ;
}

http://ideone.com/ssgW1h
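
One caveat with the snippet above: hex() streams the value without a field width, so a byte below 0x10 (a newline, for instance) comes out as a single digit, while percent-encoding is defined as '%' followed by exactly two hex digits. Here is a sketch of a variant that always emits two digits. The names percent_byte and url_encode_padded are made up for this example.

```cpp
#include <string>

// Hypothetical helper: '%' followed by exactly two uppercase hex digits.
std::string percent_byte(unsigned char byte)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out = "%";
    out += digits[byte >> 4];     // high nibble
    out += digits[byte & 0x0F];   // low nibble
    return out;
}

// Same idea as url_encode() above, but every escaped byte is two digits wide.
// Assumes `str` already holds UTF-8 bytes.
std::string url_encode_padded(const std::string& str)
{
    static const std::string unreserved = "0123456789"
                                          "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                                          "abcdefghijklmnopqrstuvwxyz"
                                          "-_.~";
    std::string result;

    for (unsigned char c : str)
    {
        const char ch = static_cast<char>(c);
        if (unreserved.find(ch) != std::string::npos) result += ch;
        else result += percent_byte(c);
    }
    return result;
}
```

With this, url_encode_padded("\n") gives "%0A" and url_encode_padded("\xC3\xA1") gives "%C3%A1".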
It gives hex output, but the result is not UTF-8 hex.

In UTF-8 encoding, "á" is "%c3%a1", not "%FFFFFFE1".
> In UTF-8 encoding, "á" is "%c3%a1"

It does give %C3%A1. See the output generated here: http://ideone.com/kmeE7O

To get lowercase hex digits, change line 8 (the stm << ... line in hex()):
// stm << '%' << std::hex << std::uppercase << c ;
stm << '%' << std::hex << std::nouppercase << c ;


> not "%FFFFFFE1".

Treat each byte in the utf-8 encoded string as an unsigned char;
the default char may be a signed integral type.
for( unsigned char c : str ) { /* ... */ }
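
The stray FFFFFF comes from sign extension: when char is signed, the byte 0xE1 holds the value -31, and converting that straight to unsigned int wraps it around to a large number. Here is a small sketch of the difference. The helper names are invented for this example, and the exact digits of the sign-extended result assume a 32-bit unsigned int.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: formats a value the same way as hex() in the thread,
// minus the leading '%'.
std::string as_hex(unsigned int value)
{
    std::ostringstream stm;
    stm << std::hex << std::uppercase << value;
    return stm.str();
}

// Converts the signed byte directly to unsigned int: a negative value is
// sign-extended (FFFFFFE1 for -31, assuming a 32-bit unsigned int).
std::string hex_sign_extended(signed char c)
{
    return as_hex(static_cast<unsigned int>(c));
}

// Goes through unsigned char first, so only the low 8 bits survive.
std::string hex_as_byte(signed char c)
{
    return as_hex(static_cast<unsigned char>(c));
}
```

Here hex_as_byte(-31) gives "E1", whereas hex_sign_extended(-31) does not.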
Here is my code:

main.cpp : http://pastebin.com/DA2g16LW

encode.h : http://pastebin.com/1xp6eBpS

It does compile. The problem is that the encoding does not work.

For example, you can run the project, press CTRL+SPACE, type 0, press Enter, type 2, press Enter, open notepad, type something and press F4.

SFML is needed.
> Here is my code:
> SFML is needed.

1. Write a simple text based (write to stdout) program to test your encode.h - something similar to the snippet I had posted.

2. If it does not work, post the (strictly non-SFML) code here, and we can have a look at it.
http://ideone.com/6r96IU

&q= is the important part.

It works fine there (Ideone).

But, when I compile using GCC, the result is:
%e1%e9%ed%f3%fa
Are you using an IDE like CodeBlocks?
If so, save your source file(s) with UTF-8 encoding. Menu => Edit => File Encoding => UTF-8

If not, just use Notepad: Menu => File => Save As => Encoding: UTF-8
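
The reason this matters: the bytes of a plain string literal typically follow the encoding of the source file. If you want a test string that does not depend on how the file was saved, one option is to spell out the UTF-8 bytes with hex escapes. A minimal sketch (the function name utf8_sample is made up):

```cpp
#include <string>

// Hypothetical test data: "á, é" written as explicit UTF-8 bytes, so the
// result is the same whatever encoding the source file is saved in.
std::string utf8_sample()
{
    return "\xC3\xA1, \xC3\xA9";   // C3 A1 = á, C3 A9 = é
}
```

The returned string is 6 bytes long: two for á, two for ", " and two for é.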
Now it works... But only when I type the string directly in the cpp file.

If I get the clipboard text (via code) and attempt to translate it, the conversion does not work well (%e1%e9 ...).
> Now it works... But only when I type the string direct in the cpp file.

It works when the text in question is UTF-8 encoded.


> If I get the clipboard text (via codes) and attempt to translate it, the conversion does not work well

It does not work when the text in question is not UTF-8 encoded.
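
The bytes e1 e9 ed f3 fa are what á é í ó ú look like in ISO-8859-1 (Latin-1), so the clipboard text is probably arriving in a single-byte encoding. Assuming it really is Latin-1 (that is a guess, and Windows-1252 differs in the 0x80 to 0x9F range), a conversion to UTF-8 could be sketched like this. The function name latin1_to_utf8 is made up for this example.

```cpp
#include <string>

// Hypothetical helper: converts ISO-8859-1 bytes to UTF-8.
// Every Latin-1 byte value equals its Unicode code point, so bytes below 0x80
// are copied as-is and the rest become a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& latin1)
{
    std::string utf8;

    for (unsigned char byte : latin1)
    {
        if (byte < 0x80)
        {
            utf8 += static_cast<char>(byte);
        }
        else
        {
            utf8 += static_cast<char>(0xC0 | (byte >> 6));     // lead byte: 110xxxxx
            utf8 += static_cast<char>(0x80 | (byte & 0x3F));   // continuation: 10xxxxxx
        }
    }
    return utf8;
}
```

For the byte E1 this produces C3 A1, which is the UTF-8 form of á that the percent-encoder is expected to receive.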