String encoding coming from libcurl

Hello!
I'm struggling to re-encode a string that contains an HTML document.
Everything works except for non-standard characters (French accented characters, but also some punctuation!) that aren't expressed as HTML escape sequences.
I think it's because their code points are encoded in hexadecimal and libcurl interprets them as decimal numbers, but sometimes a character even gets split into two chars!
I have tried changing the character encoding in Visual Studio and re-encoding the string, but it didn't work.
I am on Windows 10 x64, using Visual Studio 2019 Community.

A few examples ("->" means "becomes"):
Crédits -> Cr├®dits
badges… Retrouvez -> badgesÔǪ┬áRetrouvez


#include <iostream>
#include "curl/curl.h"
#include <string>
#include <locale> 
#include <codecvt> 

using namespace std;

// Write callback: libcurl hands us each chunk of the response body here;
// we append the raw bytes to the std::string passed via CURLOPT_WRITEDATA.
static size_t cb(void* contents, size_t size, size_t nmemb, void* userp)
{
    ((std::string*)userp)->append((char*)contents, size * nmemb);
    return size * nmemb;
}


int main(int argc, char* argv[])
{
    CURL* req = curl_easy_init();
    CURLcode res;
    string wow;
    if (req)
    {
        curl_easy_setopt(req, CURLOPT_URL, "https://*.net");
        curl_easy_setopt(req, CURLOPT_WRITEFUNCTION, cb);
        curl_easy_setopt(req, CURLOPT_WRITEDATA,&wow);
        res = curl_easy_perform(req);
        if (res != CURLE_OK)
        {
            fprintf(stderr, "curl_easy_perform() failed: %s\n", curl_easy_strerror(res));
        }
    }
    curl_easy_cleanup(req);
    cout << wow << endl;
    return 0;

}


I'm stuck, because my web scraper HAS to recognize these characters.
Thank you very much
The page in question is encoded in UTF-8. You need to first read the entire content as a binary buffer, then run a UTF-8 decoder on it to get back a std::wstring (or a std::u32string if wchar_t is not wide enough for you).
Thank you,
I searched Google for how to convert strings to binary buffers, and I added this to my code:
    wstring wwow = wstring_convert<codecvt_utf8<wchar_t>>().from_bytes(wow);
    wcout << wwow << endl << wwow.size();

It's better, but still not right: it appears that non-ASCII characters are shifted by roughly 15, and I couldn't find any logic to it.
Example:
(é->Ú ; è ->Þ ; à -> Ó)

On top of that I'm worried: I wanted my crawler to be really fast (with only one string traversal), and this operation makes it significantly slower.
it appears that non-ASCII characters are shifted
Are you on Windows? That's just how they're being displayed on the console. The actual data in memory is correct.
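If you also want the characters to display correctly in the console itself, here is a minimal sketch (my own addition, Windows-only, and it assumes the source file is compiled as UTF-8, e.g. with /utf-8):

#include <cstdio>    // _fileno
#include <fcntl.h>   // _O_U16TEXT
#include <io.h>      // _setmode
#include <iostream>
#include <string>

int main()
{
    // Put stdout into UTF-16 mode so std::wcout writes wide characters directly,
    // instead of squeezing them through the console's legacy code page.
    _setmode(_fileno(stdout), _O_U16TEXT);

    std::wstring s = L"Crédits";
    std::wcout << s << std::endl;

    // Caution: after this call, stick to wcout; mixing in narrow cout/printf
    // will trigger a CRT assertion in debug builds.
    return 0;
}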

I wanted my crawler to be really fast (with only one string traversal)
As long as the logic you're trying to run on the data doesn't require random access (i.e. seeking back and forth), you could in theory design a state machine capable of decoding the UTF-8, parsing the HTML, and searching for interesting strings in the document, all in one pass of the raw binary data coming from the network. The problem is that it wouldn't be faster than doing multiple passes over the data; it would just use less memory, since you wouldn't need to hold the entire document in memory at once. It would only be faster if, for example, the data you need is within the first 10% of the document, and once you have that you can abort the download of the remaining 90%.
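As an aside, aborting early is straightforward with libcurl: if the write callback returns a value different from size * nmemb, libcurl stops the transfer and curl_easy_perform() returns CURLE_WRITE_ERROR. A rough sketch (the stop condition is just an invented example):

#include <string>

static size_t cb(void* contents, size_t size, size_t nmemb, void* userp)
{
    auto* buf = static_cast<std::string*>(userp);
    buf->append(static_cast<const char*>(contents), size * nmemb);

    // Hypothetical stop condition: once the <head> has arrived we have what we
    // need, so return 0 to make libcurl abort the rest of the download.
    if (buf->find("</head>") != std::string::npos)
        return 0;

    return size * nmemb;
}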

It's possible that decoding the UTF-8 is slower than just processing the raw binary (it shouldn't be, but it's possible that the codecvt implementation is inefficient), but if you need to process the character values and not the byte values then it doesn't make any difference, because you need to decode the UTF-8 one way or another. At best you could try a different decoder.
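For what it's worth, a hand-rolled streaming decoder is not complicated. A minimal sketch (my own illustration, not a drop-in replacement for codecvt; it accepts overlong sequences and code points above U+10FFFF without complaint):

#include <string>

struct Utf8Decoder
{
    char32_t cp = 0;    // code point currently being assembled
    int remaining = 0;  // continuation bytes still expected

    // Feed one byte; appends a code point to 'out' whenever one is complete.
    // Could be called byte-by-byte from inside the libcurl write callback.
    void feed(unsigned char b, std::u32string& out)
    {
        if (remaining == 0)
        {
            if (b < 0x80)              { out.push_back(b); }             // ASCII
            else if ((b >> 5) == 0x06) { cp = b & 0x1F; remaining = 1; } // 110xxxxx
            else if ((b >> 4) == 0x0E) { cp = b & 0x0F; remaining = 2; } // 1110xxxx
            else if ((b >> 3) == 0x1E) { cp = b & 0x07; remaining = 3; } // 11110xxx
            else                       { out.push_back(0xFFFD); }        // invalid lead byte
        }
        else if ((b >> 6) == 0x02)      // continuation byte 10xxxxxx
        {
            cp = (cp << 6) | (b & 0x3F);
            if (--remaining == 0)
                out.push_back(cp);
        }
        else                            // malformed sequence
        {
            remaining = 0;
            out.push_back(0xFFFD);
        }
    }
};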
This is really weird.
I followed these instructions:
https://stackoverflow.com/a/1875622
I ticked the option and built my program with this in main:
int main(int argc, char* argv[])
{
    string tmp = "é";
    cout << tmp << endl;
    return 0;
}

It displays:
Ú

What am I missing?!

EDIT:
OK, I am saving the cout output to a file, to see whether it's just the console "misspelling" it.
OK, I have successfully exported it to a file, and the characters are correct!

One last thing: is there any faster decoder than codecvt?

I did actually want to build a state machine; I'm convinced it would be useful for holding the hundreds of visited URLs. It would significantly reduce the number of comparisons, wouldn't it?
What exactly are you trying to do?
A bot that searches for websites that contain enough keywords from a keyword list, and then creates a Graphviz map. At the moment I'm trying to resolve the encoding problem in a separate project.

I'm getting mad at another issue: I tried to build my string as a wstring from the beginning, to improve my code, but my wstring won't print?!
#include <iostream>
#include "curl/curl.h"
#include <string>
#include <locale> 
#include <codecvt>
#include <fstream>

using namespace std;

static size_t cb(void* contents, size_t size, size_t nmemb, void* userp)
{
    ((wstring*)userp)->append((wchar_t*)contents, size * nmemb);
    return size * nmemb;
}
int main(int argc, char* argv[])
{
    CURL* req = curl_easy_init();
    CURLcode res;
    wstring wow;
    if (req)
    {
        curl_easy_setopt(req, CURLOPT_URL, "https://frenchwebsite.net");
        curl_easy_setopt(req, CURLOPT_WRITEFUNCTION, cb);
        curl_easy_setopt(req, CURLOPT_WRITEDATA,&wow);
        res = curl_easy_perform(req);
        if (res != CURLE_OK)
        {
            fprintf(stderr, "curl_easy_perform() failed: %s\n", curl_easy_strerror(res));
        }
    }
    curl_easy_cleanup(req);
    wcout << wow << endl << wow.size();
    cout << wow.size();
    return 0;
}

Returns:
32582
(this means it has the right number of wchar_t)
I can't export the string to a file either.
The above code won't work. If the website is returning the content encoded as UTF-8 then you have no choice but to run a decoder, if you want to get the character data out. Merely casting the pointer to the type you need doesn't do anything.
Also, since sizeof(wchar_t) > sizeof(char), the append() call in your write callback will cause an out-of-bounds access: std::wstring::append() attempts to read size * nmemb wide characters, when in fact only size * nmemb bytes are available.
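In other words, keep the byte-oriented callback from your first version and decode once after the transfer finishes. A rough sketch (assuming the page really is UTF-8; wstring_convert/codecvt are deprecated since C++17 but still available, and a decode_utf8() of this kind is what the snippet further down assumes):

#include <codecvt>
#include <locale>
#include <string>

static size_t cb(void* contents, size_t size, size_t nmemb, void* userp)
{
    // Append raw bytes; no decoding happens here.
    static_cast<std::string*>(userp)->append(static_cast<const char*>(contents), size * nmemb);
    return size * nmemb;
}

// Called once, after curl_easy_perform() has filled 'raw'.
static std::wstring decode_utf8(const std::string& raw)
{
    // codecvt_utf8_utf16 also handles characters outside the BMP when
    // wchar_t is 16 bits, as it is on Windows.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.from_bytes(raw);
}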

A bot that searches for websites that contain enough keywords from a keyword list, and then creates a Graphviz map. At the moment I'm trying to resolve the encoding problem in a separate project.
So how does this problem statement relate to this:
I did actually want to build a state machine; I'm convinced it would be useful for holding the hundreds of visited URLs.
?
Once you've extracted the useful data from a request and stored its relationships to the data you already had, why would you need to keep the content around? For example, imagine something like:
struct Page{
    std::string url;
    std::vector<Page *> links;
};
A graph using this class could be constructed without very complex logic:
// Forward declaration, so construct_graph() can call get() below.
Page *get(const std::string &url, std::map<std::string, Page *> &pages);

Page *construct_graph(const std::string &initial_url){
    std::map<std::string, Page *> pages;
    return get(initial_url, pages);
}

Page *get(const std::string &url, std::map<std::string, Page *> &pages){
    auto it = pages.find(url);
    if (it != pages.end())
        return it->second;
    auto ret = new Page;
    ret->url = url;
    pages[url] = ret;
    // download(), decode_utf8() and extract_links() are left abstract on purpose.
    auto links = extract_links(decode_utf8(download(url)));
    for (auto &link : links)
        ret->links.push_back(get(link, pages));
    return ret;
}
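A quick usage sketch (reusing the placeholder URL from earlier in the thread):

int main()
{
    Page *root = construct_graph("https://frenchwebsite.net");
    // Every visited URL now exists exactly once in the graph, so emitting the
    // Graphviz edges is just a walk over root->links (and so on, recursively).
    return 0;
}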
> On top of that I'm worried: I wanted my crawler to be really fast (with only one string traversal),
> and this operation makes it significantly slower.
You need to make it 'right' before you even begin to think about making it 'fast'.

The elephant in the room is your network speed.
You can easily afford small performance sacrifices for the sake of clean, easy to read code.
Topic archived. No new replies allowed.