cURL and c_str() weirdness

Hello all, please help out a Biologist:

I am trying to parse the following URL using the cURL library:

www.ncbi.nlm.nih.gov/nucleotide/? term = Anthoxanthum[organism] AND 2003/7/25:2005/12/27[Publication Date]&format=text

but cURL returns XML (the default), not the text format I asked for.

I'm using this code line:
curl_easy_setopt(curl, CURLOPT_URL, URL.c_str());

Here's the weird thing: when I replace the "URL.c_str()" above with the actual text of the web search I want to do, it works fine. Also, if I paste in the URL from fout<<URL, that works fine in a browser.

Seems to me it's a c_str() problem, maybe the "&" or "="? But I can't figure it out, so I turn to the cplusplus forum for their usual wisdom.
closed account (Dy7SLyTq)
are you sure that URL holds the correct value?
The URL works fine (pasted into browser), also works fine if I output the string to a text file and paste that.

The returned cURL data is in the desired text form when I explicitly define the URL:
curl_easy_setopt(curl, CURLOPT_URL, "www.ncbi.nlm.nih.gov/nucleotide/? term = Anthoxanthum[organism] AND 2003/7/25:2005/12/27[Publication Date]&format=text");

but if I say:

URL = "www.ncbi.nlm.nih.gov/nucleotide/? term = Anthoxanthum[organism] AND 2003/7/25:2005/12/27[Publication Date]&format=text";
curl_easy_setopt(curl, CURLOPT_URL, URL.c_str());

it doesn't.
Is the string URL still in scope when you call curl_easy_perform (or whatever)??

(A string literal is stored in the const segment of your exe, so it will never be deallocated. But a string will be destroyed as soon as it goes out of scope, invalidating the (const) char* returned by c_str().)
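A minimal sketch of that lifetime rule, using hypothetical helper names (make_url, length_via_c_str) that are not from this thread. It only illustrates the C++ side. Note that, according to the curl_easy_setopt documentation, libcurl 7.17.0 and later copy string options such as CURLOPT_URL, so with a recent libcurl the pointer only needs to stay valid for the duration of the curl_easy_setopt() call.

```cpp
#include <cstddef>
#include <string>

// BAD (commented out on purpose): the local std::string is destroyed when the
// function returns, so the pointer handed back would dangle.
// const char* dangling_url(const std::string& term) {
//     std::string url = "http://www.ncbi.nlm.nih.gov/nucleotide/?term=" + term;
//     return url.c_str();
// }

// OK: return the std::string by value; the caller owns it and controls its lifetime.
// (make_url is a hypothetical helper for illustration.)
std::string make_url(const std::string& term) {
    return "http://www.ncbi.nlm.nih.gov/nucleotide/?term=" + term;
}

// OK: the caller keeps `url` alive for as long as the pointer is in use.
std::size_t length_via_c_str(const std::string& url) {
    const char* p = url.c_str();  // valid while `url` exists and is not modified
    std::size_t n = 0;
    while (p[n] != '\0') ++n;
    return n;
}
```

The safe pattern is to hold the std::string in a variable that outlives every use of the char*, and call c_str() only at the point where the pointer is handed over.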

Andy

PS Not directly related to the c_str()/char* problem, but the documentation for CURLOPT_URL does say you should specify the scheme (e.g. http://, ftp://, ldap://, ...) as part of the URL.

CURLOPT_URL

Pass in a pointer to the actual URL to deal with. The parameter should be a char * to a zero terminated string which must be URL-encoded in the following format:

scheme://host:port/path

http://curl.haxx.se/libcurl/c/curl_easy_setopt.html#CURLOPTURL
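A rough sketch of how the scheme could be added before the URL is handed to CURLOPT_URL. The helper names (has_scheme, ensure_scheme) and the choice of "http" as the default are assumptions for illustration, not part of libcurl.

```cpp
#include <cctype>
#include <string>

// True if `url` starts with something that looks like "scheme://".
// Only letters, digits, '+', '-' and '.' are accepted before the "://".
bool has_scheme(const std::string& url) {
    const std::string::size_type pos = url.find("://");
    if (pos == std::string::npos || pos == 0) return false;
    for (std::string::size_type i = 0; i < pos; ++i) {
        const unsigned char c = static_cast<unsigned char>(url[i]);
        if (!std::isalnum(c) && c != '+' && c != '-' && c != '.') return false;
    }
    return true;
}

// Prepend a default scheme when none is present (hypothetical helper).
std::string ensure_scheme(const std::string& url,
                          const std::string& default_scheme = "http") {
    return has_scheme(url) ? url : default_scheme + "://" + url;
}
```

With this, ensure_scheme("www.ncbi.nlm.nih.gov/nuccore/FJ817486") gives a URL in the scheme://host/path form the documentation asks for, and a URL that already has a scheme is returned unchanged.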
That was it, naraku933!!

I played around with your answers and it looks like it's only the spaces (%20) that matter; I can leave the brackets in as [] not %5B... %5D.

Thanks for your help, I'd never have found that on my own.
The recommended way is to call curl_easy_escape() on the initial string; libcurl then does the encoding for you correctly.
http://curl.haxx.se/libcurl/c/curl_easy_escape.html
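To show what the escaping step produces, here is a hand-rolled percent-encoder in plain C++. It is a sketch for illustration only, and percent_encode and search_url are hypothetical names. In real code curl_easy_escape() does this job: it leaves its input untouched and returns a newly allocated escaped string, which you have to use and later release with curl_free(). Like this sketch, it also encodes '/', '?', '&' and '=', so it is meant for individual parameter values, not for a complete URL.

```cpp
#include <string>

// Percent-encode every byte except the RFC 3986 "unreserved" characters
// (A-Z a-z 0-9 - . _ ~).
std::string percent_encode(const std::string& value) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : value) {
        const bool unreserved =
            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '.' || c == '_' || c == '~';
        if (unreserved) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];    // high nibble
            out += hex[c & 0x0F];  // low nibble
        }
    }
    return out;
}

// Encode only the search term, then splice it into the URL, so the
// '?', '&' and '=' that structure the query string stay as they are.
std::string search_url(const std::string& term) {
    return "http://www.ncbi.nlm.nih.gov/nucleotide/?term=" +
           percent_encode(term) + "&format=text";
}
```

For example, percent_encode("a b") yields "a%20b", which is the space encoding that made the first request work.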
modoran,

I tried your suggestion, but can't get it to work in a similar application.

Neither does the suggestion to "hard encode" the characters with %hex format.

Here's what I have:
    for (int j=0; j<number_of_ids; ++j)
    {
        working = "www.ncbi.nlm.nih.gov/nuccore/" + genbank_id[j] +"?report=fasta&format=text";
        URL.push_back(working);
        fout<<URL[j]<<endl;
    }
    cout<<"URL vector populated"<<endl;

    //obtain url as FASTA
    for (int j = 0; j<(int)URL.size(); ++j)
    {
        CURL *curl;
        CURLcode res;
        string readBuffer;
        curl = curl_easy_init();
        if(curl)
        {
            curl_easy_escape(curl, URL[j].c_str(),0);
            fout<<URL[j].c_str()<<endl;
            curl_easy_setopt(curl, CURLOPT_URL, URL[j].c_str());
            curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L); //follow redirection
            curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
            curl_easy_setopt(curl, CURLOPT_WRITEDATA, &readBuffer);

            // Perform the request, res will get the return code
            res = curl_easy_perform(curl);

            // Check for errors
            if(res != CURLE_OK)
            {
                fprintf(stderr, "curl_easy_perform() failed: %s\n",
                curl_easy_strerror(res));
            }
            curl_easy_cleanup(curl);
            //curl_free (curl);
        }
        //Populate FASTA vector from readBuffer
        FASTA.push_back(readBuffer);
        DNA.push_back(readBuffer);
    }
    cout<<"FASTA strings populated into FASTA and DNA vector"<<endl;


I know the URL is good because when I paste this into a browser, I get my data (where FJ817486 is the first GenBank ID):
http://www.ncbi.nlm.nih.gov/nuccore/FJ817486?report=fasta&format=text

Any suggestions?
Are you handling javascript somehow? I don't get anything but the main page back with a warning about the site requiring javascript:
<strong>Warning:</strong>
The NCBI web site requires JavaScript to function.
<a href="http://www.ncbi.nlm.nih.gov/corehtml/query/static/unsupported-browser.html#enablejs" title="Learn how to enable JavaScript" target="_blank">more...</a>


http://curl.haxx.se/docs/faq.html#Does_curl_support_Javascript_or
norm,

The webpage I'm trying to access is just straight-up old-fashioned text. I don't think it's a JavaScript issue, because I got the first request to work, but this one won't work even if I explicitly encode the URL.

Do you know how I can check the URL "sent" by libcurl?
Let me clarify: this is working (sorta) because I get xml format back. The part that isn't working is the "&format=text" bit.
Not being a biologist, I'm probably using incorrect terminology but the data that you want is the genome sequence(?) as shown by this link, right?: http://www.ncbi.nlm.nih.gov/nuccore/FJ817486?report=fasta&format=text

Have you inspected the response that you got back? I used your code and got xml back but that sequence data(?) is not included. I could be wrong, but the page that you want is probably a dynamic web page generated by javascript and thus not possible to retrieve with libcurl.

norm,

The weblink you posted is the sequence I'm trying for, and you're also right that the XML doesn't seem to include that sequence, or I would just carve it up and get what I wanted.

If I "Inspect Element" on the page you linked, I see this:
<script type="text/javascript" src="/portal/js/portal.js?v3%2E5%2E1%2Er392364%3A+Mon%2C+Mar+25+2013+15%3A07%3A09"></script>

which supports your JavaScript idea.

So this is just a case of "you can't get there from here"?

Thanks for your help in any case.
So this is just a case of "you can't get there from here"?

Seems that way.

Do they not provide an API?

EDIT:
Here you go: http://www.ncbi.nlm.nih.gov/books/NBK25500/#chapter1.Downloading_Full_Records

curl_easy_setopt(curl, CURLOPT_URL, "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=FJ817486&rettype=fasta&retmode=text");
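A small sketch of how the loop over GenBank IDs could build these efetch URLs in place of the nuccore page URLs. The function names are hypothetical, and it assumes each ID consists only of letters, digits, '.' and '_' so that no escaping is needed. The query parameters are copied from the URL above.

```cpp
#include <string>
#include <vector>

// Build the E-utilities efetch URL for one nuccore record in FASTA text form.
// (efetch_fasta_url is a hypothetical helper; the ID is assumed URL-safe.)
std::string efetch_fasta_url(const std::string& genbank_id) {
    return "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
           "?db=nuccore&id=" + genbank_id + "&rettype=fasta&retmode=text";
}

// Build one URL per ID, preserving the order of the input vector.
std::vector<std::string> efetch_fasta_urls(const std::vector<std::string>& ids) {
    std::vector<std::string> urls;
    urls.reserve(ids.size());
    for (const std::string& id : ids) {
        urls.push_back(efetch_fasta_url(id));
    }
    return urls;
}
```

Each resulting string can then be passed to curl_easy_setopt(curl, CURLOPT_URL, url.c_str()) exactly as in the earlier loop.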