Embed search engines into my C++ code

Hi all:

I need to write a program in C++ that can search through Google and other search engines.

For example, it accepts the user input "cat", hooks up with Google Image Search, and downloads the first 10 result images into a folder on the local machine.

Can anyone shed light on this?


I looked at the Google Image Search API, which requires JavaScript interaction. I'm a bit confused about what to do next. If you could explain some details of the Google Image Search API, I'd be very thankful.
Hi,

You should use socket programming. Search for socket programming on Google.

e.g.:
http://beej.us/guide/bgnet/output/html/singlepage/bgnet.html

You can use the Boost library:
http://www.boost.org/doc/libs/1_46_1/doc/html/boost_asio.html

Check the HTTP client in the examples section:
http://www.boost.org/doc/libs/1_46_1/doc/html/boost_asio/examples.html

You can use libCURL to query Google:
http://curl.haxx.se/libcurl/
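
For example, the core of a blocking HTTP GET with Boost.Asio looks roughly like this (a minimal sketch adapted from the sync client example linked above; the host name is just a placeholder, and real code needs more error checking):

#include <boost/asio.hpp>
#include <iostream>
#include <string>

using boost::asio::ip::tcp;

int main()
{
	try
	{
		boost::asio::io_service io_service;

		// Resolve the host name; the "http" service selects port 80.
		tcp::resolver resolver(io_service);
		tcp::resolver::query query("www.example.com", "http");
		tcp::resolver::iterator endpoint = resolver.resolve(query);

		tcp::socket socket(io_service);
		socket.connect(*endpoint);

		// Send a bare-bones HTTP/1.0 request.
		std::string request =
			"GET / HTTP/1.0\r\n"
			"Host: www.example.com\r\n"
			"Connection: close\r\n\r\n";
		boost::asio::write(socket, boost::asio::buffer(request));

		// Read until the server closes the connection, dumping to stdout.
		boost::asio::streambuf response;
		boost::system::error_code ec;
		while(boost::asio::read(socket, response,
			boost::asio::transfer_at_least(1), ec))
			std::cout << &response;
	}
	catch(std::exception& e)
	{
		std::cerr << e.what() << '\n';
	}
}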


@screw:
Beautiful!
I'll absolutely check it out.
I think I'm going to try libcurl first, and I'll definitely try boost even if libcurl works.
Thank you for the links.


@Galik:
Thank you, Galik! I've installed everything and I've run a few sample programs without any problem.
I've successfully downloaded a page, but I don't know how to download all the images on that page (or how to specify conditions for which images should be downloaded).

There are indeed examples and API tutorials on the site, but the relevant ones seem quite long and will take me some time to work through. So please feel free to post links to quick, easy examples and tutorials about downloading images with libcurl!

Many thanks again.

@h9uest

Use libCURL the same way you download a web page to download the images: put in the URL of the image, write the response to a file with the correct extension in its name, and that should be fine.

Here is an adaptation of an example of using libCURL I posted here recently:
#include <curl/curl.h>
#include <fstream>
#include <iostream>

// callback function writes data to a std::ostream
static size_t data_write(void* buf, size_t size, size_t nmemb, void* userp)
{
	if(userp)
	{
		std::ostream& os = *static_cast<std::ostream*>(userp);
		std::streamsize len = size * nmemb;
		if(os.write(static_cast<char*>(buf), len))
			return len;
	}

	return 0;
}

/**
 * timeout is in seconds
 **/
CURLcode curl_read(const std::string& url, std::ostream& os, long timeout = 30)
{
	CURLcode code(CURLE_FAILED_INIT);
	CURL* curl = curl_easy_init();

	if(curl)
	{
		// Set up the transfer: write callback plus target stream, no
		// progress meter, follow redirects, timeout in seconds, then the URL.
		if(CURLE_OK == (code = curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, &data_write))
		&& CURLE_OK == (code = curl_easy_setopt(curl, CURLOPT_NOPROGRESS, 1L))
		&& CURLE_OK == (code = curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L))
		&& CURLE_OK == (code = curl_easy_setopt(curl, CURLOPT_WRITEDATA, &os))
		&& CURLE_OK == (code = curl_easy_setopt(curl, CURLOPT_TIMEOUT, timeout))
		&& CURLE_OK == (code = curl_easy_setopt(curl, CURLOPT_URL, url.c_str())))
		{
			code = curl_easy_perform(curl);
		}
		curl_easy_cleanup(curl);
	}
	return code;
}

// Image URL
std::string url = "http://thecandybros.co.uk/images/Cfudge.jpg";

int main()
{
	curl_global_init(CURL_GLOBAL_ALL);

	std::ofstream ofs("output.jpg", std::ostream::binary);

	if(CURLE_OK == curl_read(url, ofs))
	{
		// Image successfully written to file
	}

	curl_global_cleanup();
}

The original example is here:
http://cplusplus.com/forum/unices/45878/#msg249287
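
And for the "first 10 images" part of the original question, the same curl_read() can just be called in a loop. This is a sketch only: it assumes curl_read() and the headers from the example above, and the file-naming scheme here is made up:

#include <vector>
#include <string>
#include <sstream>
#include <fstream>
#include <iostream>

// Save up to `limit` images, reusing curl_read() from the example above.
// Real code should derive each file's extension from its URL.
void save_images(const std::vector<std::string>& urls, std::size_t limit = 10)
{
	for(std::size_t i = 0; i < urls.size() && i < limit; ++i)
	{
		std::ostringstream name;
		name << "image_" << i << ".jpg";
		std::ofstream ofs(name.str().c_str(), std::ofstream::binary);
		if(CURLE_OK != curl_read(urls[i], ofs))
			std::cerr << "download failed: " << urls[i] << '\n';
	}
}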
@Galik:

Thank you!
I think my wording confused you.

The images I want to download are the results returned on a Google page.
For example, if I post a query to Google, the result page will contain lots of images embedded in HTML tags. It seems that HTML/XML parsing is necessary to retrieve the image URLs.

An example on the libcurl site:
http://curl.haxx.se/libcurl/c/example.html

see "HTML parsing".

The main.c alone contains 6200 lines of code!


I did some reading and found a pretty good XML parsing tool: libxml.
But again, it looks a bit overwhelming...
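
From what I can tell, the part I'd actually need boils down to something like this (a rough sketch using libxml2's HTML parser and XPath, pieced together from its docs; I haven't tried it against a real Google results page):

#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
#include <iostream>
#include <string>

int main()
{
	// Stand-in for a page fetched with libcurl.
	std::string html =
		"<html><body><img src='http://example.com/a.jpg'></body></html>";

	// Parse the (possibly malformed) HTML quietly.
	htmlDocPtr doc = htmlReadMemory(html.c_str(),
		static_cast<int>(html.size()), NULL, NULL,
		HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
	if(!doc) return 1;

	// XPath: the src attribute of every <img> element.
	xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
	xmlXPathObjectPtr res = xmlXPathEvalExpression(BAD_CAST "//img/@src", ctx);

	if(res && res->nodesetval)
	{
		for(int i = 0; i < res->nodesetval->nodeNr; ++i)
		{
			xmlChar* url = xmlNodeGetContent(res->nodesetval->nodeTab[i]);
			std::cout << url << '\n';
			xmlFree(url);
		}
	}

	xmlXPathFreeObject(res);
	xmlXPathFreeContext(ctx);
	xmlFreeDoc(doc);
}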


I definitely understand that this is what CS is like: you keep learning new stuff. But given my current situation it's a bit awkward, because I don't have that much time.


So if you have any suggestions that could help me avoid the nasty HTML parsing, or some good, handy tools for the task of downloading images from a Google Image Search result page, please let me know!

I guess I'll have to do it the hard way if no shortcuts are available.

Thanks again! :)
If you know regular expressions then you could use those to extract the image URLs from the returned web page. Boost has a good regular expressions library:

http://www.cs.brown.edu/~jwicks/boost/libs/regex/doc/introduction.html

You probably need regex_search()
http://www.cs.brown.edu/~jwicks/boost/libs/regex/doc/regex_search.html
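
A rough sketch of what that might look like (the pattern is deliberately simple and will miss some markup, but it shows regex_search() collecting every <img> src URL):

#include <boost/regex.hpp>
#include <iostream>
#include <string>

int main()
{
	// Stand-in for a page downloaded with curl_read() above.
	std::string html =
		"<html><body>"
		"<img src=\"http://example.com/a.jpg\">"
		"<img alt=\"x\" src=\"http://example.com/b.png\"/>"
		"</body></html>";

	// Match the src attribute inside <img> tags; capture the URL.
	boost::regex img_re("<img[^>]*src=[\"']([^\"']+)[\"']",
		boost::regex::icase);

	boost::smatch what;
	std::string::const_iterator begin = html.begin(), end = html.end();

	// regex_search() finds one match at a time, so loop to collect them all.
	while(boost::regex_search(begin, end, what, img_re))
	{
		std::cout << what[1] << '\n'; // the captured URL
		begin = what[0].second;       // continue after this match
	}
}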


@Galik:
Many thanks!

I've decided to move on with the main part of my project for now, and will get back to this issue later.