C++ Web Extraction

Hello Everyone,

I was wondering how to extract data from websites and use it in my program. For example, how can I have my program read a page's HTML code and save it as a text file? I know I can do this manually in Internet Explorer under Page > View Source. I would like my program to do that for me.

Any help would be greatly appreciated.

Thanks, Arthur
www.cplusplus.com
Someone asked this question before. Use cURL.
In Linux, you can just use the get command for that.
But surely you won't use system()...
Another thing you could do is to download the HTML file from the website (not sure how you'd do that by the way) and then just use some good old file I/O.


If that's any help...
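
The "good old file I/O" part could look roughly like the sketch below, assuming the page has already been saved to disk by some other means. The helper names read_whole_file and write_whole_file are made up for illustration, and error handling is kept to a bare minimum.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Read an entire file (for example a saved HTML page) into a string.
std::string read_whole_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path + " for reading");
    }
    std::ostringstream buffer;
    buffer << in.rdbuf();  // copy the whole stream in one go
    return buffer.str();
}

// Write a string to a file, replacing whatever was there before.
void write_whole_file(const std::string& path, const std::string& text) {
    std::ofstream out(path, std::ios::binary);
    if (!out) {
        throw std::runtime_error("cannot open " + path + " for writing");
    }
    out << text;
}
```

Calling write_whole_file("page.txt", read_whole_file("page.html")) would then copy a saved page into a text file; both file names here are placeholders.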
If you want it to be automated, then once cURL is installed you can use it like any other Unix command in a shell script.

A lot of organizations use cURL for web extraction. Besides the cURL binary, the project also provides cURL libraries that you can integrate into your programs.

Now, after extracting the HTML, you will want to remove the HTML tags to get to the contents, which requires an HTML tag parser. I don't have a candidate for that yet.
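
Until someone names a proper parser, a very naive stand-in is to drop everything between a '<' and the next '>'. The sketch below does only that; it is not a real HTML parser, and it will mishandle scripts, comments, character entities such as &amp;amp; and any '>' inside an attribute value. strip_tags is a made-up name for illustration.

```cpp
#include <string>

// Naive tag stripper: discards everything from '<' up to the next '>'.
// Text outside tags is copied through unchanged.
std::string strip_tags(const std::string& html) {
    std::string text;
    bool inside_tag = false;
    for (char c : html) {
        if (c == '<') {
            inside_tag = true;        // a tag starts here
        } else if (c == '>' && inside_tag) {
            inside_tag = false;       // the tag ends here
        } else if (!inside_tag) {
            text += c;                // ordinary content
        }
    }
    return text;
}
```

For example, strip_tags("<p>Hello <b>world</b></p>") gives "Hello world". For real pages a dedicated HTML parsing library is the safer route.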

Anyone want to recommend a C++ HTML tag parser? I know in Java there is one.
I'm having trouble working out which cURL download I need and how to install it. I do all my C++ in Visual Studio 2010. What is the right download for me? And can anyone guide me through how to add it to my project? Thanks for all the help!
Do you want a ready-made binary that you just run to get results, or do you want to add the cURL libraries to your project so your program can use cURL functionality?

In either case, try contacting the cURL authors; I believe they can help you. Last time I visited, they also had dedicated cURL forums there.

curl.haxx.se
I want my program to be fully functional as an exe, so I would think that means adding the cURL libraries to my project.
Then you do need cURL in library form. Contact the cURL authors to see if they distribute cURL libraries for Windows. I know they do for Linux/Unix, but I'm not sure about Windows.
Okay great! Thanks for all your help!
I hope this isn't off-topic, but if this is more about productivity, then you might want to consider using another language (e.g. Perl) for something like this.