HTML parser

Hi everyone,

I'm writing a program that downloads the HTML of a website and extracts some information from it.

I can download the HTML with libcurl, but I'm really struggling to parse it. I've tried converting the HTML to XML using HTML Tidy (recommended here: http://www.mostthingsweb.com/2013/02/parsing-html-with-c/), but it just gives:

line 621 column 13 - Error: <time> is not recognized!
line 3156 column 9 - Error: <time> is not recognized!
terminate called after throwing an instance of 'std::logic_error'
  what():  basic_string::_S_construct null not valid
Aborted


Any ideas would be most appreciated.
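
On the abort at the end of that output: the message "basic_string::_S_construct null not valid" is what libstdc++ reports when a std::string is constructed from a null char pointer. A likely cause (an assumption, since the calling code isn't shown) is that Tidy produced no output after the <time> errors and the empty output buffer was passed straight to std::string. A minimal sketch of a guard, where to_string_or_empty is a hypothetical helper name and not part of Tidy:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copy a C buffer into a std::string, treating a
// null pointer (or zero length) as "no output" instead of letting the
// std::string constructor throw std::logic_error.
std::string to_string_or_empty(const char* data, std::size_t size) {
    if (data == nullptr || size == 0) {
        return std::string();
    }
    return std::string(data, size);
}
```

With something like this in place the program no longer aborts, and an empty result tells you that the conversion step failed. It does not fix the <time> errors themselves.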
Why not use a library that reads HTML directly, instead of a lossy conversion to XML followed by an XML library?

Also, if the website has an API, it would probably be best to use it. Most web APIs return JSON nowadays, which makes it really easy to get the information.
Hi,

Unfortunately the website has no API.

I'm using libxml2 but I'm finding it very difficult as there aren't really any examples!