C++ HTML DOM Parser

Hey everyone,

I'm having a real hard time trying to parse HTML in C++...

Basically, all I want to do is read an HTML page, parse it, and write the contents of the page out to a tab-delimited file...

Here's the code that I wrote:
#include <stdio.h>
#include <windows.h>
#include <wininet.h>
#include <string>
#include <comdef.h>
#include <mshtml.h> 

#import <mshtml.tlb> no_auto_exclude 

#pragma comment(lib, "wininet.lib")

#include <iostream>
#include <fstream>

using namespace std;

int main(int argc, char* argv[]){
	CoInitialize(NULL);

	ofstream dbfile ("output.db");
	string sLI;
	string m_strURL;
	HINTERNET hOpen, hFile; 

	MSHTML::IHTMLDocument2Ptr pDoc;
	HRESULT hr = CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER, IID_IHTMLDocument2, (void**)&pDoc);

	SAFEARRAY* psa = SafeArrayCreateVector(VT_VARIANT, 0, 1);
	VARIANT *param;
	
	hOpen = InternetOpen("UN/1.0", INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);

	hFile = InternetOpenUrl(hOpen, "http://online.wsj.com/public/page/news-global-world.html", NULL, 0, 0, 0);

	if(hFile){
		CHAR buffer[10*1024];
		DWORD dwRead;

		while(InternetReadFile(hFile, buffer, 1024, &dwRead)){
			if(dwRead == 0)
				break;

			buffer[dwRead] = 0;

			bstr_t bsData = (LPCTSTR)buffer;
			hr =  SafeArrayAccessData(psa, (LPVOID*)&param);
			param->vt = VT_BSTR;
			param->bstrVal = (BSTR)bsData;

			cout << buffer << endl;
			dbfile << buffer << endl;
			
			hr = pDoc->write(psa);	

		} //end while loop
		
		hr = pDoc->close();
		InternetCloseHandle(hFile);
		SafeArrayDestroy(psa);
	}
	
	InternetCloseHandle(hOpen);
	dbfile.close();
	
	CoUninitialize();
	return 0;
}


but I still can't figure out how to access the DOM elements and print their text content to a file... For example, I want to parse the HTML and print out the text between <li>some content</li> or <div>some more content</div> or <td>yep some more content</td> or <h1>you guessed it...some more content</h1> or whatever other tag...

Example:
<html><title>my Title
</title><body><ul><li>Body text</li></ul><div>blah blah blah</div></body></html>


Parse ->
<html>
    <title>
        my Title
    </title>
    <body>
        <ul>
            <li>
                Body text
            </li>
        </ul>
        <div>
            blah blah blah
        </div>
    </body>
</html>


Text file ->
 
Body text {tab} blah blah blah
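
For reference, the "text between any tags" idea can be sketched in portable C++ without MSHTML. This is only a rough illustration, not a DOM parser: the helper names trim and text_runs are made up for this example, and it simply drops everything between '<' and '>' and keeps the non-blank text that remains. On the example page above it would also pick up "my Title", since it does not know which tags to skip.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: trims leading and trailing whitespace.
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: returns every non-blank run of text that sits
// outside of tags, in document order. Comments, scripts and entities
// are NOT handled; this is a sketch, not an HTML parser.
std::vector<std::string> text_runs(const std::string& html) {
    std::vector<std::string> runs;
    std::string current;
    bool in_tag = false;
    for (char c : html) {
        if (c == '<') {
            // A tag starts: flush whatever text was collected so far.
            in_tag = true;
            std::string t = trim(current);
            if (!t.empty()) runs.push_back(t);
            current.clear();
        } else if (c == '>') {
            in_tag = false;
        } else if (!in_tag) {
            current += c;
        }
    }
    std::string t = trim(current);
    if (!t.empty()) runs.push_back(t);
    return runs;
}
```

Writing the runs to a file with a '\t' between them would give one line of tab-delimited output.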
You should look up strstr() if you want to find specific points within a file.
Why the hell would you use strstr() with C++?

Use std::string::find()
like this:
std::string line;
size_t pos;

std::getline(myFile, line);

if ((pos = line.find("string")) != std::string::npos) {
    // Found the text at pos
}
Thanks for the reply folks...

But something doesn't seem right...I'm not looking for a specific text - what I want is all content between html tags...for example I want to parse an HTML file and extract all content between <div> tags and insert that content into a tab delimited file

I don't see how using find() or strstr() would help...
You have to know the position of <div> in the file before you can get the data between the tags. That's why strstr() or find() has been mentioned. Personally, I like strstr() better than find(). It's less cumbersome AFAIAC.

How else do you think you're going to find out what's between the tags?
@Lamblion...

I was thinking that I can access the HTML DOM and output the content between all - for examples sake - <div></div>
If you want to grab the data from the file itself, then you're engaging in wishful thinking. Unless there's some convention I'm not aware of, the easiest, fastest, and most efficient way to get the data is to use either strstr() or find() to nail down the position of the opening <div> and the closing </div> and then grab everything in between.
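
A minimal sketch of the strstr() approach described here, assuming the whole page is already in a NUL-terminated buffer. The function name between_tags is made up for illustration, and it only handles the first literal occurrence of the tags it is given (no attributes, no nesting).

```cpp
#include <cstring>
#include <string>

// Hypothetical helper (not from the thread): returns the text between the
// first occurrence of `open` and the next occurrence of `close` in `html`,
// or an empty string if either tag is missing.
std::string between_tags(const char* html, const char* open, const char* close) {
    const char* start = std::strstr(html, open);
    if (start == nullptr) return "";
    start += std::strlen(open);                   // step past the opening tag
    const char* end = std::strstr(start, close);  // search only after the opening tag
    if (end == nullptr) return "";
    return std::string(start, end);               // copy the range [start, end)
}
```

Calling between_tags(page, "<div>", "</div>") on the example page from the first post would return "blah blah blah".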
To be honest, I prefer to write code in C because I also find it less "cumbersome." But at the same time, C++'s std::string::find() is probably better to use.

Like this:
std::string div; /* Store everything between the div tags */
const std::string openTag = "<div>";
const std::string closeTag = "</div>";

size_t pos = htmlBuffer.find(openTag); /* Find the opening div tag */
if (pos != std::string::npos) {
    pos += openTag.length(); /* Skip to the end of the tag */

    /* Find the start of the closing tag, searching from the end of the opening tag */
    size_t end = htmlBuffer.find(closeTag, pos);
    if (end != std::string::npos)
        div = htmlBuffer.substr(pos, end - pos); /* Copy everything in between */
}

Perhaps?
Edit: note, that code was rather quickly put together and it only handles the first bare <div> (no attributes, no nesting); but then again, this is just an example.
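
To get every <div> rather than just the first, the same find() idea can be put in a loop. This is a sketch under the same assumptions as above (literal tags, no attributes, no nesting); all_between and join_tabs are hypothetical names invented for this example.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: collects the text of every literal open...close pair,
// scanning left to right and resuming after each closing tag.
std::vector<std::string> all_between(const std::string& html,
                                     const std::string& open,
                                     const std::string& close) {
    std::vector<std::string> out;
    std::size_t pos = 0;
    while ((pos = html.find(open, pos)) != std::string::npos) {
        pos += open.length();                      // skip past the opening tag
        std::size_t end = html.find(close, pos);
        if (end == std::string::npos) break;       // unterminated tag: stop
        out.push_back(html.substr(pos, end - pos));
        pos = end + close.length();                // continue after the closing tag
    }
    return out;
}

// Hypothetical helper: joins the fields with tabs, giving one line of a
// tab-delimited file.
std::string join_tabs(const std::vector<std::string>& fields) {
    std::string line;
    for (std::size_t i = 0; i < fields.size(); ++i) {
        if (i != 0) line += '\t';
        line += fields[i];
    }
    return line;
}
```

The result of join_tabs(all_between(htmlBuffer, "<div>", "</div>")) can then be written to an std::ofstream as a single row.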
Your code demonstrates what I was talking about. I can do the same thing with strstr() with two or three simple lines of code.

The difference is, as usual, that with straight C you GENERALLY get a little closer to the hardware without a lot of overhead under the hood, but there are of course many advantages to C++ versus C. You just have to find the way that you are comfortable with.
I know what you mean; as I say, I do prefer C over C++. But for string manipulation, C++'s std::strings are better.

Yeah it would be easy with strstr(), but again, I think C++ is just better for string manipulation.

Oh and by the way I just realised, you could do that in one line in C++:
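#include <iostream>
#include <string>

int main() {
    std::string divBuffer = "<div>\n\t<img src=\"images/image1.png\" />\n</div>";

    std::cout << "divBuffer =\n" << divBuffer << "\n";

    // substr() takes (start, length), so the length is the closing tag's position minus the start.
    // The + 6 skips "<div>" plus the newline that follows it.
    size_t start = divBuffer.find("<div>") + 6;
    std::string buf = divBuffer.substr(start, divBuffer.find("</div>") - start);

    std::cout << "buf =\n" << buf << std::endl;

    return 0;
}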
divBuffer =
<div>
        <img src="images/image1.png" />
</div>
buf =
        <img src="images/image1.png" />

In conclusion, for string manipulation, C++ > C. I did the actual string copying in one line.
Yeah, it does appear a bit easier with the C++ string manipulators, but I just have a fundamental mental hangup with using MOST of the C++ string functions because of the overhead and because I can't really "see" what's going on.

The main C++ string function I use is StrTrm(), as that's just plain easy. In straight C you'd have to set up a loop to accomplish the same thing.

Generally, I like to stay as close as possible to telling the hardware what to do. That way I learn a lot more. Of course, the logical conclusion is going to learning to be a good Assembly programmer, but for now I've got my hands completely full (and then some) just learning C/C++ and the Win32 API.
To be fair, the overhead is, at worst, minimal. Now, I usually wouldn't be sticking up for C++ in the C vs C++ argument, I generally prefer C; but for string manipulation I think only scripting languages like Perl and Python (which were both designed for string manipulation; or, at least, with it in mind) can beat C++. I've never played with functional languages though; so I don't know about them. As for not being able to understand it properly, that's probably because I wrote it to be compact and not readable, per se.

Also, if you want to stick close to the hardware then yes, you'll want to learn assembly. It'll both exhilarate and terrify you when you find out that "hello world" is this (or at least, it was on DOS):
mov AH, 0Eh ; BIOS function to print a character (hex literals need a leading digit)
mov AL, 'H' ; Place the character in AL register
int 10h ; Call the BIOS to print the character

mov AL, 'e'
int 10h

mov AL, 'l'
int 10h
int 10h ; Print l twice

mov AL, 'o'
int 10h

mov AL, ','
int 10h

mov AL, ' '
int 10h

mov AL, 'w'
int 10h

mov AL, 'o'
int 10h

mov AL, 'r'
int 10h

mov AL, 'l'
int 10h

mov AL, 'd'
int 10h

mov AL, '!'
int 10h

Have fun :)
I've played with the basics of Assembly, and bought a couple of books. I'll bet you can't get your code to compile now. If I remember correctly, the int 10h statement doesn't work, at least in an inline C/C++ program with Visual Studio. It's been a while since I've played with it, so I don't remember the exact details, but the old code just won't work in many/most cases.
No, it doesn't work. It's for DOS.

If you can find me a 16-bit assembler, I can get it to assemble into a binary, and sometimes Windows will let it work under its DOS emulation layer; but it doesn't actually work.

And interrupt 0x10 doesn't work because code running in protected (32-bit) mode can't call the BIOS functions that are available in real mode. The NT DOS emulator emulates that BIOS function for real-mode programs.

Anyway, I gotta get this Perl script working. I'm writing a simple file moving program; but for some weird reason one of my arrays gets cleared when I go to write the files.
@Lamblion & chrisname:

The two of you are making some valid points and giving me some brilliant ideas...however, I'm still not convinced...HTML by nature is extremely forgiving, so HTML source code can be very, very sloppy...

Here's a real world example:
<div class="hat_search_container">
<div class="hat_search" id="hat_search_autocomplete">
<form name = "autocompleteHeaderForm">
<table border="0" cellpadding="0" cellspacing="0" class="autocompleteContainer">
<tr>
<td>
<div class="symbolCompleteContainer">
<div>
<input type="text" name="hat_input" id="hat_input_auto" value="" maxlength="80" autocomplete="off" />
</div>
</div>

<div id="SearchQuoteGoButton" class="hat_button">
<span class="hat_button_text"> SEARCH </span>
</div>

<!--<a class="hat_search_ad" target="_blank" href="http://ad.doubleclick.net/clk;217331088;6853491;m?http://www.principal.com/banners/landing/aboutprincipal.htm">
<img src="http://online.wsj.com/img/principal_logo_transp.gif"/>
</a>-->

<div style="clear:both;"/>
<div id="symbolCompleteResults" class="subSymbolCompleteResults"></div>
</td>
</tr>
</table>
</form>
</div>
</div>

I'm still not confident that using find() or strstr() would help... If that's all it takes, then why would anyone waste time creating HTML parsers?
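
To make the concern concrete: a search for the literal string "<div>" never matches <div class="hat_search_container">, and the first "</div>" it finds belongs to an inner element. The sketch below is one hand-rolled way around both problems, by matching opening tags that carry attributes, skipping comments, and counting nesting depth. The name outer_divs is hypothetical, and the code makes simplifying assumptions that a real parser would not: lowercase tag names, no '>' inside attribute values, and <div .../> treated as self-closing (which is presumably what the markup's author meant, though the HTML parsing rules do not treat a slash on a div that way).

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns the inner HTML of every outermost <div>.
// A sketch only; see the assumptions listed in the text above.
std::vector<std::string> outer_divs(const std::string& html) {
    std::vector<std::string> out;
    std::size_t pos = 0;
    std::size_t start = 0;  // where the current outermost div's content begins
    int depth = 0;          // how many divs are currently open

    while ((pos = html.find('<', pos)) != std::string::npos) {
        // Skip comments so that commented-out markup is not counted.
        if (html.compare(pos, 4, "<!--") == 0) {
            std::size_t end = html.find("-->", pos + 4);
            if (end == std::string::npos) break;
            pos = end + 3;
            continue;
        }

        std::size_t close = html.find('>', pos);
        if (close == std::string::npos) break;

        // "<div" must be followed by '>', '/' or whitespace, so that a tag
        // such as <divider> is not mistaken for a div.
        bool is_open = false;
        if (html.compare(pos, 4, "<div") == 0) {
            char next = html[pos + 4];
            is_open = next == '>' || next == '/' || next == ' ' ||
                      next == '\t' || next == '\n' || next == '\r';
        }
        bool is_end = html.compare(pos, 6, "</div>") == 0;
        bool self_closing = html[close - 1] == '/';

        if (is_open && !self_closing) {
            if (depth == 0) start = close + 1;  // content starts after '>'
            ++depth;
        } else if (is_end && depth > 0) {
            --depth;
            if (depth == 0) out.push_back(html.substr(start, pos - start));
        }
        pos = close + 1;
    }
    return out;
}
```

Even this much bookkeeping only covers one tag name and a few of the sloppy cases, which is a fair hint at why dedicated HTML parsers exist.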
You said you were trying to parse the file in C++. You didn't say anything about HTML. If you want to parse the file with C++ you use strstr() or find(). It's that simple.
It really is that simple. Hell, you could even do something like this:
std::ifstream htmlFile;
htmlFile.open("path/to/file.html");

if (htmlFile.is_open()) {
    std::cout << "Parsing... ";
    std::string line;
    while (std::getline(htmlFile, line)) { // The stream comes first, then the string
        // Now parse the line.
    }
    std::cout << "done.\n";
} else { // Uh-Oh: we couldn't open the file!
    std::cout << "Couldn't open file \"path/to/file.html\"; check file exists and can be accessed." << std::endl;
    exit(1);
}
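
One caveat with reading line by line is that a tag and its closing tag are often on different lines, so a per-line search can miss the pair. A possible alternative, sketched here with a made-up helper name (read_file), is to pull the whole file into one std::string and search that instead.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: reads an entire file into `contents`.
// Returns false if the file cannot be opened.
bool read_file(const std::string& path, std::string& contents) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    std::ostringstream buffer;
    buffer << in.rdbuf();      // copy the whole stream in one go
    contents = buffer.str();
    return true;
}
```

The resulting string can then be handed to find() or strstr() (via c_str()) exactly as in the earlier posts.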
@Lamblion:
In my first post, I clearly stated that I'm having a real hard time trying to parse HTML in C++... ... ... ... I also stated that I was having trouble figuring out how to access the DOM elements and print the text content to a file

@chrisname:
Looping through a simple HTML file line by line would be okay with this method...but feeding in hundreds or thousands of files, not to mention complex HTML files, would prove very time-consuming...

Anyway, guys...I think I figured it out and it's working...

You may want to try using this script -

http://www.biterscripting.com/SS_WebPageToCSV.html


It extracts a table from a web page to a CSV. (Change it to generate TSV.) When I executed the following command in biterscripting



scr ss_WebPageToCSV.txt page("http://www.principal.com/banners/landing/aboutprincipal.htm") number(2)

I am getting the second table. You can modify that sample script to your specific requirements - I think it will be simpler that way.


