Web Screen Scraper



I'm trying to write a screen scraper to get the dividend information from a Morningstar page, URL:
http://quotes.morningstar.com/fund/fbiox/f?t=FBIOX

I'm using C++ (WinINet) to read the page at that URL into memory:

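// Needs <windows.h>, <wininet.h> and <string>; link with wininet.lib.
// hWeb (a session handle, e.g. from InternetOpenA) and URL are set up elsewhere.
std::string WebPage;
HINTERNET OpenAddress = InternetOpenUrlA(hWeb, URL, NULL, 0,
    INTERNET_FLAG_PRAGMA_NOCACHE | INTERNET_FLAG_KEEP_CONNECTION, 0);
if (OpenAddress != NULL)
{
    char DataReceived[4096];
    DWORD NumberOfBytesRead = 0;
    // InternetReadFile signals end of data by succeeding with zero bytes read.
    while (InternetReadFile(OpenAddress, DataReceived, sizeof(DataReceived),
                            &NumberOfBytesRead) && NumberOfBytesRead > 0)
    {
        // Append by length so an embedded NUL byte cannot truncate the page.
        WebPage.append(DataReceived, NumberOfBytesRead);
    }
    InternetCloseHandle(OpenAddress);
}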


When I view the source of the page, the area I want contains this markup:
<!-- distribution  start-->
  <div id="iddistribution">
    <div  class="gr_section_b2">
       <h2><div class="gr_row_b1 mt30" >
          <span class="gr_text_subhead" id="Dividend">Dividend 
                              and Capital Gains Distributions</span>
       <span class="gr_text_subhead">&nbsp;<span>AABPX</span></span>
    </div>
      </h2>
      <div id="DividendAndCaptical" class="gr_section_b1">
      </div>
    </div>
  </div>
<!-- distribution  end-->


When I display the page in my browser, select all, and copy it into Notepad, I get:

Dividend and Capital Gains Distributions  FBIOX

Distribution  Distribution  Long-Term     Short-Term    Return of  Dividend  Distribution
Date          NAV           Capital Gain  Capital Gain  Capital    Income    Total
12/05/2014    218.42        23.8400       0.0000        0.0000     0.0000    23.8400
12/06/2013    178.44        0.4050        0.0000        0.0000     0.0000    0.4050
04/12/2013    139.24        0.0000        0.0550        0.0000     0.0000    0.0550
12/07/2012    109.29        0.3520        0.7790        0.0000     0.0000    1.1310
04/13/2012    91.80         5.3500        0.0000        0.0000     0.0000    5.3500

Currency: USD


Evidently the "<div id="DividendAndCaptical" class="gr_section_b1">" in the raw page is an empty placeholder, and the browser fills it with the data I want after the page loads (presumably from a script). Does anyone know how to emulate a web browser and request that additional content?
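
One way to confirm from the C++ side that the data is missing from the downloaded HTML is to check whether that div is empty in the WebPage string. A minimal sketch, assuming the element has no nested <div> children (true for the markup shown above); the function name IsEmptyPlaceholder is hypothetical:

```cpp
#include <cctype>
#include <string>

// Returns true if `html` contains an element whose opening tag carries
// id="<id>" and there is only whitespace between that opening tag and the
// next "</div>". Assumes no nested <div> inside the element.
bool IsEmptyPlaceholder(const std::string& html, const std::string& id)
{
    const std::string marker = "id=\"" + id + "\"";
    const std::size_t idPos = html.find(marker);
    if (idPos == std::string::npos) return false;       // element not found

    const std::size_t openEnd = html.find('>', idPos);  // end of opening tag
    if (openEnd == std::string::npos) return false;

    const std::size_t closePos = html.find("</div>", openEnd + 1);
    if (closePos == std::string::npos) return false;

    for (std::size_t i = openEnd + 1; i < closePos; ++i)
    {
        if (!std::isspace(static_cast<unsigned char>(html[i]))) return false;
    }
    return true;
}
```

If IsEmptyPlaceholder(WebPage, "DividendAndCaptical") returns true, the table is not in the fetched document at all, so it has to come from a separate request that the browser makes after loading the page.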