Web Crawler and Search Bot - Help

Hi all,

I am an old programmer now making my return - old habits die hard.

I used to work with C++ under Linux but haven't programmed in a few years since moving more into design and management. However, I have a business idea I want to try out.

I am looking to crawl a specific website and all its sub-domains, trawl it for keywords, count the occurrences of each keyword under the domain, and store the result in a database.

For example, consider IBM's portal (a massive website): I want to check how many web pages contain the word "ThinkPad".

I have no idea where to start. Should I be looking at things like GNU Wget, Abot, or something else? Or am I looking at writing a search engine? When you enter a word in Google, it tells you the number of results and the time taken, like "2,999 results in 0.003 seconds".

In simple terms, it's like running grep on a list of files and piping the output into wc (word count), except I want to run it on a website and all its sub-domains and files. I would also like to define my search criteria in an XML or rules file - something I can enhance and manipulate over time.

Where should I start?

Thanks,

cbf28
It's not all that hard. You only need two things: a networking library (I recommend SFML: http://sfml-dev.org/ ) and an HTML parser (I found this one: http://github.com/dhoerl/htmlcxx but I have never used it).

You'd have a class which takes a URL and downloads the page, counts the number of times the search string appears, and scans the downloaded page for URLs. For every URL that it finds, it creates a new instance and the cycle starts over again. Personally I would have each new instance spawn in a new thread (with a maximum number of threads), and I'd store the URLs already visited so that no URL would be downloaded or parsed twice.
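Roughly like this - an untested, single-threaded sketch, assuming SFML for the download, with a crude regex standing in for a proper HTML parser and only same-host links followed (the host and keyword are placeholders):

#include <SFML/Network.hpp>
#include <cstddef>
#include <iostream>
#include <queue>
#include <regex>
#include <set>
#include <string>

// Count non-overlapping occurrences of `word` in `text`.
std::size_t countOccurrences(const std::string& text, const std::string& word)
{
    std::size_t count = 0;
    for (std::size_t pos = text.find(word); pos != std::string::npos;
         pos = text.find(word, pos + word.size()))
        ++count;
    return count;
}

int main()
{
    const std::string host    = "http://www.example.com";  // placeholder
    const std::string keyword = "ThinkPad";

    std::queue<std::string> frontier;   // pages still to visit
    std::set<std::string>   visited;    // so nothing is fetched or parsed twice
    std::size_t             totalHits = 0;

    frontier.push("/");
    sf::Http http(host);

    while (!frontier.empty())
    {
        const std::string uri = frontier.front();
        frontier.pop();
        if (!visited.insert(uri).second)            // already seen this URL
            continue;

        sf::Http::Response response = http.sendRequest(sf::Http::Request(uri));
        if (response.getStatus() != sf::Http::Response::Ok)
            continue;

        const std::string& body = response.getBody();
        totalHits += countOccurrences(body, keyword);

        // Crude link extraction; a real HTML parser (like htmlcxx) would be more robust.
        static const std::regex href("href=\"([^\"]+)\"");
        for (std::sregex_iterator it(body.begin(), body.end(), href), end; it != end; ++it)
        {
            const std::string link = (*it)[1].str();
            if (!link.empty() && link[0] == '/')    // stick to same-host links for simplicity
                frontier.push(link);
        }
    }

    std::cout << "\"" << keyword << "\" appeared " << totalHits << " times\n";
}

For the threaded version you'd turn the while loop into a pool of worker threads (capped at some maximum) sharing the queue and the visited set behind a mutex, and the final count would go into your database instead of std::cout.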


Sort of. Is there a way to search the pages without actually downloading them? I assume I can put each downloaded page in a buffer, search it, then wipe the buffer so I don't end up holding the entire website. But that could still use up my memory, and web pages today no longer respect the old 1024-whatever limit (or whatever it was) that kept a page from crashing the browser. Surely no web page would be bigger than a few MB, but I am not sure how expensive the search would be if you're downloading and searching many pages in parallel.

Google has server farms and is indexing a bazillion keywords. I am not doing that - I just need a few specific, targeted keywords.

In Linux terms, if a file on a crawled website has not been 'touched', the bot skips it. So in theory the size of the crawl is elastic - i.e. once you've crawled a page, you don't crawl it again unless it has changed. That way I am not re-running the crawler over the entire domain every time, because the search itself is expensive.


Does this make sense?
AFAIK you have to download the web page; otherwise your computer knows nothing about it. Your browser downloads every page you visit, too.
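The upside is that you only ever need one page in memory at a time (per thread). Keep the body in a local, count, and let it go out of scope - something like this sketch, using the SFML suggestion from above:

#include <SFML/Network.hpp>
#include <cstddef>
#include <string>

// The page text lives only for the duration of this call;
// once it returns, only the count survives.
std::size_t hitsOnPage(sf::Http& http, const std::string& uri, const std::string& keyword)
{
    sf::Http::Response response = http.sendRequest(sf::Http::Request(uri));
    if (response.getStatus() != sf::Http::Response::Ok)
        return 0;

    const std::string& body = response.getBody();   // no extra copy of the page
    std::size_t count = 0;
    for (std::size_t pos = body.find(keyword); pos != std::string::npos;
         pos = body.find(keyword, pos + keyword.size()))
        ++count;
    return count;   // `response` (and with it the page text) is freed here
}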
@cbf28
Is the program only going to run on certain words, or does it need to be applicable to any given keyword?

@ResidentBiscuit
What he means (I think) is this: if the crawler runs on the same domain more than once, it should only re-download and re-index pages that have been modified since the last download.
Ah, that's simple enough. HTTP has an If-Modified-Since request header you can set; if the page hasn't changed since that date, the server replies with 304 Not Modified.
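With SFML that's just one extra header on the request - a sketch, with a placeholder host and date, and bear in mind that not every server honours the header:

#include <SFML/Network.hpp>
#include <iostream>

int main()
{
    sf::Http http("http://www.example.com");        // placeholder host

    sf::Http::Request request("/index.html");
    // Date of your last crawl of this page, in HTTP date format (value is just an example).
    request.setField("If-Modified-Since", "Sat, 01 Jun 2013 00:00:00 GMT");

    sf::Http::Response response = http.sendRequest(request);

    if (response.getStatus() == sf::Http::Response::NotModified)
        std::cout << "Unchanged since last crawl - skip it\n";
    else if (response.getStatus() == sf::Http::Response::Ok)
        std::cout << "Changed (or the server ignored the header) - re-index it\n";
    else
        std::cout << "Request failed with status " << response.getStatus() << "\n";
}

If a server ignores the header, you can fall back to comparing the Last-Modified response header against the date of your last crawl, or hash the page body and compare hashes between runs.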