
Web Crawler - Source Code

 
 

Web Crawler component

The main procedure is 'Crawl'. It is called recursively for each page that is crawled.
First, we request the desired URL and retrieve its HTML.
We also get the original byte buffer of the page, its encoding, and its content type:

GetHTMLFromSite(url, ref html, out encoding, string.Empty, out buffer, ref contentType);
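The internals of GetHTMLFromSite are not shown on this page. As a rough sketch of the decoding step only, assuming the raw buffer and the Content-Type header value are already available, the text encoding could be picked from the charset parameter. The names PageDecoder, GetEncoding and Decode are hypothetical and not part of the Noviway component, and the UTF-8 fallback is an assumption of this sketch.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of the Noviway component.
public static class PageDecoder
{
    // Picks an encoding from a Content-Type value such as "text/html; charset=utf-8".
    // Falls back to UTF-8 when no usable charset is present (an assumption of this sketch).
    public static Encoding GetEncoding(string contentType)
    {
        if (!string.IsNullOrEmpty(contentType))
        {
            foreach (string part in contentType.Split(';'))
            {
                string p = part.Trim();
                if (p.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
                {
                    string name = p.Substring("charset=".Length).Trim('"', ' ');
                    try { return Encoding.GetEncoding(name); }
                    catch (ArgumentException) { break; } // unknown charset name
                }
            }
        }
        return Encoding.UTF8;
    }

    // Turns the raw page buffer into a string using the encoding chosen above.
    public static string Decode(byte[] buffer, string contentType)
    {
        return GetEncoding(contentType).GetString(buffer);
    }
}
```

Keeping the original buffer next to the decoded string makes it possible to decode the page again if the declared charset turns out to be wrong.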

Next, we convert the HTML to XML so that we have a readable structure to work with. We create the HTML parser object and run it:

Noviway.HTMLParser.Parser parser = new Noviway.HTMLParser.Parser(logger, html);

parser.Process();

Once the data has been processed, we can convert it to XML:

Noviway.HTMLParser.HTMLDocument doc = new Noviway.HTMLParser.HTMLDocument(logger, parser.Tags, encoding);

doc.Process();


doc.CreateXml(out xmlDoc);

Then we raise the NewPage event. We use a structure called WebPage to store the page data.
This structure is passed to the callback function (the event handler).

Now we have an XML document to parse.
We go over all of its anchor tags to find the links to crawl next:

ArrayList list = doc.GetElementsByTagName("a");
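What happens to each anchor is not shown on this page. A minimal sketch of one way to turn anchors into crawlable URLs follows. It assumes the XML is held in a standard System.Xml.XmlDocument rather than the Noviway document class, and the LinkExtractor and GetLinks names are hypothetical. Relative links are resolved against the URL of the page that contains them, and anything that is not http or https is skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper; uses System.Xml rather than the Noviway document class.
public static class LinkExtractor
{
    // Returns the absolute http/https URLs referenced by <a href="..."> elements.
    public static List<string> GetLinks(XmlDocument xmlDoc, string pageUrl)
    {
        var links = new List<string>();
        var baseUri = new Uri(pageUrl);

        foreach (XmlNode node in xmlDoc.GetElementsByTagName("a"))
        {
            XmlAttribute href = node.Attributes["href"];
            if (href == null || href.Value.Trim().Length == 0)
                continue; // anchor without a target

            Uri absolute;
            // Resolve relative links against the URL of the containing page.
            if (!Uri.TryCreate(baseUri, href.Value.Trim(), out absolute))
                continue;

            // Skip mailto:, javascript: and other non-web schemes.
            if (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps)
                links.Add(absolute.AbsoluteUri);
        }
        return links;
    }
}
```

Each returned URL could then be handed to the next level of the crawl.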

Demo

Create the crawler object. Set the maximum crawling levels, WebPage parameters and the callback function:

Noviway.WebCrawler.Crawler crawler = new Noviway.WebCrawler.Crawler();

crawler.MaxCrawlingLevels = 3;
crawler.SaveBuffer = true;
crawler.SaveHTML = true;
crawler.NewPageEvent = new Noviway.WebCrawler.Crawler.NewPageCallback(NewPageEvent);
crawler.Crawl("http://www.noviway.com");

Create a callback function. This function will receive notifications about new web pages crawled.

private void NewPageEvent(Noviway.WebCrawler.Crawler.WebPage page, int level)
{ /* process the crawled page here */ }

That's it. You can do whatever you please with the WebPage object you receive.
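To illustrate how a level limit and a callback can interact in a recursive crawl, here is a small stand-in that works on links supplied by the caller instead of fetching real pages. MiniCrawler is hypothetical and is not the Noviway.WebCrawler.Crawler class. Counting levels from 1 and skipping already-visited URLs are assumptions of this sketch, since the page does not say how the component handles either.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for illustration; not the Noviway.WebCrawler.Crawler class.
public class MiniCrawler
{
    public delegate void NewPageCallback(string url, int level);

    public int MaxCrawlingLevels = 3;
    public NewPageCallback NewPageEvent;

    private readonly Func<string, IEnumerable<string>> getLinks;
    private readonly HashSet<string> visited = new HashSet<string>();

    // getLinks is supplied by the caller and returns the URLs found on a page.
    public MiniCrawler(Func<string, IEnumerable<string>> getLinks)
    {
        this.getLinks = getLinks;
    }

    public void Crawl(string url)
    {
        Crawl(url, 1); // the start page is level 1 (an assumption of this sketch)
    }

    private void Crawl(string url, int level)
    {
        // Stop at the level limit, and skip pages that were already reported.
        if (level > MaxCrawlingLevels || !visited.Add(url))
            return;

        if (NewPageEvent != null)
            NewPageEvent(url, level);

        // Recurse into every link found on this page, one level deeper.
        foreach (string link in getLinks(url))
            Crawl(link, level + 1);
    }
}
```

With a link map where a links to b and c, b links to a and d, and d links to e, a crawl from a with a limit of 3 reports a, b, d and c. Page e sits on level 4 and is never reported.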


 
 
  Webmaster: Eran Aharonovich © All rights reserved to Eran Aharonovich 2007