Web Crawler component
The main procedure is 'Crawl'. It is called recursively for each page the crawler visits. Each call starts by fetching the page's HTML:
GetHTMLFromSite(url,
Then we need to convert the HTML to XML so we can get a readable structure.
We create the HTML Parser object:
Noviway.HTMLParser.Parser parser =
Noviway.HTMLParser.HTMLDocument doc = new Noviway.HTMLParser.HTMLDocument(logger, parser.Tags, encoding);
doc.Process();
doc.CreateXml(out xmlDoc);
Then we raise the NewPage event. We use a structure called WebPage to store the page data. This structure is passed to the callback function (the event).
Now, we've got the XML document to parse. We go over all anchor tags:
ArrayList list = doc.GetElementsByTagName("a");
Demo
Create the crawler object. Set the maximum crawling levels, WebPage parameters and the callback function:
Noviway.WebCrawler.Crawler crawler = new Noviway.WebCrawler.Crawler();
private void NewPageEvent(Noviway.WebCrawler.Crawler.WebPage page, int level)
That's it. You can do whatever you please with the WebPage object you receive.
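The recursive flow described above (fetch a page, raise an event for it, then follow each anchor one level deeper) can be sketched without the component itself. The sketch below is only an illustration under stated assumptions, not the Noviway implementation: SketchCrawler, PageInfo and SketchDemo are hypothetical names, the fetch step is a delegate supplied by the caller, and links are pulled out with a naive regular expression rather than through the HTML-to-XML conversion the component uses.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the component's WebPage structure.
public struct PageInfo
{
    public string Url;
    public string Html;
}

public class SketchCrawler
{
    // Naive href matcher (assumption: double-quoted href attributes only).
    private static readonly Regex Href =
        new Regex("<a\\s[^>]*href\\s*=\\s*\"([^\"]+)\"", RegexOptions.IgnoreCase);

    private readonly Func<string, string> fetch;          // returns HTML, or null on failure
    private readonly HashSet<string> visited = new HashSet<string>();

    public int MaxLevel = 2;                               // deepest level that is still crawled
    public event Action<PageInfo, int> NewPage;            // raised once per fetched page

    public SketchCrawler(Func<string, string> fetch) { this.fetch = fetch; }

    public void Crawl(string url, int level)
    {
        // Stop at the depth limit, and never fetch the same URL twice.
        if (level > MaxLevel || !visited.Add(url)) return;

        string html = fetch(url);
        if (html == null) return;

        NewPage?.Invoke(new PageInfo { Url = url, Html = html }, level);

        foreach (Match m in Href.Matches(html))
        {
            // Resolve relative links against the current page's absolute URL.
            Uri next;
            if (!Uri.TryCreate(new Uri(url), m.Groups[1].Value, out next)) continue;
            if (next.Scheme != Uri.UriSchemeHttp && next.Scheme != Uri.UriSchemeHttps) continue;
            Crawl(next.AbsoluteUri, level + 1);
        }
    }
}

public static class SketchDemo
{
    // Crawls a tiny in-memory "site" and records "level:url" for every NewPage event.
    public static List<string> Run()
    {
        var pages = new Dictionary<string, string>
        {
            { "http://a.test/",  "<a href=\"/b\">b</a> <a href=\"/c\">c</a>" },
            { "http://a.test/b", "<a href=\"/\">home</a>" },
            { "http://a.test/c", "" }
        };

        var seen = new List<string>();
        var crawler = new SketchCrawler(u => pages.ContainsKey(u) ? pages[u] : null);
        crawler.NewPage += (page, level) => seen.Add(level + ":" + page.Url);
        crawler.Crawl("http://a.test/", 0);
        return seen;
    }
}
```

The visited set is what keeps the back-link from /b to the home page from causing endless recursion; the level limit alone would only bound the depth, not prevent repeated fetches of the same page.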