
Web Crawler


Download source code

A web crawler (also known as a web spider or web robot) is a program or automated script
that browses the World Wide Web in a methodical, automated manner. Other, less frequently
used names for web crawlers are ants, automatic indexers, bots, and worms (Kobayashi and
Takeda, 2000).

With the web crawler you will be able to:

  • Extract data from a site that constantly updates its content (like CNN News or Yahoo News)
  • Build a search engine
  • Download large numbers of files linked from web pages, such as MP3 files or code (zip files)
  • Build a monitoring bot that alerts you when specific conditions are met, for instance
    when the NASDAQ falls below a specified value

How does our Web Crawler work?

  • Choose a starting URL, such as www.Noviway.com
  • Download the page and convert its HTML to XML using the HTML To XML component
  • Go over all anchor tags ( <A> ) in the XML
  • For each anchor, extract the "href" attribute from the XML node
  • For each URL you extracted, repeat the whole procedure from the first step
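
The anchor-extraction steps above can be sketched in Python. This is only an illustration and not the component's own code: it uses Python's standard html.parser module in place of the HTML To XML component, and the names AnchorCollector and extract_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class AnchorCollector(HTMLParser):
    """Collects the href attribute of every anchor tag it encounters."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names,
        # so <A HREF="..."> and <a href="..."> are handled alike.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return the absolute http(s) URLs of all anchors, in page order, without duplicates."""
    parser = AnchorCollector()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Resolve relative links against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in links:
            links.append(absolute)
    return links
```

Relative links are resolved against the URL of the page they were found on, and non-web links such as mailto: addresses are skipped, since a crawler cannot follow them.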

The web crawler component is very easy to use. An event is raised for each page crawled,
and in its handler you can do anything you want with the page, e.g. store the HTML or
process the XML.
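
As a rough sketch of this event-driven design, the Python function below runs the crawl loop and calls a handler once per page. It is not the component's actual API: the names crawl, fetch, get_links, on_page and max_pages are all assumptions made for the example, and the download and link-extraction steps are passed in as functions so the loop can be tried without network access.

```python
from collections import deque


def crawl(start_url, fetch, get_links, on_page, max_pages=100):
    """Breadth-first crawl from start_url, calling on_page(url, html) once per page.

    fetch(url) returns the page's HTML as a string, or None on failure.
    get_links(html, url) returns the absolute URLs found on that page.
    Returns the number of pages crawled.
    """
    queue = deque([start_url])
    seen = {start_url}  # URLs already queued, so each page is visited only once
    crawled = 0
    while queue and crawled < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # skip pages that could not be downloaded
        crawled += 1
        on_page(url, html)  # the "page crawled" event
        for link in get_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled


# Usage with a small in-memory stand-in for a web site (hypothetical URLs).
site = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/", "http://example.com/b"],
    "http://example.com/b": [],
}
visited = []
count = crawl(
    "http://example.com/",
    fetch=lambda url: "page" if url in site else None,
    get_links=lambda html, url: site[url],
    on_page=lambda url, html: visited.append(url),
)
```

The set of seen URLs keeps pages that link to each other from being crawled twice, and max_pages bounds the crawl, which matters because following every link on the open web would otherwise never finish.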

 
Take a look at the source code...
 
 
 
  Webmaster: Eran Aharonovich © All rights reserved to Eran Aharonovich 2007