Web Crawler


 
A web crawler (also known as a web spider or web robot) is a program or automated script
that browses the World Wide Web in a methodical, automated manner. Other, less frequently
used names for web crawlers are ants, automatic indexers, bots, and worms (Kobayashi and Takeda, 2000).

With the web crawler you will be able to:

  • Extract data from a site that updates its content constantly (such as CNN News or Yahoo News)
  • Build a search engine
  • Download large numbers of files from web pages, such as MP3 files or zipped source code
  • Build a monitoring bot that alerts you when specific conditions are met, for instance
    when the NASDAQ falls below a specific value

How does our Web Crawler work?

  • Choose a starting URL, such as www.Noviway.com
  • Download the page and convert its HTML to XML using the HTML To XML component
  • Go over all anchor tags ( <A> )
  • For each anchor, extract the "href" attribute from the XML node
  • For each URL you extracted, repeat the whole procedure from the first step, using that URL as the new starting point
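
The steps above can be sketched in a few lines of Python. This is only an illustration of the procedure, not the component's actual code: it uses Python's standard html.parser in place of the HTML To XML component, and the names AnchorCollector, extract_links, fetch, and crawl are assumptions made for this example. It also keeps a set of URLs already seen and a page limit, which the steps above do not mention but which a crawl needs in order to terminate.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class AnchorCollector(HTMLParser):
    """Collects the href attribute of every anchor tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names, so <A HREF=...> matches too.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for all anchors in html, with #fragments removed."""
    collector = AnchorCollector()
    collector.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in collector.hrefs]


def fetch(url):
    """Download a page and decode it as text (hypothetical helper)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl from start_url; returns the URLs visited, in order."""
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that cannot be downloaded
        visited.append(url)
        for link in extract_links(url, html):
            # Follow only web links that have not been queued before.
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing the download function in as fetch_page makes it possible to try the crawl loop on a small in-memory set of pages before pointing it at a real site.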

The web crawler component is very easy to use. For each page it crawls, an event is
raised in which you can do anything you want with the page, e.g. store the HTML or process the XML.
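
As a rough illustration of the per-page event, the sketch below shows one way such a notification could look in Python. The names PageCrawledEvent, subscribe, raise_event, and crawl_pages are hypothetical and are not the component's real interface; crawl_pages merely stands in for the crawl loop by walking over pages that are already in memory.

```python
class PageCrawledEvent:
    """A minimal event: handlers subscribe, and the crawler raises it once per page."""

    def __init__(self):
        self._handlers = []

    def subscribe(self, handler):
        # handler is any callable taking (url, html)
        self._handlers.append(handler)

    def raise_event(self, url, html):
        for handler in self._handlers:
            handler(url, html)


def crawl_pages(pages, page_crawled):
    """Stand-in for the crawl loop: raises the event for every crawled page."""
    for url, html in pages.items():
        page_crawled.raise_event(url, html)


# Usage: store the HTML of every page as it is crawled.
stored = {}
page_crawled = PageCrawledEvent()
page_crawled.subscribe(lambda url, html: stored.update({url: html}))
crawl_pages({"http://example.test/": "<html></html>"}, page_crawled)
```

Keeping the handler separate from the crawl loop means the same crawler can store pages, index them, or check them against alert conditions without any change to the crawling code.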

 
 
 

