Saturday, February 25, 2012

Making a web crawler

To build a search engine, the first thing we need is a web crawler. What is a web crawler? What does it do?

A web crawler is a software program that helps a search engine gather data from millions of websites with little human intervention. People often call crawlers "bots" or "spiders"; Google's crawler, for example, is named Googlebot, and Google is the most famous search engine.

So, let's look at how a web crawler operates. A simple algorithm for a web crawler is:

Given a link, it
1. Crawls the webpage, gathering its text data
2. Collects the links that point to other web documents (websites)
3. For each of those links, goes back to step 1
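The three steps above can be sketched as a breadth-first crawl. Here is a minimal version in Python using only the standard library; the names (crawl, LinkCollector, max_pages) are illustrative, not part of any particular crawler:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collects href targets from <a> tags (step 2 of the algorithm)."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def crawl(seed_url, max_pages=10):
    """Breadth-first crawl: fetch a page, save its text, queue its links."""
    queue = deque([seed_url])
    seen = {seed_url}
    pages = {}                      # url -> raw HTML (step 1: the text data)
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue                # skip pages that fail to load
        pages[url] = html
        parser = LinkCollector(url)
        parser.feed(html)
        for link in parser.links:   # step 3: revisit step 1 for each link
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

A real crawler would add politeness (robots.txt, crawl delays), deduplication of near-identical pages, and persistent storage, but the queue-plus-seen-set loop is the core idea.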

A crawler automatically fetches millions of web documents this way, giving us more data than we can comfortably handle. But for search, all that data is essential, since we cannot know in advance what a user will expect the engine to return. No wonder Google keeps building large data centers around the world to store this data and turn it into useful information.

So, to build a search engine, our first task is to build a web crawler. The next few posts will help you build a working one.
