Building a search Engine with Python: February 2012

Saturday, February 25, 2012

Web Crawler (Part-1)

We will be using Python to build the search engine.

A web page contains a lot of information: pictures, videos, buttons and what not. Crawling all of that directly would be a huge, tedious task. What we will crawl instead is the source of the web document. The easiest way to view it is to right-click on the page you are viewing and choose View Source or View Page Source. This source is what the web server sends when a browser requests the page; what we actually see is its rendering by our web browser.

Now, our aim is to find all the links in the page source of a single web page. How the page source is obtained will be covered in later posts. Let us first write code to extract the first link that appears in the source; in the next part of this topic, we will extract all the links. The Python program that does this is as follows:

#!/usr/bin/python

# Storing the sample page source in 'a'
a = '<html><head></head><body> <a href="http://opencvuser.blogspot.com">Learn Open CV</a><a href="http://dileepwrites.blogspot.com">Dileep Writes</a></body></html>'

# Location of the 'a' in the first 'a href="'
start = a.find('a href="')

# 'a href="' is 8 characters long, so assuming no extra white space,
# the URL begins at start+8

# End of the URL: the location of the closing quote
end = a.find('"', start+8)

# Printing the URL without quotes
print(a[start+8:end])

In the variable a, we are storing a sample HTML source with two links. Links can be written in many ways, but we are assuming that they appear in the most standard form, as <a href="..."> ... </a> tags.

In order to extract the first link from a, we explore it using Python's string find method. This method gives the first occurrence of the substring we are searching for, as an offset from the beginning of the string. Its first argument is the substring to search for, and the optional second argument is the location from which to start searching. So, in start, we are storing the location where the a of a href occurs. After that, we need to find the closing quote of the URL to know where it ends. The opening quote of the URL is at location start+7, so the URL itself begins at start+8; from there on, we search for the closing quote, which marks the end of the URL. Then we print the URL. Below is the output

http://opencvuser.blogspot.com

Thus we have extracted the first link. In the next post I will show how to search for and extract all the URLs. It's just a small extension of this. Also, if any Python gurus are out there, I would love to see this code simplified further.
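One possible simplification, offered only as a sketch and not as the course's own solution, is to wrap the same find logic in a function. The name first_link and the choice to return None when no link is found are my assumptions. The find method itself returns -1 when the substring is missing, a case the snippet above does not check.

```python
def first_link(page):
    """Return the first URL found after 'a href="' in page, or None."""
    marker = 'a href="'
    start = page.find(marker)
    if start == -1:                    # find returns -1 when the marker is absent
        return None
    url_start = start + len(marker)    # first character after the opening quote
    url_end = page.find('"', url_start)
    if url_end == -1:                  # opening quote without a closing quote
        return None
    return page[url_start:url_end]


page = ('<html><head></head><body> '
        '<a href="http://opencvuser.blogspot.com">Learn Open CV</a>'
        '<a href="http://dileepwrites.blogspot.com">Dileep Writes</a>'
        '</body></html>')
print(first_link(page))
print(first_link('<html><body>no links here</body></html>'))
```

The first call prints http://opencvuser.blogspot.com and the second prints None. Using len(marker) instead of a hard-coded 8 keeps the offset correct if the marker ever changes.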

Making a web crawler

To build a search engine, we first need a web crawler. What is a web crawler? What does it do?

A web crawler is a software program that helps search engines gather data from millions of websites with little human intervention. People often call them Google bots, since they work like robots fetching data and Google is the most famous search engine.

So, let's look at how a web crawler operates. A simple algorithm for a web crawler can be

Given a link, it
1. Crawls the webpage, gathering text data
2. Collects the links that are pointing to other web documents (websites)
3. For each of these links, it again goes to step-1
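The three steps above can be sketched as a loop. This is only an illustration under stated assumptions: the dictionary toy_web stands in for real pages (fetching a page is a later topic), it maps each made-up URL directly to the links found on that page, and the names crawl, to_visit and crawled are mine.

```python
def crawl(seed, web):
    """Visit every page reachable from seed in the toy web; return them in order."""
    to_visit = [seed]      # links waiting to be crawled
    crawled = []           # pages already processed, in visiting order
    while to_visit:
        url = to_visit.pop(0)
        if url in crawled:
            continue       # skip pages we have already seen
        crawled.append(url)            # step 1: "crawl" the page
        links = web.get(url, [])       # step 2: collect its outgoing links
        to_visit.extend(links)         # step 3: repeat for each of those links
    return crawled


# Hypothetical miniature web: URL -> links found on that page.
toy_web = {
    'http://a.example': ['http://b.example', 'http://c.example'],
    'http://b.example': ['http://c.example', 'http://a.example'],
    'http://c.example': [],
}
print(crawl('http://a.example', toy_web))
```

Keeping track of the pages already crawled matters: without it, two pages that link to each other would be visited forever.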

It automatically fetches millions of web documents like this, giving us more data than we can actually handle. But for search, that much data is essential, since we cannot know in advance which data a user might expect a search engine to return. No wonder Google keeps building large data centers around the world to store this data and convert it into useful information.

So, to build a search engine, our first task is to build a web crawler. The next few posts will help you build a successful web crawler.

Thursday, February 23, 2012

Early mans of the Computer generation

I just came to know about two people who were among the early pioneers of our present computer generation.

The first one is Grace Hopper, who wrote the first compiler and showed that computers can do much more than arithmetic. Below is her photograph with the UNIVAC machine. You can look at what she is holding and guess the language she wrote a compiler for :P

In 1952, she had an operational compiler. She later described the situation at that time:

"Nobody believed that. I had a running compiler and nobody would touch it. They told me computers could only do arithmetic."

The second one goes as far back as the 1840s. At that time there was not even a computer, only a design that Babbage had for building one. But a lady thought of programming it once it was ready. She was Ada Lovelace, arguably the first programmer in the whole world. Her notes contained an algorithm to compute Bernoulli numbers on Babbage's machine. Below is a portrait of her.

Sorry for the title, they were actually early women.

Wednesday, February 22, 2012

Speed of Computer

Ever wondered how fast your computer is running, what the clock speed signifies, or why your processor is so small?

It's time to clear your brain.

The speed of light is about 3*10^8 meters/sec, which is 3*10^10 cm/sec. Let's see how far light travels in a nanosecond (10^-9 sec): 3*10^10 * 10^-9 = 30 cm.

Now if the processor speed of your computer is 2.7 GHz, it runs 2.7 billion cycles per second, which in turn means that one cycle takes (1/2.7)*(1/10^9) seconds to complete. In terms of the distance travelled by light, we already know that light covers 30 cm in (1/10^9) seconds, that is, one nanosecond. Multiplying that by (1/2.7) gives the distance light travels in the time the processor completes one cycle: about 11.11 cm.
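The arithmetic above can be checked with a few lines of Python. This is just a sketch; the function name light_distance_per_cycle_cm is mine, and it uses the same rounded speed of light as the post.

```python
SPEED_OF_LIGHT_CM_PER_S = 3e10   # rounded value used above (3*10^8 m/s)


def light_distance_per_cycle_cm(clock_hz):
    """Distance in cm that light travels during one clock cycle."""
    # distance = speed * time, and one cycle lasts 1/clock_hz seconds
    return SPEED_OF_LIGHT_CM_PER_S / clock_hz


print(round(light_distance_per_cycle_cm(1e9), 2))    # one cycle at 1 GHz lasts 1 ns
print(round(light_distance_per_cycle_cm(2.7e9), 2))  # the 2.7 GHz example
```

This prints 30.0 and 11.11, matching the figures worked out by hand.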

So, in the time the processor completes one cycle, light travels only about 11 cm. Put another way, while light travels 11 cm, the processor completes one instruction or a part of one (a cycle). Since no electrical signal can travel faster than light, a signal has to stay within that distance to arrive inside a single cycle, which is one reason processors are made so small. So the next time some Physics guy speaks about the greatness of light over computers, make sure to brainwash him with this information.

Tuesday, February 21, 2012

Hello People

Hello everyone,

I am a Computer Science undergrad from India. I am learning to build a search engine in the CS101 class offered by David Evans (Professor of Computer Science at the University of Virginia) and Sebastian Thrun (Research Professor of Computer Science at Stanford University) on the website

http://www.udacity.com/

I will be posting the cool things I am learning here. Mainly, I will use this blog as my notes in the process of building my own search engine.
