Saturday, February 25, 2012

Web Crawler (Part-1)

We will be using Python to build the search engine.

If you look at a web page, it contains a lot of information: pictures, videos, buttons and what not. Crawling all of that directly would be a huge, tedious task. What we will crawl instead is the source of the web document. The easiest way to view it is to right-click on the page you are viewing and choose View Source or View Page Source. This source is what the Web server sends back when a browser requests the page. What we actually see is its rendering by our Web browser.

Now, our aim is to find all the links in the page source of a single web page. How the page source is "given" to us will be covered in later posts. Let us first write code to extract the first link that appears in the source; in the next part of this topic, we will extract all the links. The Python program that does this is as follows:

#!/usr/bin/python

# Storing the source in 'a'
a='<html><head></head><body> <a href="http://opencvuser.blogspot.com">Learn Open CV</a><a href="http://dileepwrites.blogspot.com">Dileep Writes</a></body></html>'

# Start of the 'a href="' marker (find returns -1 if it is absent)
start=a.find('a href="')

# 'a href="' is 8 characters long, so, assuming no extra white spaces,
# start+8 will be the start of the URL

# End of URL - where the closing quote is
End=a.find('"',start+8)

# Printing the URL without quotes (the parentheses work in Python 2 and 3)
print(a[start+8:End])

In the variable a, we are storing a sample HTML source with two links. Links can be written in many ways, but we are assuming that they appear in the most standard form, <a href="...">...</a>.
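To make that assumption concrete, here is a small sketch of which link styles the marker 'a href="' does and does not match. The three sample tags and the example.com URL are made up for illustration and are not part of the original program.

```python
marker = 'a href="'  # the exact substring the program searches for

# Hypothetical link variants, for illustration only
variants = [
    '<a href="http://example.com">x</a>',              # the standard form we assume
    "<a href='http://example.com'>x</a>",              # single quotes instead of double
    '<a class="nav" href="http://example.com">x</a>',  # another attribute before href
]

# find returns the offset of the marker, or -1 when it is not present
results = [html.find(marker) for html in variants]
print(results)
```

Only the first variant is found; the other two give -1, so a crawler built on this marker would skip such links.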

In order to extract the first link from a, we search it using Python's string find method. This method returns the location of the first occurrence of the substring we are searching for, as an offset from the beginning of the string, or -1 if the substring does not occur. Its first argument is the string to search for, and its optional second argument is the location from which to start searching. So, in start, we are storing the location where the a of a href occurs. After that, we need to find the closing quote of the URL to know where it ends. The opening quote of the URL is at location start+7, so the URL itself begins at start+8, and from there on we can search for End, the location of the closing quote. Then we print the URL. The output is shown below:

http://opencvuser.blogspot.com
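If the find method is new to you, the following sketch shows its behaviour on a short string: the offset it returns, the optional start position, and the -1 result for a missing substring. The sample string and variable names here are invented for the illustration.

```python
# Hypothetical sample string, not taken from any page source
text = 'say "hi" and "bye"'

first_quote = text.find('"')                    # first occurrence, searching from offset 0
second_quote = text.find('"', first_quote + 1)  # resume the search just past the first match
missing = text.find('<a href=')                 # substring is absent, so this is -1

print(first_quote, second_quote, missing)
print(text[first_quote + 1:second_quote])       # the text between the first pair of quotes
```

This is the same pattern the program uses: find a marker, then search again from just past it to locate the closing quote.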

Thus we have extracted the first link. In the next post I will show how to search for and extract all the URLs. It's just a small extension of this. Also, if any Python gurus are out there, I would love to see this code simplified further.
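One possible tidy-up, offered only as a sketch: wrap the same steps in a function and guard against find returning -1. The function name first_link and the variable names are assumptions made for this example, and it still relies on links being written exactly as a href="...".

```python
def first_link(source):
    """Return the first double-quoted href URL in source, or None if there is none."""
    marker = 'a href="'              # same marker the program above searches for
    start = source.find(marker)
    if start == -1:                  # no link of this exact form
        return None
    url_start = start + len(marker)  # avoids hard-coding the offset 8
    url_end = source.find('"', url_start)
    if url_end == -1:                # opening quote without a closing one
        return None
    return source[url_start:url_end]

page = '<html><head></head><body> <a href="http://opencvuser.blogspot.com">Learn Open CV</a><a href="http://dileepwrites.blogspot.com">Dileep Writes</a></body></html>'
print(first_link(page))
print(first_link('<p>no links here</p>'))
```

With the sample source it prints the same first URL as before, and it prints None for a page that has no matching link instead of slicing with a bogus offset.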

7 comments:

  1. Hello dileep kumar,
    i am trying to build a search engine in python too,
    i have completed 3 units udacity!!
    thanks a lot ur blog is really helpful :)

  2. also i would like to know if u ever tried to develop it further? can it be developed to work like google or bing kind of sites.. this is a console version i knw.. still!!

  3. Yes, same here I am working to develop a new search engine as Shallini I am taking CS101 at Udacity, the course and instructor DAVE are really awesome and your blog @Dileep Kumar is very helpful. thanks
