Sunday, March 4, 2012

Web Crawler (Part-2)

Today, we will extract all the links from a given webpage. The webpage is treated as one continuous string. For simplicity, I am using the source of a small webpage. Below is the Python code for it:

#!/usr/bin/python

def get_next_url(page):
    # Find the position of the next link by locating the 'a href' attribute
    start_link = page.find("a href")
    # find() returns -1 if it couldn't find the search string, i.e. no links are left
    if start_link == -1:
        return None, 0
    # The URL sits between the pair of double quotes that follows 'a href'
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    # Return end_quote so that the next call can continue with the rest of the page
    return url, end_quote

page='<html><head><title>My great website</title></head><body><p>Where the hell is everybody? Why didn\'t they <i>fall</i> on my site?</p><a href="http://dileepwrites.wordpress.com">Dilstories</a><a href="http://opencvuser.blospot.com">Learn openCV</a><a href="http://systemspro.blogspot.com">Learn Systems Programming</a></body></html>'


while True:
    url, n = get_next_url(page)
    # Slice off the part of the page we have already scanned
    page = page[n:]
    if url:
        print(url)
    else:
        # get_next_url returned None, so there are no more links
        break

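As a quick sanity check (this snippet is my own, not part of the original listing), a single call returns the first URL together with the index of its closing quote, which is where the next scan should resume:

>>> get_next_url('<a href="http://example.com">x</a>')
('http://example.com', 27)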

First, we define a procedure using def. Its input is the remaining part of the webpage in which we still need to find links. If no more links can be found, the procedure returns None and 0; the while loop in the main code checks for that None and uses break to stop exploring. Finally, the output of the above program is

http://dileepwrites.wordpress.com
http://opencvuser.blospot.com
http://systemspro.blogspot.com 
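If you want to collect the links instead of printing them as you go, here is a minimal sketch of a helper built on the same get_next_url as above (the name get_all_links is my choice, not from the original post):

def get_all_links(page):
    links = []
    while True:
        url, n = get_next_url(page)
        if not url:
            # No more links on the page
            break
        links.append(url)
        page = page[n:]
    return links

Calling get_all_links(page) on the sample page returns the same three URLs as a Python list.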


1 comment:

  1. You said we have to merge this with the previous part to make it work. Do you mean we must write both parts in one Python file and save it, or can we write them in separate files? If the latter, how can we merge the two? Thanks.
