Building a search Engine with Python: The Page rank algorithm

Page rank algorithm is the very first algorithm that Google employed. It helped Google become a leader among search engines, but the increase in the number of spammers led it to adopt newer variants of the algorithm. I implemented the basic one in the python code below. Compared to the previous post, only the changes and newer functions are posted here. The compute_ranks function computes the ranks of the pages i.e., all the links. A newer dictionary - graph , which contains pages as keys and all the links that occur in each page as their corresponding url lists. This graph is returned by the Crawl_web function. A Look_up_new function is defined that first shows the ranking of the returned url's that contain the keyword, sorts them using the QuickSort routine and arranges them in descending order. Below is the code, most of it is pretty self explanatory.

def compute_ranks(graph):#Computing ranks for a given graph -> for all the links in it
 d=0.8
 numloops=10
 ranks={}
 npages=len(graph)
 for page in graph:
  ranks[page]=1.0/npages
 for i in range(0,numloops):
  newranks={}
  for page in graph:
   newrank=(1-d)/npages
   for node in graph:
    if page in graph[node]:
     newrank=newrank+d*ranks[node]/len(graph[node])
   newranks[page]=newrank
  ranks=newranks
 return ranks
 
def Crawl_web(seed):#The website to act as seed page is given as input
 tocrawl=[seed]
 crawled=[]
 index={}
 graph={}#new graph
 while tocrawl:
  p=tocrawl.pop()
  if p not in crawled:#To remove the looping, if a page is already crawled and it is backlinked again by someother link we are crawling, we need not crawl it again
   c=get_page(p)
   add_page_to_index(index,p,c)
   f=get_all_links(c)
   union(tocrawl,f)
   graph[p]=f
   crawled.append(p)#As soon as a link is crawled it is appended to crawled. In the end when all the links are over, we will return the crawled since it contains all the links we have so far
 return crawled,index,graph #Returns the list of links
crawled,index,graph=Crawl_web('http://xkcd.com/353')#printing all the links
#print index 



def QuickSort(pages,ranks):#Sorting in descending order
 if len(pages)>1:
  piv=ranks[pages[0]]
  i=1
  j=1
  for j in range(1,len(pages)):
   if ranks[pages[j]]>piv:
    pages[i],pages[j]=pages[j],pages[i]
    i+=1
  pages[i-1],pages[0]=pages[0],pages[i-1]
  QuickSort(pages[1:i],ranks)
  QuickSort(pages[i+1:len(pages)],ranks)

def Look_up_new(index,ranks,keyword):
 pages=Look_up(index,keyword)
 for i in pages:
  print i+" --> "+str(ranks[i])#Displaying the lists, so that you can see the page rank along side
 QuickSort(pages,ranks)
 print "\nAfter Sorting the results by page rank\n"
 for i in pages:#This is how actually it looks like in search engine results - > sorted by page rank
  print i 
ranks=compute_ranks(graph)
Look_up_new(index,ranks,"is")

The results are as shown below

http://xkcd.com/353 --> 0.00645161290323
http://xkcd.com/554 --> 0.00663594470046

After Sorting the results by page rank

http://xkcd.com/554
http://xkcd.com/353

In the next post, I will post the whole code at one place and also, modify the get_page function so that you can use any page as seed page. Also, I will give an example of how our search engine works, if we have around 1000 websites crawled.

Building a search Engine with Python

Pages

Wednesday, April 4, 2012

The Page rank algorithm

No comments:

Post a Comment