The PageRank algorithm is the very first algorithm that Google employed. It helped Google become a leader among search engines, but the increase in the number of spammers led Google to adopt newer variants of the algorithm. I implemented the basic version in the Python code below. Compared to the previous post, only the changes and the newer functions are posted here. The compute_ranks function computes the ranks of the pages, i.e., of all the links. A new dictionary, **graph**, has the pages as keys and, for each page, the list of all the links that occur on that page as its value. This **graph** is returned by the Crawl_web function. A Look_up_new function is defined that first shows the rank of each returned URL that contains the keyword, then sorts the URLs in descending order of rank using the QuickSort routine.
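
For instance, the graph dictionary built by Crawl_web has a shape like the following sketch (the URLs here are made up purely for illustration and are not part of the crawl):

```
# Hypothetical illustration of the graph dictionary returned by Crawl_web:
# each crawled page maps to the list of links found on that page.
example_graph = {
    'http://example.com/home':  ['http://example.com/about', 'http://example.com/blog'],
    'http://example.com/about': ['http://example.com/home'],
    'http://example.com/blog':  [],
}
```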

Below is the code; most of it is pretty self-explanatory.

```
def compute_ranks(graph):
    # Computing ranks for a given graph -> for all the links in it
    d = 0.8        # damping factor
    numloops = 10  # number of update iterations
    ranks = {}
    npages = len(graph)
    for page in graph:
        ranks[page] = 1.0 / npages
    for i in range(0, numloops):
        newranks = {}
        for page in graph:
            newrank = (1 - d) / npages
            for node in graph:
                if page in graph[node]:
                    newrank = newrank + d * ranks[node] / len(graph[node])
            newranks[page] = newrank
        ranks = newranks
    return ranks


def Crawl_web(seed):  # The website to act as the seed page is given as input
    tocrawl = [seed]
    crawled = []
    index = {}
    graph = {}  # new dictionary: page -> list of links found on that page
    while tocrawl:
        p = tocrawl.pop()
        # To avoid looping: if a page is already crawled and it is backlinked
        # again by some other link we are crawling, we need not crawl it again.
        if p not in crawled:
            c = get_page(p)
            add_page_to_index(index, p, c)
            f = get_all_links(c)
            union(tocrawl, f)
            graph[p] = f
            # As soon as a link is crawled it is appended to crawled. When all
            # the links are done, crawled contains every link we have so far.
            crawled.append(p)
    return crawled, index, graph  # Returns the list of links crawled, the index and the graph


crawled, index, graph = Crawl_web('http://xkcd.com/353')  # Crawl starting from the seed page
# print index


def QuickSort(pages, ranks, lo=0, hi=None):  # Sorts pages in place, in descending order of rank
    if hi is None:
        hi = len(pages)
    if hi - lo > 1:
        piv = ranks[pages[lo]]
        i = lo + 1
        for j in range(lo + 1, hi):
            if ranks[pages[j]] > piv:
                pages[i], pages[j] = pages[j], pages[i]
                i += 1
        pages[i - 1], pages[lo] = pages[lo], pages[i - 1]
        # Recurse on index ranges so the swaps affect the original list
        QuickSort(pages, ranks, lo, i - 1)  # pages with rank greater than the pivot
        QuickSort(pages, ranks, i, hi)      # pages with rank not greater than the pivot


def Look_up_new(index, ranks, keyword):
    pages = Look_up(index, keyword)
    for i in pages:
        print i + " --> " + str(ranks[i])  # Display each url alongside its page rank
    QuickSort(pages, ranks)
    print "\nAfter Sorting the results by page rank\n"
    for i in pages:  # This is how it actually looks in search engine results -> sorted by page rank
        print i


ranks = compute_ranks(graph)
Look_up_new(index, ranks, "is")
```
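
To get a feel for the rank computation on its own, here is a minimal sketch that runs compute_ranks on the small made-up example_graph shown earlier (it assumes both are already defined; the output numbers depend on the damping factor and the number of loops):

```
# Rank the three hypothetical pages from example_graph above.
example_ranks = compute_ranks(example_graph)
for url in sorted(example_ranks, key=example_ranks.get, reverse=True):
    print url + " --> " + str(example_ranks[url])
# 'home' ends up with the highest rank: it receives the whole of 'about's rank,
# while 'about' and 'blog' each receive only half of 'home's rank.
```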

The results are as shown below

```
http://xkcd.com/353 --> 0.00645161290323
http://xkcd.com/554 --> 0.00663594470046
After Sorting the results by page rank
http://xkcd.com/554
http://xkcd.com/353
```

In the next post, I will post the whole code in one place and also modify the get_page function so that you can use any page as the seed page. I will also give an example of how our search engine works once we have around 1000 websites crawled.
