Building a search Engine with Python: April 2012

Sunday, April 15, 2012

The Search Engine

The search engine is complete now. Although I have to improve it, so that the search responds to multi-word queries. For now, it responds to only single word queries. I made the code available in https://github.com/dileep98490/A-simple-Search-Engine-in-Python along with a README file. Till the previous post, I have been using a custom get_page(url) module, which returns the source of specific webpages. I changed the get_page(url), so that it returns the source of any url you pass into it as input. Below is the modified get_page(url)

import urllib

def get_page(url):#This function is just to return the webpage contents; the source of the webpage when a url is given.
 try:
  f = urllib.urlopen(url)
  page = f.read()
  f.close()
  #print page
  return page
 except: 
  return ""
 return ""

Also, I have created a input taking mechanism such that any page can be given as seed page. Also, the query is given as input, along with the maximum links to be checked as depth. The output is as shown below (the full source code is available on git-hub repository I mentioned above).

Enter the seed page
http://opencvuser.blogspot.com
Enter What you want to search
is
Enter the depth you wanna go
5

Started crawling, presently at depth..
4
3
2
1
0

Printing the results as is with page rank

http://opencvuser.blogspot.com --> 0.05
https://skydrive.live.com/redir.aspx?cid=124ec5b5bc117437&amp;resid=124EC5B5BC11
7437!161&amp;parid=124EC5B5BC117437!103&amp;authkey=!AMnmS6xJcSrXSyg --> 0.05142
85714286

After Sorting the results by page rank

1.      https://skydrive.live.com/redir.aspx?cid=124ec5b5bc117437&amp;resid=124E
C5B5BC117437!161&amp;parid=124EC5B5BC117437!103&amp;authkey=!AMnmS6xJcSrXSyg

2.      http://opencvuser.blogspot.com

The two results were sorted later as per their page rank. I have taken the depth as 5, so the program crawled 5 links and out of them 2 contain our keyword "is"

Wednesday, April 4, 2012

The Page rank algorithm

Page rank algorithm is the very first algorithm that Google employed. It helped Google become a leader among search engines, but the increase in the number of spammers led it to adopt newer variants of the algorithm. I implemented the basic one in the python code below. Compared to the previous post, only the changes and newer functions are posted here. The compute_ranks function computes the ranks of the pages i.e., all the links. A newer dictionary - graph , which contains pages as keys and all the links that occur in each page as their corresponding url lists. This graph is returned by the Crawl_web function. A Look_up_new function is defined that first shows the ranking of the returned url's that contain the keyword, sorts them using the QuickSort routine and arranges them in descending order. Below is the code, most of it is pretty self explanatory.

def compute_ranks(graph):#Computing ranks for a given graph -> for all the links in it
 d=0.8
 numloops=10
 ranks={}
 npages=len(graph)
 for page in graph:
  ranks[page]=1.0/npages
 for i in range(0,numloops):
  newranks={}
  for page in graph:
   newrank=(1-d)/npages
   for node in graph:
    if page in graph[node]:
     newrank=newrank+d*ranks[node]/len(graph[node])
   newranks[page]=newrank
  ranks=newranks
 return ranks
 
def Crawl_web(seed):#The website to act as seed page is given as input
 tocrawl=[seed]
 crawled=[]
 index={}
 graph={}#new graph
 while tocrawl:
  p=tocrawl.pop()
  if p not in crawled:#To remove the looping, if a page is already crawled and it is backlinked again by someother link we are crawling, we need not crawl it again
   c=get_page(p)
   add_page_to_index(index,p,c)
   f=get_all_links(c)
   union(tocrawl,f)
   graph[p]=f
   crawled.append(p)#As soon as a link is crawled it is appended to crawled. In the end when all the links are over, we will return the crawled since it contains all the links we have so far
 return crawled,index,graph #Returns the list of links
crawled,index,graph=Crawl_web('http://xkcd.com/353')#printing all the links
#print index 



def QuickSort(pages,ranks):#Sorting in descending order
 if len(pages)>1:
  piv=ranks[pages[0]]
  i=1
  j=1
  for j in range(1,len(pages)):
   if ranks[pages[j]]>piv:
    pages[i],pages[j]=pages[j],pages[i]
    i+=1
  pages[i-1],pages[0]=pages[0],pages[i-1]
  QuickSort(pages[1:i],ranks)
  QuickSort(pages[i+1:len(pages)],ranks)

def Look_up_new(index,ranks,keyword):
 pages=Look_up(index,keyword)
 for i in pages:
  print i+" --> "+str(ranks[i])#Displaying the lists, so that you can see the page rank along side
 QuickSort(pages,ranks)
 print "\nAfter Sorting the results by page rank\n"
 for i in pages:#This is how actually it looks like in search engine results - > sorted by page rank
  print i 
ranks=compute_ranks(graph)
Look_up_new(index,ranks,"is")

The results are as shown below

http://xkcd.com/353 --> 0.00645161290323
http://xkcd.com/554 --> 0.00663594470046

After Sorting the results by page rank

http://xkcd.com/554
http://xkcd.com/353

In the next post, I will post the whole code at one place and also, modify the get_page function so that you can use any page as seed page. Also, I will give an example of how our search engine works, if we have around 1000 websites crawled.

Sunday, April 1, 2012

Using of hash tables in Python

The use of hash tables improves the speed of search engine drastically. Python has inbuilt dictionaries for this purpose. The index can be any string or character or a number. So, we can store our keywords as indexes and the list of the urls in which it is present as value in the dictionary. The below is the modified code section from my previous post. I have commented where ever necessary. The comments for this post start with 'Hash'.

def Look_up(index,keyword):#This function is for given an index, it finds the keyword in the index and returns the list of links
 #f=[]
 if keyword in index:#Hash:Direct lookup, no need to iterate
  return index[keyword]
 return []
#The format of element in the index is <keyword>,[<List of urls that contain the keyword>]
def add_to_index(index,url,keyword):

 if keyword in index:
  if url not in index[keyword]:#Hash:To get rid of redundant urls
   index[keyword].append(url)
  return
 index[keyword]=[url]#Hash:A new hash entry
def add_page_to_index(index,url,content):#Adding the content of the webpage to the index
 for i in content.split():
  add_to_index(index,url,i)

def Crawl_web(seed):#The website to act as seed page is given as input
 tocrawl=[seed]
 crawled=[]
 index={}#Hash:Dictionary initialization
 while tocrawl:
  p=tocrawl.pop()
  if p not in crawled:#To remove the looping, if a page is already crawled and it is backlinked again by someother link we are crawling, we need not crawl it again
   c=get_page(p)
   add_page_to_index(index,p,c)
   union(tocrawl,get_all_links(c))
   crawled.append(p)#As soon as a link is crawled it is appended to crawled. In the end when all the links are over, we will return the crawled since it contains all the links we have so far
 return crawled,index #Returns the list of links
crawled,index=Crawl_web('http://xkcd.com/353')#printing all the links
#print index 
print Look_up(index,"is")##Searching for the keyword "is"

The output is

http://xkcd.com/353
http://xkcd.com/554

Pages