Building a search Engine with Python: Using hash tables in Python

Sunday, April 1, 2012

Using hash tables in Python

Hash tables drastically improve the speed of a search engine, because a keyword can be looked up directly instead of by iterating over the whole index. Python has built-in dictionaries for this purpose. A dictionary key can be a string, a character or a number (any hashable value). So we can store each keyword as a key, with the list of the urls that contain it as the value. Below is the modified code section from my previous post. I have commented wherever necessary. The comments added for this post start with 'Hash'.
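Before the full listing, here is a minimal standalone sketch of the keyword-to-urls idea. It is only an illustration under assumptions: the `pages` dictionary, the `example.com` urls and the `build_index` helper are hypothetical and are not part of the crawler code from the previous post.

```python
# Hypothetical in-memory "pages" standing in for crawled content.
# These urls and texts are made up for illustration only.
pages = {
    "http://example.com/a": "python is fun",
    "http://example.com/b": "hashing is fast",
}

def build_index(pages):
    """Map each keyword to the list of urls whose content contains it."""
    index = {}
    for url in sorted(pages):  # sorted only to make the url order predictable
        for word in pages[url].split():
            urls = index.setdefault(word, [])  # new hash entry if word is unseen
            if url not in urls:  # avoid storing the same url twice
                urls.append(url)
    return index

index = build_index(pages)
print(index.get("is", []))       # both urls contain "is"
print(index.get("missing", []))  # unknown keyword gives an empty list
```

A lookup here is a single dictionary access by key, which is the same property the 'Hash' comments in the listing rely on.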

# Given an index and a keyword, return the list of urls that contain the keyword.
def Look_up(index, keyword):
    if keyword in index:  # Hash: direct lookup, no need to iterate
        return index[keyword]
    return []

# Each entry in the index has the form <keyword>: [<list of urls that contain the keyword>]
def add_to_index(index, url, keyword):
    if keyword in index:
        if url not in index[keyword]:  # Hash: avoid storing redundant urls
            index[keyword].append(url)
        return
    index[keyword] = [url]  # Hash: a new hash entry

# Add every word in the content of a webpage to the index.
def add_page_to_index(index, url, content):
    for word in content.split():
        add_to_index(index, url, word)

# Crawl the web starting from the seed page given as input.
# get_page, get_all_links and union are not defined in this section;
# they are expected to come from the previous post's code.
def Crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    index = {}  # Hash: dictionary initialization
    while tocrawl:
        p = tocrawl.pop()
        # Skip pages that are already crawled, so that a backlink from
        # some other page does not send the crawler into a loop.
        if p not in crawled:
            c = get_page(p)
            add_page_to_index(index, p, c)
            union(tocrawl, get_all_links(c))
            # Record the page as soon as it is crawled; once tocrawl is
            # empty, crawled holds every link visited so far.
            crawled.append(p)
    return crawled, index  # Returns the list of crawled links and the index

crawled, index = Crawl_web('http://xkcd.com/353')  # Crawl starting from the seed page
# print(index)
print(Look_up(index, "is"))  # Search for the keyword "is"

The output is