Building a search Engine with Python

Sunday, April 15, 2012

The Search Engine

The search engine is complete now. Although I have to improve it, so that the search responds to multi-word queries. For now, it responds to only single word queries. I made the code available in https://github.com/dileep98490/A-simple-Search-Engine-in-Python along with a README file. Till the previous post, I have been using a custom get_page(url) module, which returns the source of specific webpages. I changed the get_page(url), so that it returns the source of any url you pass into it as input. Below is the modified get_page(url)

import urllib

def get_page(url):#This function is just to return the webpage contents; the source of the webpage when a url is given.
 try:
  f = urllib.urlopen(url)
  page = f.read()
  f.close()
  #print page
  return page
 except: 
  return ""
 return ""

Also, I have created a input taking mechanism such that any page can be given as seed page. Also, the query is given as input, along with the maximum links to be checked as depth. The output is as shown below (the full source code is available on git-hub repository I mentioned above).

Enter the seed page
http://opencvuser.blogspot.com
Enter What you want to search
is
Enter the depth you wanna go
5

Started crawling, presently at depth..
4
3
2
1
0

Printing the results as is with page rank

http://opencvuser.blogspot.com --> 0.05
https://skydrive.live.com/redir.aspx?cid=124ec5b5bc117437&amp;resid=124EC5B5BC11
7437!161&amp;parid=124EC5B5BC117437!103&amp;authkey=!AMnmS6xJcSrXSyg --> 0.05142
85714286

After Sorting the results by page rank

1.      https://skydrive.live.com/redir.aspx?cid=124ec5b5bc117437&amp;resid=124E
C5B5BC117437!161&amp;parid=124EC5B5BC117437!103&amp;authkey=!AMnmS6xJcSrXSyg

2.      http://opencvuser.blogspot.com

The two results were sorted later as per their page rank. I have taken the depth as 5, so the program crawled 5 links and out of them 2 contain our keyword "is"

Wednesday, April 4, 2012

The Page rank algorithm

Page rank algorithm is the very first algorithm that Google employed. It helped Google become a leader among search engines, but the increase in the number of spammers led it to adopt newer variants of the algorithm. I implemented the basic one in the python code below. Compared to the previous post, only the changes and newer functions are posted here. The compute_ranks function computes the ranks of the pages i.e., all the links. A newer dictionary - graph , which contains pages as keys and all the links that occur in each page as their corresponding url lists. This graph is returned by the Crawl_web function. A Look_up_new function is defined that first shows the ranking of the returned url's that contain the keyword, sorts them using the QuickSort routine and arranges them in descending order. Below is the code, most of it is pretty self explanatory.

def compute_ranks(graph):#Computing ranks for a given graph -> for all the links in it
 d=0.8
 numloops=10
 ranks={}
 npages=len(graph)
 for page in graph:
  ranks[page]=1.0/npages
 for i in range(0,numloops):
  newranks={}
  for page in graph:
   newrank=(1-d)/npages
   for node in graph:
    if page in graph[node]:
     newrank=newrank+d*ranks[node]/len(graph[node])
   newranks[page]=newrank
  ranks=newranks
 return ranks
 
def Crawl_web(seed):#The website to act as seed page is given as input
 tocrawl=[seed]
 crawled=[]
 index={}
 graph={}#new graph
 while tocrawl:
  p=tocrawl.pop()
  if p not in crawled:#To remove the looping, if a page is already crawled and it is backlinked again by someother link we are crawling, we need not crawl it again
   c=get_page(p)
   add_page_to_index(index,p,c)
   f=get_all_links(c)
   union(tocrawl,f)
   graph[p]=f
   crawled.append(p)#As soon as a link is crawled it is appended to crawled. In the end when all the links are over, we will return the crawled since it contains all the links we have so far
 return crawled,index,graph #Returns the list of links
crawled,index,graph=Crawl_web('http://xkcd.com/353')#printing all the links
#print index 



def QuickSort(pages,ranks):#Sorting in descending order
 if len(pages)>1:
  piv=ranks[pages[0]]
  i=1
  j=1
  for j in range(1,len(pages)):
   if ranks[pages[j]]>piv:
    pages[i],pages[j]=pages[j],pages[i]
    i+=1
  pages[i-1],pages[0]=pages[0],pages[i-1]
  QuickSort(pages[1:i],ranks)
  QuickSort(pages[i+1:len(pages)],ranks)

def Look_up_new(index,ranks,keyword):
 pages=Look_up(index,keyword)
 for i in pages:
  print i+" --> "+str(ranks[i])#Displaying the lists, so that you can see the page rank along side
 QuickSort(pages,ranks)
 print "\nAfter Sorting the results by page rank\n"
 for i in pages:#This is how actually it looks like in search engine results - > sorted by page rank
  print i 
ranks=compute_ranks(graph)
Look_up_new(index,ranks,"is")

The results are as shown below

http://xkcd.com/353 --> 0.00645161290323
http://xkcd.com/554 --> 0.00663594470046

After Sorting the results by page rank

http://xkcd.com/554
http://xkcd.com/353

In the next post, I will post the whole code at one place and also, modify the get_page function so that you can use any page as seed page. Also, I will give an example of how our search engine works, if we have around 1000 websites crawled.

Sunday, April 1, 2012

Using of hash tables in Python

The use of hash tables improves the speed of search engine drastically. Python has inbuilt dictionaries for this purpose. The index can be any string or character or a number. So, we can store our keywords as indexes and the list of the urls in which it is present as value in the dictionary. The below is the modified code section from my previous post. I have commented where ever necessary. The comments for this post start with 'Hash'.

def Look_up(index,keyword):#This function is for given an index, it finds the keyword in the index and returns the list of links
 #f=[]
 if keyword in index:#Hash:Direct lookup, no need to iterate
  return index[keyword]
 return []
#The format of element in the index is <keyword>,[<List of urls that contain the keyword>]
def add_to_index(index,url,keyword):

 if keyword in index:
  if url not in index[keyword]:#Hash:To get rid of redundant urls
   index[keyword].append(url)
  return
 index[keyword]=[url]#Hash:A new hash entry
def add_page_to_index(index,url,content):#Adding the content of the webpage to the index
 for i in content.split():
  add_to_index(index,url,i)

def Crawl_web(seed):#The website to act as seed page is given as input
 tocrawl=[seed]
 crawled=[]
 index={}#Hash:Dictionary initialization
 while tocrawl:
  p=tocrawl.pop()
  if p not in crawled:#To remove the looping, if a page is already crawled and it is backlinked again by someother link we are crawling, we need not crawl it again
   c=get_page(p)
   add_page_to_index(index,p,c)
   union(tocrawl,get_all_links(c))
   crawled.append(p)#As soon as a link is crawled it is appended to crawled. In the end when all the links are over, we will return the crawled since it contains all the links we have so far
 return crawled,index #Returns the list of links
crawled,index=Crawl_web('http://xkcd.com/353')#printing all the links
#print index 
print Look_up(index,"is")##Searching for the keyword "is"

The output is

http://xkcd.com/353
http://xkcd.com/554

Friday, March 16, 2012

A Simple Search Engine

I have built a simple search engine as a follow up to my previous posts. This search engine, returns the list of url's that contain a specific keyword in the pages the engine had crawled till then. The functions of the last post were reused except that the Crawl_web function is modified. Along with it, three more new functions are added. The four of them are as shown below. So, in the main function, we first let the function crawl, giving a specific url as seed, forming an index. Then we go for looking the word "how" in the index (We can search for any keyword, just for demonstration I am using how). The Look_up function helps us in our later task returning a list of urls that contain our keyword.

def Look_up(index,keyword):#This function is for given an index, it finds the keyword in the index and returns the list of links
 f=[]
 for i in index:
  if i[0]==keyword:
   for j in i[1]:
    f.append(j)
 return f
#The format of element in the index is <keyword>,[<List of urls that contain the keyword>]
def add_to_index(index,url,keyword):
 for i in index:
  if keyword==i[0]:
   i[1].append(url)
   return
 index.append([keyword,[url]])
def add_page_to_index(index,url,content):#Adding the content of the webpage to the index
 for i in content.split():
  add_to_index(index,url,i)

def Crawl_web(seed):#The website to act as seed page is given as input
 tocrawl=[seed]
 crawled=[]
 index=[]
 while tocrawl:
  p=tocrawl.pop()
  if p not in crawled:#To remove the looping, if a page is already crawled and it is backlinked again by someother link we are crawling, we need not crawl it again
   c=get_page(p)
   add_page_to_index(index,p,c)
   union(tocrawl,get_all_links(c))
   crawled.append(p)#As soon as a link is crawled it is appended to crawled. In the end when all the links are over, we will return the crawled since it contains all the links we have so far
 return crawled,index #Returns the list of links
crawled,index=Crawl_web('http://xkcd.com/353')#printing all the links
#print index 
print Look_up(index,"how")#We are looking for the keyword "how" in the index

The output of the above code is as shown below. Please note that the above code is not complete, but had to be merged with the code previous post to be a complete one.

['http://xkcd.com/353']

Monday, March 12, 2012

Web Crawler (Part-3)

This post give you a way to extract all the links. What this program does, it will be better to explain in a step by step manner.

1. A url of the seed page is taken and passed to the Crawl_web function
2. The Crawl_web function takes the passed url as input string
2. It crawl the entire webpage with the given url and finds all the url's with <a href> tag in the page
3. Then it takes one of those url's, sees that it was not already crawled and goes to step-1. If all the url's are over, then it goes to the next step
4. The Crawl_web function returns the list of all the url's obtained by crawling starting from the given seed page
5. We print the url's

The program in python is as follows

#!/usr/bin/python
#Author : dileep98490@gmail.com


def get_page(url):#This function is just to return the webpage contents; the source of the webpage when a url is given. Note that this is not the original function and it is used to just serve the purpose here. To return the source of certain url's
    try:
        if url == "http://xkcd.com/353":
            return  """<?xml version="1.0" encoding="utf-8" ?><?xml-stylesheet href="http://imgs.xkcd.com/s/c40a9f8.css" type="text/css" media="screen" ?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml"> <head> <title>xkcd: Python</title> <link rel="stylesheet" type="text/css" href="http://imgs.xkcd.com/s/c40a9f8.css" media="screen" title="Default" /> <!--[if IE]><link rel="stylesheet" type="text/css" href="http://imgs.xkcd.com/s/ecbbecc.css" media="screen" title="Default" /><![endif]--> <link rel="alternate" type="application/atom+xml" title="Atom 1.0" href="/atom.xml" /> <link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="/rss.xml" /> <link rel="icon" href="http://imgs.xkcd.com/s/919f273.ico" type="image/x-icon" /> <link rel="shortcut icon" href="http://imgs.xkcd.com/s/919f273.ico" type="image/x-icon" /> </head> <body> <div id="container"> <div id="topContainer"> <div id="topLeft" class="dialog"> <div class="hd"><div class="c"></div></div> <div class="bd"> <div class="c"> <div class="s">\t<ul> <li><a href="http://xkcd.com/554"">Archive</a><br /></li>\t <li><a href="http://blag.xkcd.com/">News/Blag</a><br /></li> <li><a href="http://store.xkcd.com/">Store</a><br /></li> <li><a href="/about/">About</a><br /></li> <li><a href="http://forums.xkcd.com/">Forums</a><br /></li> </ul> </div> </div> </div> <div class="ft"><div class="c"></div></div> </div> <div id="topRight" class="dialog"> <div class="hd"><div class="c"></div></div> <div class="bd"> <div class="c"> <div class="s"> <div id="topRightContainer"> <div id="logo"> <a href="/"><img src="http://imgs.xkcd.com/s/9be30a7.png" alt="xkcd.com logo" height="83" width="185"/></a> <h2><br />A webcomic of romance,<br/> sarcasm, math, and language.</h2> <div class="clearleft"></div> <br />XKCD updates every Monday, Wednesday, and Friday. </div> </div> </div> </div> </div> <div class="ft"><div class="c"></div></div> </div> </div> <div id="contentContainer"> <div id="middleContent" class="dialog"> <div class="hd"><div class="c"></div></div> <div class="bd"> <div class="c"> <div class="s"><h1>Python</h1><br/><br /><div class="menuCont"> <ul> <li><a href="/1/">|&lt;</a></li> <li><a href="/352/" accesskey="p">&lt; Prev</a></li> <li><a href="http://dynamic.xkcd.com/random/comic/" id="rnd_btn_t">Random</a></li> <li><a href="/354/" accesskey="n">Next &gt;</a></li> <li><a href="/">&gt;|</a></li> </ul></div><br/><br/><img src="http://imgs.xkcd.com/comics/python.png" title="I wrote 20 short programs in Python yesterday. It was wonderful. Perl, Im leaving you." alt="Python" /><br/><br/><div class="menuCont"> <ul> <li><a href="/1/">|&lt;</a></li> <li><a href="/352/" accesskey="p">&lt; Prev</a></li> <li><a href="http://dynamic.xkcd.com/random/comic/" id="rnd_btn_b">Random</a></li> <li><a href="/354/" accesskey="n">Next &gt;</a></li> <li><a href="/">&gt;|</a></li> </ul></div><h3>Permanent link to this comic: http://xkcd.com/353/</h3><h3>Image URL (for hotlinking/embedding): http://imgs.xkcd.com/comics/python.png</h3><div id="transcript" style="display: none">[[ Guy 1 is talking to Guy 2, who is floating in the sky ]]Guy 1: You39;re flying! How?Guy 2: Python!Guy 2: I learned it last night! Everything is so simple!Guy 2: Hello world is just 39;print &quot;Hello, World!&quot; 39;Guy 1: I dunno... Dynamic typing? Whitespace?Guy 2: Come join us! Programming is fun again! It39;s a whole new world up here!Guy 1: But how are you flying?Guy 2: I just typed 39;import antigravity39;Guy 1: That39;s it?Guy 2: ...I also sampled everything in the medicine cabinet for comparison.Guy 2: But i think this is the python.{{ I wrote 20 short programs in Python yesterday. It was wonderful. Perl, I39;m leaving you. }}</div> </div> </div> </div> <div class="ft"><div class="c"></div></div> </div> <div id="middleFooter" class="dialog"> <div class="hd"><div class="c"></div></div> <div class="bd"> <div class="c"> <div class="s"> <img src="http://imgs.xkcd.com/s/a899e84.jpg" width="520" height="100" alt="Selected Comics" usemap=" comicmap" /> <map name="comicmap"> <area shape="rect" coords="0,0,100,100" href="/150/" alt="Grownups" /> <area shape="rect" coords="104,0,204,100" href="/730/" alt="Circuit Diagram" /> <area shape="rect" coords="208,0,308,100" href="/162/" alt="Angular Momentum" /> <area shape="rect" coords="312,0,412,100" href="/688/" alt="Self-Description" /> <area shape="rect" coords="416,0,520,100" href="/556/" alt="Alternative Energy Revolution" /> </map><br/><br />Search comic titles and transcripts:<br /><script type="text/javascript" src="//www.google.com/jsapi"></script><script type="text/javascript"> google.load(\"search\", \"1\"); google.setOnLoadCallback(function() { google.search.CustomSearchControl.attachAutoCompletion( \"012652707207066138651:zudjtuwe28q\", document.getElementById(\"q\"), \"cse-search-box\"); });</script><form action="//www.google.com/cse" id="cse-search-box"> <div> <input type="hidden" name="cx" value="012652707207066138651:zudjtuwe28q" /> <input type="hidden" name="ie" value="UTF-8" /> <input type="text" name="q" id="q" autocomplete="off" size="31" /> <input type="submit" name="sa" value="Search" /> </div></form><script type="text/javascript" src="//www.google.com/cse/brand?form=cse-search-box&lang=en"></script><a href="/rss.xml">RSS Feed</a> - <a href="/atom.xml">Atom Feed</a><br /> <br/> <div id="comicLinks"> Comics I enjoy:<br/> <a href="http://www.qwantz.com">Dinosaur Comics</a>, <a href="http://www.asofterworld.com">A Softer World</a>, <a href="http://pbfcomics.com/">Perry Bible Fellowship</a>, <a href="http://www.boltcity.com/copper/">Copper</a>, <a href="http://questionablecontent.net/">Questionable Content</a>, <a href="http://achewood.com/">Achewood</a>, <a href="http://wondermark.com/">Wondermark</a>, <a href="http://thisisindexed.com/">Indexed</a>, <a href="http://www.buttercupfestival.com/buttercupfestival.htm">Buttercup Festival</a> </div> <br/> Warning: this comic occasionally contains strong language (which may be unsuitable for children), unusual humor (which may be unsuitable for adults), and advanced mathematics (which may be unsuitable for liberal-arts majors).<br/> <br/> <h4>We did not invent the algorithm. The algorithm consistently finds Jesus. The algorithm killed Jeeves. <br />The algorithm is banned in China. The algorithm is from Jersey. The algorithm constantly finds Jesus.<br />This is not the algorithm. This is close.</h4><br/> <div class="line"></div> <br/> <div id="licenseText"> <!-- <a rel="license" href="http://creativecommons.org/licenses/by-nc/2.5/"><img alt="Creative Commons License" style="border:none" src="http://imgs.xkcd.com/static/somerights20.png" /></a><br/> --> This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/2.5/">Creative Commons Attribution-NonCommercial 2.5 License</a>.<!-- <rdf:RDF xmlns="http://web.resource.org/cc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns "><Work rdf:about=""><dc:creator>Randall Munroe</dc:creator><dcterms:rightsHolder>Randall Munroe</dcterms:rightsHolder><dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage" /><dc:source rdf:resource="http://www.xkcd.com/"/><license rdf:resource="http://creativecommons.org/licenses/by-nc/2.5/" /></Work><License rdf:about="http://creativecommons.org/licenses/by-nc/2.5/"><permits rdf:resource="http://web.resource.org/cc/Reproduction" /><permits rdf:resource="http://web.resource.org/cc/Distribution" /><requires rdf:resource="http://web.resource.org/cc/Notice" /><requires rdf:resource="http://web.resource.org/cc/Attribution" /><prohibits rdf:resource="http://web.resource.org/cc/CommercialUse" /><permits rdf:resource="http://web.resource.org/cc/DerivativeWorks" /></License></rdf:RDF> --> <br/> This means you\"re free to copy and share these comics (but not to sell them). <a href="/license.html">More details</a>.<br/> </div> </div> </div> </div> <div class="ft"><div class="c"></div></div> </div> </div> </div> </body></html> """
        elif url == "http://xkcd.com/554":
            return  """<?xml version="1.0" encoding="utf-8" ?> <?xml-stylesheet href="http://imgs.xkcd.com/s/c40a9f8.css" type="text/css" media="screen" ?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <title>xkcd: Not Enough Work</title> <link rel="stylesheet" type="text/css" href="http://imgs.xkcd.com/s/c40a9f8.css" media="screen" title="Default" /> <!--[if IE]><link rel="stylesheet" type="text/css" href="http://imgs.xkcd.com/s/ecbbecc.css" media="screen" title="Default" /><![endif]--> <link rel="alternate" type="application/atom+xml" title="Atom 1.0" href="/atom.xml" /> <link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="/rss.xml" /> <link rel="icon" href="http://imgs.xkcd.com/s/919f273.ico" type="image/x-icon" /> <link rel="shortcut icon" href="http://imgs.xkcd.com/s/919f273.ico" type="image/x-icon" /> </head> <body> <div id="container"> <div id="topContainer"> <div id="topLeft" class="dialog"> <div class="hd"><div class="c"></div></div> <div class="bd"> <div class="c"> <div class="s"> <ul> <li><a href="/archive/">Archive</a><br /></li> <li><a href="http://blag.xkcd.com/">News/Blag</a><br /></li> <li><a href="http://store.xkcd.com/">Store</a><br /></li> <li><a href="/about/">About</a><br /></li> <li><a href="http://forums.xkcd.com/">Forums</a><br /></li> </ul> </div> </div> </div> <div class="ft"><div class="c"></div></div> </div> <div id="topRight" class="dialog"> <div class="hd"><div class="c"></div></div> <div class="bd"> <div class="c"> <div class="s"> <div id="topRightContainer"> <div id="logo"> <a href="/"><img src="http://imgs.xkcd.com/s/9be30a7.png" alt="xkcd.com logo" height="83" width="185"/></a> <h2><br />A webcomic of romance,<br/> sarcasm, math, and language.</h2> <div class="clearleft"></div> XKCD updates every Monday, Wednesday, and Friday. <br /> Blag: Remember geohashing? <a href="http://blog.xkcd.com/2012/02/27/geohashing-2/">Something pretty cool</a> happened Sunday. </div> </div> </div> </div> </div> <div class="ft"><div class="c"></div></div> </div> </div> <div id="contentContainer"> <div id="middleContent" class="dialog"> <div class="hd"><div class="c"></div></div> <div class="bd"> <div class="c"> <div class="s"> <h1>Not Enough Work</h1><br/> <br /> <div class="menuCont"> <ul> <li><a href="/1/">|&lt;</a></li> <li><a href="/553/" accesskey="p">&lt; Prev</a></li> <li><a href="http://dynamic.xkcd.com/random/comic/" id="rnd_btn_t">Random</a></li> <li><a href="/555/" accesskey="n">Next &gt;</a></li> <li><a href="/">&gt;|</a></li> </ul> </div> <br/> <br/> <img src="http://imgs.xkcd.com/comics/not_enough_work.png" title="It39;s even harder if you39;re an asshole who pronounces &lt;&gt; brackets." alt="Not Enough Work" /><br/> <br/> <div class="menuCont"> <ul> <li><a href="/1/">|&lt;</a></li> <li><a href="/553/" accesskey="p">&lt; Prev</a></li> <li><a href="http://dynamic.xkcd.com/random/comic/" id="rnd_btn_b">Random</a></li> <li><a href="/555/" accesskey="n">Next &gt;</a></li> <li><a href="/">&gt;|</a></li> </ul> </div> <h3>Permanent link to this comic: http://xkcd.com/554/</h3> <h3>Image URL (for hotlinking/embedding): http://imgs.xkcd.com/comics/not_enough_work.png</h3> <div id="transcript" style="display: none">Narration: Signs your coders don39;t have enough work to do: [[A man sitting at his workstation; a female co-worker behind him]] Man: I39;m almost up to my old typing speed in dvorak [[Two men standing by a server rack]] Man  1: Our servers now support gopher. Man  1: Just in case. [[A woman standing near her workstation speaking to a male co-worker]] Woman: Our pages are now HTML, XHTML-STRICT, and haiku-compliant Man: Haiku? Woman: &lt;div class=&quot;main&quot;&gt; Woman: &lt;span id=&quot;marquee&quot;&gt; Woman: Blog!&lt; span&gt;&lt; div&gt; [[A woman sitting at her workstation]] Woman: Hey! Have you guys seen this webcomic? {{title text: It39;s even harder if you39;re an asshole who pronounces &lt;&gt; brackets.}}</div> </div> </div> </div> <div class="ft"><div class="c"></div></div> </div> <div id="middleFooter" class="dialog"> <div class="hd"><div class="c"></div></div> <div class="bd"> <div class="c"> <div class="s"> <img src="http://imgs.xkcd.com/s/a899e84.jpg" width="520" height="100" alt="Selected Comics" usemap=" comicmap" /> <map name="comicmap"> <area shape="rect" coords="0,0,100,100" href="/150/" alt="Grownups" /> <area shape="rect" coords="104,0,204,100" href="/730/" alt="Circuit Diagram" /> <area shape="rect" coords="208,0,308,100" href="/162/" alt="Angular Momentum" /> <area shape="rect" coords="312,0,412,100" href="/688/" alt="Self-Description" /> <area shape="rect" coords="416,0,520,100" href="/556/" alt="Alternative Energy Revolution" /> </map><br/><br /> Search comic titles and transcripts:<br /> <script type="text/javascript" src="//www.google.com/jsapi"></script> <script type="text/javascript"> google.load("search", "1"); google.search.CustomSearchControl.attachAutoCompletion( "012652707207066138651:zudjtuwe28q", document.getElementById("q"), "cse-search-box"); }); </script> <form action="//www.google.com/cse" id="cse-search-box"> <div> <input type="hidden" name="cx" value="012652707207066138651:zudjtuwe28q" /> <input type="hidden" name="ie" value="UTF-8" /> <input type="text" name="q" id="q" autocomplete="off" size="31" /> <input type="submit" name="sa" value="Search" /> </div> </form> <script type="text/javascript" src="//www.google.com/cse/brand?form=cse-search-box&lang=en"></script> <a href="/rss.xml">RSS Feed</a> - <a href="/atom.xml">Atom Feed</a> <br /> <br/> <div id="comicLinks"> Comics I enjoy:<br/> <a href="http://threewordphrase.com/">Three Word Phrase</a>, <a href="http://oglaf.com/">Oglaf</a> (nsfw), <a href="http://www.smbc-comics.com/">SMBC</a>, <a href="http://www.qwantz.com">Dinosaur Comics</a>, <a href="http://www.asofterworld.com">A Softer World</a>, <a href="http://buttersafe.com/">Buttersafe</a>, <a href="http://pbfcomics.com/">Perry Bible Fellowship</a>, <a href="http://questionablecontent.net/">Questionable Content</a>, <a href="http://www.buttercupfestival.com/buttercupfestival.htm">Buttercup Festival</a> </div> <br/> Warning: this comic occasionally contains strong language (which may be unsuitable for children), unusual humor (which may be unsuitable for adults), and advanced mathematics (which may be unsuitable for liberal-arts majors).<br/> <br/> <h4>We did not invent the algorithm. The algorithm consistently finds Jesus. The algorithm killed Jeeves. <br />The algorithm is banned in China. The algorithm is from Jersey. The algorithm constantly finds Jesus.<br />This is not the algorithm. This is close.</h4><br/> <div class="line"></div> <br/> <div id="licenseText"> <!-- <a rel="license" href="http://creativecommons.org/licenses/by-nc/2.5/"><img alt="Creative Commons License" style="border:none" src="http://imgs.xkcd.com/static/somerights20.png" /></a><br/> --> This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/2.5/">Creative Commons Attribution-NonCommercial 2.5 License</a>. <!-- <rdf:RDF xmlns="http://web.resource.org/cc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns "><Work rdf:about=""><dc:creator>Randall Munroe</dc:creator><dcterms:rightsHolder>Randall Munroe</dcterms:rightsHolder><dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage" /><dc:source rdf:resource="http://www.xkcd.com/"/><license rdf:resource="http://creativecommons.org/licenses/by-nc/2.5/" /></Work><License rdf:about="http://creativecommons.org/licenses/by-nc/2.5/"><permits rdf:resource="http://web.resource.org/cc/Reproduction" /><permits rdf:resource="http://web.resource.org/cc/Distribution" /><requires rdf:resource="http://web.resource.org/cc/Notice" /><requires rdf:resource="http://web.resource.org/cc/Attribution" /><prohibits rdf:resource="http://web.resource.org/cc/CommercialUse" /><permits rdf:resource="http://web.resource.org/cc/DerivativeWorks" /></License></rdf:RDF> --> <br/> This means you"re free to copy and share these comics (but not to sell them). <a href="/license.html">More details</a>.<br/> </div> </div> </div> </div> <div class="ft"><div class="c"></div></div> </div> </div> </div> </body> </html> """
    except:
        return ""
    return ""
def union(a,b):#The union function merges the second list into first, with out duplicating an element of a, if it's already in a. Similar to set union operator. This function does not change b. If a=[1,2,3] b=[2,3,4]. After union(a,b) makes a=[1,2,3,4] and b=[2,3,4]
 for e in b:
  if e not in a:
   a.append(e)

def get_next_url(page):
 start_link=page.find("a href")
 if(start_link==-1):
  return None,0
 start_quote=page.find('"',start_link)
 end_quote=page.find('"',start_quote+1)
 url=page[start_quote+1:end_quote]
 return url,end_quote
def get_all_links(page):
 links=[]
 while(True):
  url,n=get_next_url(page)
  page=page[n:]
  if url:
   links.append(url)
  else:
   break
 return links


def Crawl_web(seed):#The website to act as seed page is given as input
 tocrawl=[seed]
 crawled=[]
 while tocrawl:
  p=tocrawl.pop()
  if p not in crawled:#To remove the looping, if a page is already crawled and it is backlinked again by someother link we are crawling, we need not crawl it again
   union(tocrawl,get_all_links(get_page(p)))
   crawled.append(p)#As soon as a link is crawled it is appended to crawled. In the end when all the links are over, we will return the crawled since it contains all the links we have so far
 return crawled #Returns the list of links
for x in Crawl_web('http://xkcd.com/353'):#printing all the links
 print x

The output, when the above code is run as shown below. Note that in the real world wide web, we can get so many links, that our space may get exhausted before we have crawled the whole web. So it's better to stop at certain depth or certain maximum links are reached. We can modify the Crawl_web function a little to achieve the result, but I will leave that up to you, since it's very simple and intuitive.

http://xkcd.com/353
/license.html
http://www.buttercupfestival.com/buttercupfestival.htm
http://thisisindexed.com/
http://wondermark.com/
http://achewood.com/
http://questionablecontent.net/
http://www.boltcity.com/copper/
http://pbfcomics.com/
http://www.asofterworld.com
http://www.qwantz.com
/atom.xml
/rss.xml
/354/
http://dynamic.xkcd.com/random/comic/
/352/
/1/
/
http://forums.xkcd.com/
/about/
http://store.xkcd.com/
http://blag.xkcd.com/
http://xkcd.com/554
http://buttersafe.com/
http://www.smbc-comics.com/
http://oglaf.com/
http://threewordphrase.com/
/555/
/553/
http://blog.xkcd.com/2012/02/27/geohashing-2/
/archive/

Sunday, March 4, 2012

Web Crawler (Part-2)

Today, we will extract all the links from a given webpage. The webpage is taken as one continuous string. For simplicity, I am taking the source of a smaller webpage. The below is the python code for it

#!/usr/bin/python

def get_next_url(page):
 start_link=page.find("a href")#To find the position of link - by finding position of ahref tag
 if(start_link==-1):#To check if links are absent, since find returns -1 if it couldn't find the search string - in this case "ahref"
  return None,0
 start_quote=page.find('"',start_link)
 end_quote=page.find('"',start_quote+1)
 url=page[start_quote+1:end_quote]
 return url,end_quote
#Returning the end quote so that the next input contains the rest of the page, from the end quote position

page='<html><head><title>My great website</title></head><body><p>Where the hell is every body. Why didn they <i>fall</i> on my site</p><a href="http://dileepwrites.wordpress.com">Dilstories</a><a href="http://opencvuser.blospot.com">Learn openCV</a><a href="http://systemspro.blogspot.com">Learn Systems Programming</a></body></html>'


while(True):
       url,n=get_next_url(page)
       page=page[n:]
       if url:#If it's none, the loop breaks
          print url
       else:
          break

First we have defined a procedure using the def. The input to that function is the rest of the webpage, that we need to find the links from. If we could not find any more links, the procedure returns None and 0. It can be used to stop exploring by using a break in the while loop of main function. Finally the output of the above programme is

http://dileepwrites.wordpress.com
http://opencvuser.blospot.com
http://systemspro.blogspot.com

Saturday, February 25, 2012

Web Crawler (Part-1)

We will be using Python to build the search engine.

If you look at a web page, it contains a lot of information. There will be pictures, videos, buttons and what not. If we try to crawl them directly, it's a huge tedious task. What we will be crawling is the source of the web document. The easiest way to view this is Rightclick-> ViewSource or View Page Source on the webpage you are viewing. Actually, this is the only information, that a Web server sends when a browser requests the page. What we actually see is it's rendering by our Web Browser.

Now, our aim is to find all the links in a given page source of a website or more specifically from a single webpage. How the page source is "Given" will be a future topic and will be covered in latter posts. Also, let us first write, how to extract the first link that appears in that source page, and then in the next part of this topic, we will extract all the links. The Python program that does this is as follows

#!/usr/bin/python

a='<html><head></head><body> <a href="http://opencvuser.blogspot.com">Learn Open CV</a><a href="http://dileepwrites.blogspot.com">Dileep Writes</a></body></html>'
#Storing the source in 'a'

start=a.find('a href="')
#The start of ahref tag

#start+8 Considering no extra white spaces, will be the start of the URL

End=a.find('"',start+8)
#End of URL - where the quotes end

print a[start+8:End]
#Printing the URL without quotes

In the variable a, we are storing a sample html source with two links. Links may be in many ways, but we are assuming that they will start in the most standard manner with <a href></a> tags.

Inorder to extract the first link from a , we are exploring the a using the string find method of Python. This method gives the first occurrence of the substring we are searching. The location is the offset of the substring from the beginning of the string. It's first argument is the string to be searched for and the second optional one, from which location to start searching. So, in start , we are storing the location where the a of a href occurs. After that, we need to find closing quotes of the URL to know where it ends. The starting quotes of the URL occur at location start+7 , so from there on, we can search for the End of the URL where the quotes end. Then we are printing the URL. The below one is the output

http://opencvuser.blogspot.com

Thus we have extracted the first link. In the next post I will tell you how to search and extract the all the URL's. It's just a little bit extension from this. Also, if any python gurus are out there, I would love to see this code much more simplified.

Pages