Friday, March 16, 2012

A Simple Search Engine

I have built a simple search engine as a follow-up to my previous posts. This search engine returns the list of URLs whose pages contain a specific keyword, among the pages the engine has crawled so far. The functions of the last post are reused, except that the Crawl_web function is modified, and three new functions are added. The four of them are shown below. In the main program, we first let Crawl_web crawl from a specific URL given as the seed, building an index as it goes. Then we look up the word "how" in the index (we can search for any keyword; I am using "how" just for demonstration). The Look_up function handles this second task, returning a list of URLs that contain our keyword.

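def Look_up(index,keyword):#Given an index, this function finds the keyword in the index and returns the list of links that contain it
 f=[]#The list of links found for the keyword; stays empty if the keyword is not in the index
 for i in index:
  if i[0]==keyword:
   for j in i[1]:
    f.append(j)
 return f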
#The format of element in the index is <keyword>,[<List of urls that contain the keyword>]
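def add_to_index(index,url,keyword):
 for i in index:
  if keyword==i[0]:
   i[1].append(url)#The keyword is already in the index, so the url is added to its list
   return
 index.append([keyword,[url]])#The keyword is new, so a new element is added to the index
def add_page_to_index(index,url,content):#Adding the content of the webpage to the index
 for i in content.split():
  add_to_index(index,url,i)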

def Crawl_web(seed):#The website to act as seed page is given as input
 tocrawl,crawled,index=[seed],[],[]
 while tocrawl:
  p=tocrawl.pop()
  if p not in crawled:#To avoid looping: if a page is already crawled and it is linked again by some other page we are crawling, we need not crawl it again
   c=get_page(p)
   add_page_to_index(index,p,c)#Every word of the page is added to the index
   union(tocrawl,get_all_links(c))
   crawled.append(p)#As soon as a link is crawled it is appended to crawled. In the end, when all the links are over, crawled contains all the links we have so far
 return crawled,index #Returns the list of links and the index
crawled,index=Crawl_web('')#Crawling from the seed url and building the index
#print index
print(Look_up(index,"how"))#We are looking for the keyword "how" in the index

The output of the above code is as shown below. Please note that the above code is not complete on its own; it has to be merged with the code from the previous post to run.
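The index format is easiest to see on a small input. Below is a minimal, self-contained sketch (written for Python 3) that swaps the list-of-pairs index for a dictionary, so a lookup does not have to scan every keyword. This is an alternative, not the code used in this post: the `pages` data and its URLs are made up for illustration, and unlike add_to_index above, the sketch skips a URL that is already listed for a word.

```python
# Toy pages standing in for crawled content; the URLs and text are made up.
pages = {
    "http://example.com/a": "how to fly with python",
    "http://example.com/b": "python is fun",
    "http://example.com/c": "how are you flying",
}

def add_page_to_index(index, url, content):
    """Record url under every whitespace-separated word in content."""
    for word in content.split():
        urls = index.setdefault(word, [])
        if url not in urls:  # do not list the same page twice for a repeated word
            urls.append(url)

def look_up(index, keyword):
    """Return the list of urls whose content contains keyword, or an empty list."""
    return index.get(keyword, [])

index = {}
for url, content in pages.items():
    add_page_to_index(index, url, content)

print(look_up(index, "how"))
```

As in the post, the words come from a plain split() on whitespace, so "how" and "How?" would still be indexed as different keywords.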


Monday, March 12, 2012

Web Crawler (Part-3)

This post gives you a way to extract all the links reachable from a seed page. What the program does is best explained step by step.

1. The URL of the seed page is taken and passed to the Crawl_web function
2. The Crawl_web function takes the passed URL as an input string
3. It crawls the webpage at the given URL and finds all the URLs in <a href> tags on the page
4. Then it takes one of those URLs, checks that it was not already crawled, and goes back to step 3 with it. When no URLs are left, it goes to the next step
5. The Crawl_web function returns the list of all the URLs obtained by crawling from the given seed page
6. We print the URLs
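The link-finding step above searches the page source for the text "a href". As an alternative sketch (not the method used in this post), Python 3's standard html.parser module can collect the same links without manual string searching. The class name `LinkCollector` and the `sample` page are made up for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)

def get_all_links(page):
    """Return the href values of all <a> tags in the page source."""
    collector = LinkCollector()
    collector.feed(page)
    return collector.links

sample = '<html><body><a href="/about/">About</a> <p>no link</p> <a href="/archive/">Archive</a></body></html>'
print(get_all_links(sample))
```

The parser only reports tags named a, so other text that happens to contain "a href" is not picked up by mistake.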

The program in Python is as follows:

#Author :

def get_page(url):#This function returns the source of the webpage when a url is given. Note that this is not the real function; it is a stand-in that returns the saved source of certain url's, just to serve the purpose here
        if url == "":
            return  """<?xml version="1.0" encoding="utf-8" ?><?xml-stylesheet href="" type="text/css" media="screen" ?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" ""><html xmlns=""> <head> <title>xkcd: Python</title> <link rel="stylesheet" type="text/css" href="" media="screen" title="Default" /> <!--[if IE]><link rel="stylesheet" type="text/css" href="" media="screen" title="Default" /><![endif]--> <link rel="alternate" type="application/atom+xml" title="Atom 1.0" href="/atom.xml" /> <link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="/rss.xml" /> <link rel="icon" href="" type="image/x-icon" /> <link rel="shortcut icon" href="" type="image/x-icon" /> </head> <body> <div id="container"> <div id="topContainer"> <div id="topLeft" class="dialog"> <div class="hd"><div class="c"></div></div> <div class="bd"> <div class="c"> <div class="s">\t<ul> <li><a href=""">Archive</a><br /></li>\t <li><a href="">News/Blag</a><br /></li> <li><a href="">Store</a><br /></li> <li><a href="/about/">About</a><br /></li> <li><a href="">Forums</a><br /></li> </ul> </div> </div> </div> <div class="ft"><div class="c"></div></div> </div> <div id="topRight" class="dialog"> <div class="hd"><div class="c"></div></div> <div class="bd"> <div class="c"> <div class="s"> <div id="topRightContainer"> <div id="logo"> <a href="/"><img src="" alt=" logo" height="83" width="185"/></a> <h2><br />A webcomic of romance,<br/> sarcasm, math, and language.</h2> <div class="clearleft"></div> <br />XKCD updates every Monday, Wednesday, and Friday. 
</div> </div> </div> </div> </div> <div class="ft"><div class="c"></div></div> </div> </div> <div id="contentContainer"> <div id="middleContent" class="dialog"> <div class="hd"><div class="c"></div></div> <div class="bd"> <div class="c"> <div class="s"><h1>Python</h1><br/><br /><div class="menuCont"> <ul> <li><a href="/1/">|&lt;</a></li> <li><a href="/352/" accesskey="p">&lt; Prev</a></li> <li><a href="" id="rnd_btn_t">Random</a></li> <li><a href="/354/" accesskey="n">Next &gt;</a></li> <li><a href="/">&gt;|</a></li> </ul></div><br/><br/><img src="" title="I wrote 20 short programs in Python yesterday. It was wonderful. Perl, Im leaving you." alt="Python" /><br/><br/><div class="menuCont"> <ul> <li><a href="/1/">|&lt;</a></li> <li><a href="/352/" accesskey="p">&lt; Prev</a></li> <li><a href="" id="rnd_btn_b">Random</a></li> <li><a href="/354/" accesskey="n">Next &gt;</a></li> <li><a href="/">&gt;|</a></li> </ul></div><h3>Permanent link to this comic:</h3><h3>Image URL (for hotlinking/embedding):</h3><div id="transcript" style="display: none">[[ Guy 1 is talking to Guy 2, who is floating in the sky ]]Guy 1: You39;re flying! How?Guy 2: Python!Guy 2: I learned it last night! Everything is so simple!Guy 2: Hello world is just 39;print &quot;Hello, World!&quot; 39;Guy 1: I dunno... Dynamic typing? Whitespace?Guy 2: Come join us! Programming is fun again! It39;s a whole new world up here!Guy 1: But how are you flying?Guy 2: I just typed 39;import antigravity39;Guy 1: That39;s it?Guy 2: ...I also sampled everything in the medicine cabinet for comparison.Guy 2: But i think this is the python.{{ I wrote 20 short programs in Python yesterday. It was wonderful. Perl, I39;m leaving you. 
}}</div> </div> </div> </div> <div class="ft"><div class="c"></div></div> </div> <div id="middleFooter" class="dialog"> <div class="hd"><div class="c"></div></div> <div class="bd"> <div class="c"> <div class="s"> <img src="" width="520" height="100" alt="Selected Comics" usemap=" comicmap" /> <map name="comicmap"> <area shape="rect" coords="0,0,100,100" href="/150/" alt="Grownups" /> <area shape="rect" coords="104,0,204,100" href="/730/" alt="Circuit Diagram" /> <area shape="rect" coords="208,0,308,100" href="/162/" alt="Angular Momentum" /> <area shape="rect" coords="312,0,412,100" href="/688/" alt="Self-Description" /> <area shape="rect" coords="416,0,520,100" href="/556/" alt="Alternative Energy Revolution" /> </map><br/><br />Search comic titles and transcripts:<br /><script type="text/javascript" src="//"></script><script type="text/javascript"> google.load(\"search\", \"1\"); google.setOnLoadCallback(function() { \"012652707207066138651:zudjtuwe28q\", document.getElementById(\"q\"), \"cse-search-box\"); });</script><form action="//" id="cse-search-box"> <div> <input type="hidden" name="cx" value="012652707207066138651:zudjtuwe28q" /> <input type="hidden" name="ie" value="UTF-8" /> <input type="text" name="q" id="q" autocomplete="off" size="31" /> <input type="submit" name="sa" value="Search" /> </div></form><script type="text/javascript" src="//"></script><a href="/rss.xml">RSS Feed</a> - <a href="/atom.xml">Atom Feed</a><br /> <br/> <div id="comicLinks"> Comics I enjoy:<br/> <a href="">Dinosaur Comics</a>, <a href="">A Softer World</a>, <a href="">Perry Bible Fellowship</a>, <a href="">Copper</a>, <a href="">Questionable Content</a>, <a href="">Achewood</a>, <a href="">Wondermark</a>, <a href="">Indexed</a>, <a href="">Buttercup Festival</a> </div> <br/> Warning: this comic occasionally contains strong language (which may be unsuitable for children), unusual humor (which may be unsuitable for adults), and advanced mathematics (which may be unsuitable for 
liberal-arts majors).<br/> <br/> <h4>We did not invent the algorithm. The algorithm consistently finds Jesus. The algorithm killed Jeeves. <br />The algorithm is banned in China. The algorithm is from Jersey. The algorithm constantly finds Jesus.<br />This is not the algorithm. This is close.</h4><br/> <div class="line"></div> <br/> <div id="licenseText"> <!-- <a rel="license" href=""><img alt="Creative Commons License" style="border:none" src="" /></a><br/> --> This work is licensed under a <a rel="license" href="">Creative Commons Attribution-NonCommercial 2.5 License</a>.<!-- <rdf:RDF xmlns="" xmlns:dc="" xmlns:dcterms="" xmlns:rdf=" "><Work rdf:about=""><dc:creator>Randall Munroe</dc:creator><dcterms:rightsHolder>Randall Munroe</dcterms:rightsHolder><dc:type rdf:resource="" /><dc:source rdf:resource=""/><license rdf:resource="" /></Work><License rdf:about=""><permits rdf:resource="" /><permits rdf:resource="" /><requires rdf:resource="" /><requires rdf:resource="" /><prohibits rdf:resource="" /><permits rdf:resource="" /></License></rdf:RDF> --> <br/> This means you\"re free to copy and share these comics (but not to sell them). <a href="/license.html">More details</a>.<br/> </div> </div> </div> </div> <div class="ft"><div class="c"></div></div> </div> </div> </div> </body></html> """
        elif url == "":
            return  """<?xml version="1.0" encoding="utf-8" ?> <?xml-stylesheet href="" type="text/css" media="screen" ?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" ""> <html xmlns=""> <head> <title>xkcd: Not Enough Work</title> <link rel="stylesheet" type="text/css" href="" media="screen" title="Default" /> <!--[if IE]><link rel="stylesheet" type="text/css" href="" media="screen" title="Default" /><![endif]--> <link rel="alternate" type="application/atom+xml" title="Atom 1.0" href="/atom.xml" /> <link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="/rss.xml" /> <link rel="icon" href="" type="image/x-icon" /> <link rel="shortcut icon" href="" type="image/x-icon" /> </head> <body> <div id="container"> <div id="topContainer"> <div id="topLeft" class="dialog"> <div class="hd"><div class="c"></div></div> <div class="bd"> <div class="c"> <div class="s"> <ul> <li><a href="/archive/">Archive</a><br /></li> <li><a href="">News/Blag</a><br /></li> <li><a href="">Store</a><br /></li> <li><a href="/about/">About</a><br /></li> <li><a href="">Forums</a><br /></li> </ul> </div> </div> </div> <div class="ft"><div class="c"></div></div> </div> <div id="topRight" class="dialog"> <div class="hd"><div class="c"></div></div> <div class="bd"> <div class="c"> <div class="s"> <div id="topRightContainer"> <div id="logo"> <a href="/"><img src="" alt=" logo" height="83" width="185"/></a> <h2><br />A webcomic of romance,<br/> sarcasm, math, and language.</h2> <div class="clearleft"></div> XKCD updates every Monday, Wednesday, and Friday. <br /> Blag: Remember geohashing? <a href="">Something pretty cool</a> happened Sunday. 
</div> </div> </div> </div> </div> <div class="ft"><div class="c"></div></div> </div> </div> <div id="contentContainer"> <div id="middleContent" class="dialog"> <div class="hd"><div class="c"></div></div> <div class="bd"> <div class="c"> <div class="s"> <h1>Not Enough Work</h1><br/> <br /> <div class="menuCont"> <ul> <li><a href="/1/">|&lt;</a></li> <li><a href="/553/" accesskey="p">&lt; Prev</a></li> <li><a href="" id="rnd_btn_t">Random</a></li> <li><a href="/555/" accesskey="n">Next &gt;</a></li> <li><a href="/">&gt;|</a></li> </ul> </div> <br/> <br/> <img src="" title="It39;s even harder if you39;re an asshole who pronounces &lt;&gt; brackets." alt="Not Enough Work" /><br/> <br/> <div class="menuCont"> <ul> <li><a href="/1/">|&lt;</a></li> <li><a href="/553/" accesskey="p">&lt; Prev</a></li> <li><a href="" id="rnd_btn_b">Random</a></li> <li><a href="/555/" accesskey="n">Next &gt;</a></li> <li><a href="/">&gt;|</a></li> </ul> </div> <h3>Permanent link to this comic:</h3> <h3>Image URL (for hotlinking/embedding):</h3> <div id="transcript" style="display: none">Narration: Signs your coders don39;t have enough work to do: [[A man sitting at his workstation; a female co-worker behind him]] Man: I39;m almost up to my old typing speed in dvorak [[Two men standing by a server rack]] Man  1: Our servers now support gopher. Man  1: Just in case. [[A woman standing near her workstation speaking to a male co-worker]] Woman: Our pages are now HTML, XHTML-STRICT, and haiku-compliant Man: Haiku? Woman: &lt;div class=&quot;main&quot;&gt; Woman: &lt;span id=&quot;marquee&quot;&gt; Woman: Blog!&lt; span&gt;&lt; div&gt; [[A woman sitting at her workstation]] Woman: Hey! Have you guys seen this webcomic? 
{{title text: It39;s even harder if you39;re an asshole who pronounces &lt;&gt; brackets.}}</div> </div> </div> </div> <div class="ft"><div class="c"></div></div> </div> <div id="middleFooter" class="dialog"> <div class="hd"><div class="c"></div></div> <div class="bd"> <div class="c"> <div class="s"> <img src="" width="520" height="100" alt="Selected Comics" usemap=" comicmap" /> <map name="comicmap"> <area shape="rect" coords="0,0,100,100" href="/150/" alt="Grownups" /> <area shape="rect" coords="104,0,204,100" href="/730/" alt="Circuit Diagram" /> <area shape="rect" coords="208,0,308,100" href="/162/" alt="Angular Momentum" /> <area shape="rect" coords="312,0,412,100" href="/688/" alt="Self-Description" /> <area shape="rect" coords="416,0,520,100" href="/556/" alt="Alternative Energy Revolution" /> </map><br/><br /> Search comic titles and transcripts:<br /> <script type="text/javascript" src="//"></script> <script type="text/javascript"> google.load("search", "1"); "012652707207066138651:zudjtuwe28q", document.getElementById("q"), "cse-search-box"); }); </script> <form action="//" id="cse-search-box"> <div> <input type="hidden" name="cx" value="012652707207066138651:zudjtuwe28q" /> <input type="hidden" name="ie" value="UTF-8" /> <input type="text" name="q" id="q" autocomplete="off" size="31" /> <input type="submit" name="sa" value="Search" /> </div> </form> <script type="text/javascript" src="//"></script> <a href="/rss.xml">RSS Feed</a> - <a href="/atom.xml">Atom Feed</a> <br /> <br/> <div id="comicLinks"> Comics I enjoy:<br/> <a href="">Three Word Phrase</a>, <a href="">Oglaf</a> (nsfw), <a href="">SMBC</a>, <a href="">Dinosaur Comics</a>, <a href="">A Softer World</a>, <a href="">Buttersafe</a>, <a href="">Perry Bible Fellowship</a>, <a href="">Questionable Content</a>, <a href="">Buttercup Festival</a> </div> <br/> Warning: this comic occasionally contains strong language (which may be unsuitable for children), unusual humor (which may be unsuitable for 
adults), and advanced mathematics (which may be unsuitable for liberal-arts majors).<br/> <br/> <h4>We did not invent the algorithm. The algorithm consistently finds Jesus. The algorithm killed Jeeves. <br />The algorithm is banned in China. The algorithm is from Jersey. The algorithm constantly finds Jesus.<br />This is not the algorithm. This is close.</h4><br/> <div class="line"></div> <br/> <div id="licenseText"> <!-- <a rel="license" href=""><img alt="Creative Commons License" style="border:none" src="" /></a><br/> --> This work is licensed under a <a rel="license" href="">Creative Commons Attribution-NonCommercial 2.5 License</a>. <!-- <rdf:RDF xmlns="" xmlns:dc="" xmlns:dcterms="" xmlns:rdf=" "><Work rdf:about=""><dc:creator>Randall Munroe</dc:creator><dcterms:rightsHolder>Randall Munroe</dcterms:rightsHolder><dc:type rdf:resource="" /><dc:source rdf:resource=""/><license rdf:resource="" /></Work><License rdf:about=""><permits rdf:resource="" /><permits rdf:resource="" /><requires rdf:resource="" /><requires rdf:resource="" /><prohibits rdf:resource="" /><permits rdf:resource="" /></License></rdf:RDF> --> <br/> This means you"re free to copy and share these comics (but not to sell them). <a href="/license.html">More details</a>.<br/> </div> </div> </div> </div> <div class="ft"><div class="c"></div></div> </div> </div> </div> </body> </html> """
        return ""#For any other url, an empty page is returned
def union(a,b):#The union function merges the second list into the first, without duplicating an element that is already in a. Similar to the set union operator. This function does not change b. If a=[1,2,3] and b=[2,3,4], then union(a,b) makes a=[1,2,3,4] and leaves b=[2,3,4]
 for e in b:
  if e not in a:
   a.append(e)

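def get_next_url(page):
 start_link=page.find("a href")
 if(start_link==-1):#find returns -1 when there are no more links
  return None,0
 start_quote=page.find('"',start_link)#The url starts right after the first quote following "a href"
 end_quote=page.find('"',start_quote+1)#The url ends right before the next quote
 url=page[start_quote+1:end_quote]
 return url,end_quote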
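def get_all_links(page):
 links=[]
 while True:
  url,n=get_next_url(page)
  page=page[n:]#Continue with the rest of the page, from the end quote position
  if url:
   links.append(url)
  else:
   break
 return links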

def Crawl_web(seed):#The website to act as seed page is given as input
 tocrawl=[seed]
 crawled=[]
 while tocrawl:
  p=tocrawl.pop()
  if p not in crawled:#To avoid looping: if a page is already crawled and it is linked again by some other page we are crawling, we need not crawl it again
   union(tocrawl,get_all_links(get_page(p)))
   crawled.append(p)#As soon as a link is crawled it is appended to crawled. In the end, when all the links are over, crawled contains all the links we have so far
 return crawled #Returns the list of links
for x in Crawl_web(''):#printing all the links
 print(x)

The output when the above code is run is as shown below. Note that on the real world wide web, we can get so many links that our space may be exhausted before we have crawled the whole web. So it is better to stop at a certain depth or when a certain maximum number of links is reached. We can modify the Crawl_web function a little to achieve this, but I will leave that up to you, since it is simple and intuitive.
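For readers who want to compare notes, here is one possible sketch of such a limit, written for Python 3. It is only an illustration under some assumptions: the `toy_web` dictionary is a made-up link graph that replaces get_page and get_all_links, the parameter names `max_depth` and `max_pages` are hypothetical, and the pages are taken from the front of the list (breadth-first) rather than with pop() as above, so that the depth of a page is the smallest number of links needed to reach it from the seed.

```python
# Hypothetical link graph: each url maps to the urls found on that page.
toy_web = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["A", "D"],
    "D": ["E"],
    "E": [],
}

def crawl_web(seed, max_depth, max_pages):
    """Crawl breadth-first from seed, stopping at max_depth or max_pages."""
    crawled = []
    tocrawl = [(seed, 0)]  # pairs of (url, depth at which the url was found)
    while tocrawl and len(crawled) < max_pages:
        url, depth = tocrawl.pop(0)
        if url in crawled:
            continue  # already crawled through another link
        crawled.append(url)
        if depth < max_depth:  # links of pages at max_depth are not followed
            for link in toy_web.get(url, []):
                tocrawl.append((link, depth + 1))
    return crawled

print(crawl_web("A", max_depth=1, max_pages=10))
print(crawl_web("A", max_depth=5, max_pages=3))
```

Both calls stop after the pages A, B and C: the first because the links found on B and C are deeper than max_depth, the second because max_pages is reached.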

Sunday, March 4, 2012

Web Crawler (Part-2)

Today, we will extract all the links from a given webpage. The webpage is taken as one continuous string. For simplicity, I am using the source of a smaller webpage. Below is the Python code for it.


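def get_next_url(page):
 start_link=page.find("a href")#To find the position of the link, by finding the position of the "a href" tag
 if(start_link==-1):#To check if links are absent, since find returns -1 if it couldn't find the search string, in this case "a href"
  return None,0
 start_quote=page.find('"',start_link)#The url starts right after the first quote following "a href"
 end_quote=page.find('"',start_quote+1)#The url ends right before the next quote
 url=page[start_quote+1:end_quote]
 return url,end_quote
#Returning the end quote so that the next input contains the rest of the page, from the end quote position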

page='<html><head><title>My great website</title></head><body><p>Where the hell is every body. Why didn they <i>fall</i> on my site</p><a href="">Dilstories</a><a href="">Learn openCV</a><a href="">Learn Systems Programming</a></body></html>'

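while True:
       url,n=get_next_url(page)
       page=page[n:]#Continue with the rest of the page, from the end quote position
       if url:#If it's none, the loop breaks
          print(url)
       else:
          break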

First, we defined a procedure using def. The input to that function is the rest of the webpage that we still need to find links in. If no more links can be found, the procedure returns None and 0, which can be used to stop exploring with a break in the while loop of the main program. Finally, the output of the above programme is