The search engine is now complete, although I still have to improve it so that it responds to multi-word queries. For now, it responds only to single-word queries. I made the code available at https://github.com/dileep98490/A-simple-Search-Engine-in-Python along with a README file. Until the previous post, I had been using a custom get_page(url) module that returned the source of only a few specific webpages. I changed get_page(url) so that it returns the source of any URL you pass to it. Below is the modified get_page(url):
import urllib

def get_page(url):
    # This function just returns the webpage contents (the source of the
    # webpage) when a url is given. On any failure it returns "".
    try:
        f = urllib.urlopen(url)
        page = f.read()
        f.close()
        # print page
        return page
    except:
        return ""
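The function above uses urllib.urlopen, which exists only in Python 2. For readers on Python 3, here is a minimal sketch of an equivalent, assuming the same "return an empty string on failure" behaviour is wanted. The timeout parameter and the UTF-8 decoding are my assumptions and are not part of the original code.

```python
import urllib.error
import urllib.request


def get_page(url, timeout=10):
    """Return the source of `url` as text, or "" if it cannot be fetched."""
    try:
        # The context manager closes the response, like f.close() in the original.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Assumption: decode as UTF-8 and replace any bytes that do not fit.
            return response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, ValueError, OSError):
        # URLError/OSError cover network failures; ValueError covers malformed URLs.
        return ""
```

Catching specific exceptions instead of a bare except means a Ctrl+C during a long crawl still stops the program.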
I have also added an input mechanism so that any page can be given as the seed page. The query is taken as input too, along with the maximum number of links to be checked, which is entered as the depth. The output is shown below (the full source code is available in the GitHub repository I mentioned above).
Enter the seed page http://opencvuser.blogspot.com
Enter What you want to search is
Enter the depth you wanna go 5
Started crawling, presently at depth.. 4 3 2 1 0
Printing the results as is with page rank
http://opencvuser.blogspot.com --> 0.05
https://skydrive.live.com/redir.aspx?cid=124ec5b5bc117437&resid=124EC5B5BC117437!161&parid=124EC5B5BC117437!103&authkey=!AMnmS6xJcSrXSyg --> 0.0514285714286
After Sorting the results by page rank
1. https://skydrive.live.com/redir.aspx?cid=124ec5b5bc117437&resid=124EC5B5BC117437!161&parid=124EC5B5BC117437!103&authkey=!AMnmS6xJcSrXSyg
2. http://opencvuser.blogspot.com

The two results are first printed as found and then sorted by their page rank. I set the depth to 5, so the program crawled 5 links, and 2 of them contain our keyword "is".
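The final sorting step can be sketched as follows. This is only an illustration using the two page ranks from the run above. The name sort_by_rank is hypothetical, and the code in the repository may order its results differently.

```python
# Page ranks copied from the sample run above.
ranks = {
    "http://opencvuser.blogspot.com": 0.05,
    "https://skydrive.live.com/redir.aspx?cid=124ec5b5bc117437"
    "&resid=124EC5B5BC117437!161&parid=124EC5B5BC117437!103"
    "&authkey=!AMnmS6xJcSrXSyg": 0.0514285714286,
}


def sort_by_rank(ranks):
    """Return the URLs ordered from highest to lowest page rank."""
    return sorted(ranks, key=ranks.get, reverse=True)


# Print a numbered list, highest-ranked page first.
for position, url in enumerate(sort_by_rank(ranks), start=1):
    print("%d. %s" % (position, url))
```

Since 0.0514285714286 is greater than 0.05, the skydrive link comes first, which matches the sorted output shown above.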