We will be using Python to build the search engine.
A web page contains a lot of information: pictures, videos, buttons and so on. Crawling all of that directly would be a huge, tedious task, so what we will crawl is the source of the web document. The easiest way to see it is to right-click on the page you are viewing and choose View Source or View Page Source. This source is what the web server sends back when a browser requests the page; what we actually see on screen is its rendering by our web browser.
Our aim is to find all the links in the page source of a single web page. How we obtain the page source will be covered in later posts. For now, let us first write a program that extracts the first link appearing in the source; in the next part of this topic, we will extract all the links. The Python program that does this is as follows:
#!/usr/bin/python
a = '<html><head></head><body> <a href="http://opencvuser.blogspot.com">Learn Open CV</a><a href="http://dileepwrites.blogspot.com">Dileep Writes</a></body></html>'  # Storing the source in 'a'
start = a.find('a href="')  # The start of the a href tag
# start+8, considering no extra white spaces, will be the start of the URL
End = a.find('"', start + 8)  # End of URL - where the quotes end
print(a[start + 8:End])  # Printing the URL without quotes
In the variable a, we store a sample HTML source with two links. Links can be written in many ways, but we assume they appear in the most standard form, as <a href="...">...</a> tags.
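To make that assumption concrete, here is a small sketch that checks a few anchor tags for the exact text a href=" that the program relies on. The tags and the names variants, pattern and matches are made up for illustration; they are not taken from any real page.

```python
# Hypothetical anchor tags, invented for this illustration only.
variants = [
    '<a href="http://example.com">plain</a>',           # the form we assume
    "<a href='http://example.com'>single quotes</a>",   # single quotes around the URL
    '<a class="x" href="http://example.com">attr</a>',  # another attribute comes first
    '<a  href="http://example.com">two spaces</a>',     # extra white space
]

pattern = 'a href="'  # the exact text the program searches for

# True where the assumed pattern is present, False where it is not
matches = [pattern in v for v in variants]

for v, m in zip(variants, matches):
    print(m, v)
```

Only the first variant contains the pattern, so the simple approach in this post would miss the other three. That is acceptable for a first version, but worth keeping in mind.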
In order to extract the first link from a, we search it using Python's string find method. This method returns the location of the first occurrence of the substring we are searching for, as an offset from the beginning of the string (or -1 if the substring is not found). Its first argument is the string to search for, and the optional second argument is the location from which to start searching. So in start we store the location where the a of a href occurs. After that, we need to find the closing quote of the URL to know where it ends. The opening quote of the URL is at location start+7, so the URL itself begins at start+8; searching from there for the next quote gives us End, the end of the URL. Then we print the URL. The output is shown below:
http://opencvuser.blogspot.com
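To see how find and its optional second argument behave on their own, here is a tiny sketch. The string text and the variable names are just examples chosen for this illustration.

```python
text = 'x="1" y="2"'

opening = text.find('"')               # first quote, searching from the beginning
closing = text.find('"', opening + 1)  # next quote, searching just past the first one
missing = text.find('z')               # substring that does not occur

print(opening, closing, missing)       # prints: 2 4 -1
print(text[opening + 1:closing])       # prints: 1
```

This is the same pattern the program uses: find the opening position, then search again from just past it to find where the quoted value ends.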
Thus we have extracted the first link. In the next post I will show how to search for and extract all the URLs; it is just a small extension of this. Also, if any Python gurus are out there, I would love to see this code simplified further.
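One possible tidy-up, offered only as a sketch: wrap the same find-based steps in a function and handle the case where no link is found. The names first_link, marker, url_start and page are my own choices for this example, and the function still assumes links are written exactly as a href="...".

```python
def first_link(source):
    """Return the first URL that follows 'a href="' in source, or None."""
    marker = 'a href="'
    start = source.find(marker)
    if start == -1:
        return None  # no link written in the assumed form
    url_start = start + len(marker)  # same as start+8 in the original program
    end = source.find('"', url_start)
    if end == -1:
        return None  # the opening quote is never closed
    return source[url_start:end]


# Same sample source as in the post, split over several lines for readability
page = ('<html><head></head><body> '
        '<a href="http://opencvuser.blogspot.com">Learn Open CV</a>'
        '<a href="http://dileepwrites.blogspot.com">Dileep Writes</a>'
        '</body></html>')

print(first_link(page))                    # prints: http://opencvuser.blogspot.com
print(first_link('<p>no links here</p>'))  # prints: None
```

Using len(marker) instead of a hard-coded 8 keeps the offset correct if the search text is ever changed.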