Today, we will extract all the links from a given webpage. The webpage is treated as one continuous string. For simplicity, I am using the source of a small webpage. The Python code for it is given below.
The code first defines a procedure, get_next_url, using def. Its input is the rest of the webpage that we still need to search for links. It returns the next URL together with the position of its closing quote, so that the main loop can cut the page at that point and search the remainder. If no more links are found, the procedure returns None and 0, and the while loop in the main part of the programme uses this to stop exploring with a break. Running the programme prints the links it found, one per line.
#!/usr/bin/python

def get_next_url(page):
    # Find the position of the next link by searching for its "a href" attribute
    start_link = page.find("a href")
    # find returns -1 when the search string is absent, so no links remain
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    # Return the end quote position so the next input is the rest of the page
    return url, end_quote

page = '<html><head><title>My great website</title></head><body><p>Where the hell is every body. Why didn they <i>fall</i> on my site</p><a href="http://dileepwrites.wordpress.com">Dilstories</a><a href="http://opencvuser.blospot.com">Learn openCV</a><a href="http://systemspro.blogspot.com">Learn Systems Programming</a></body></html>'

while True:
    url, n = get_next_url(page)
    page = page[n:]
    if url:  # url is None when no links remain, so the loop breaks
        print(url)
    else:
        break
http://dileepwrites.wordpress.com
http://opencvuser.blospot.com
http://systemspro.blogspot.com
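If you would rather collect the links than print them, the same loop can be wrapped in a function that returns a list. This is only a minimal sketch: the helper name get_all_links and the example.com sample page are my own assumptions for illustration, and get_next_url is repeated here so the snippet runs on its own.

```python
def get_next_url(page):
    """Return (url, end_quote) for the first link in page, or (None, 0)."""
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    return page[start_quote + 1:end_quote], end_quote


def get_all_links(page):
    """Collect every URL in page into a list (hypothetical helper)."""
    links = []
    while True:
        url, end_pos = get_next_url(page)
        if url is None:
            break
        links.append(url)
        # Continue searching from the closing quote of the link just found
        page = page[end_pos:]
    return links


# Made-up sample page, used only for illustration
sample = '<a href="http://example.com/a">A</a><p>text</p><a href="http://example.com/b">B</a>'
print(get_all_links(sample))
```

Checking for None explicitly, instead of testing whether url is truthy, means that a link with an empty href does not end the loop early.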
You said we have to merge this with the previous part to make it work.
Do you mean we must write 📝 both parts in one Python file and save it, or can we write them in different files? If so, how can we merge the two? Thanks