A Simple Python Web Crawler That's Easy to Build on
A common project for a learning programmer is a web crawler. Crawlers can be very powerful tools, just ask Google. While the idea of a web crawler is simple in concept, implementing one is another story. Over the past few months I've had the urge to make my own mini Google-style crawler, so I figured I'd give it a go. After throwing together two very sloppy versions, I finally felt it was time to write one that would fit my needs down the road. So I set out to code a web crawler one last time, with the future in mind.
I knew that I wanted to use the BeautifulSoup module from www.crummy.com/software/BeautifulSoup/ because of how easy it makes parsing and how forgiving it is of malformed HTML. Creating a parser class with BeautifulSoup couldn't have been any easier. The parser pulls several different things out of the HTML: meta keywords, meta descriptions, internal links, external links, and emails (I must admit the email parsing function needs work). All of these are parsed when a new parser object is created, and they can then be read straight from the class-level variables; for example, parser.keywords returns a list of meta keywords from the HTML given to the parser.

The parser class also takes advantage of the urljoin function from Python's urlparse module. This function turned out to be VERY helpful when writing code to build full links from relative ones. Using string operations alone to turn a relative URL on a web page into a complete one can become a real mess very quickly.

The bulk of the code for this project was in the parser. Once that was out of the way, the rest pretty much fell into place. The Crawler class has two methods. One opens a web page and uses the parser class to gather the needed information; the other is meant to be overridden if the class is expanded upon. crawl_next() simply crawls the next page in the toCrawl list and returns True if there are more pages to be crawled. page_complete() is called when crawl_next() finishes and is passed the parser, so any actions you want to take on the parsed data can be done easily. This may sound confusing now, but the code is commented well and pretty much explains itself.

Using the crawler is simple. The code might look something like this if all you intend to do is crawl pages and discard the data afterwards:
if __name__ == '__main__':
    c = Crawler('http://www.esux.net')      #Make a crawler object
    while c.crawl_next() == True:           #While there is more to be crawled,
        c.crawl_next()                      #keep crawling.
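If you want to keep the data rather than discard it, the intended route is to subclass Crawler and override page_complete(), which receives the parser for every page. The sketch below is only my own illustration built on the classes listed further down; the KeywordCrawler name and site_keywords dictionary are made up for the example.

from Crawler import Crawler

class KeywordCrawler(Crawler):
    """Example subclass that keeps the meta keywords of every crawled page."""

    def __init__(self, start_url):
        Crawler.__init__(self, start_url)
        self.site_keywords = {}              #maps each crawled url to its list of keywords

    def page_complete(self, parser, error=False):
        if error:                            #crawl_next() passes an empty parser when a page fails to load
            return
        self.site_keywords[parser.base_url] = parser.keywords

if __name__ == '__main__':
    c = KeywordCrawler('http://www.esux.net')
    while c.crawl_next() == True:            #crawl_next() does the work in the loop condition
        pass
    print c.site_keywords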
That's about it. Download links are provided below, as well as the text in HTML format. I look forward to hearing your comments!

Crawler.py: http://www.esux.net/downloads/Crawler.txt
Parser.py: http://www.esux.net/downloads/Parser.txt
BeautifulSoup.py: http://www.esux.net/downloads/BeautifulSoup.txt
# Name:         Parser.py
# Version:      1.0
# Author:       Matthew Zizzi
# Contact:      mhzizzi AT gmail.com
# Last Updated: Dec 2009
#
# A class for parsing a web page as needed by a web crawler
#
# Copyright (c) 2009, Matthew Zizzi
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
#
# THIS CODE IS FREE TO USE. HOWEVER, IF IT IS TO BE USED COMMERCIALLY YOU MUST
# GET PERMISSION FROM THE AUTHOR. WHEN REDISTRIBUTING OR USING THIS CODE IN
# YOUR OWN APPS YOU MUST GIVE CREDIT TO THE AUTHOR. USERS ARE FREE TO MODIFY
# AND REDISTRIBUTE THIS CODE SO LONG AS CREDIT IS GIVEN TO THE ORIGINAL AUTHOR.
#
from BeautifulSoup import BeautifulSoup
from urlparse import urljoin
import re


class Parser():

    def __init__(self, _base_url, _html):
        self.base_url = _base_url
        self.html = _html
        self.description = ''
        self.keywords = []
        self.i_links = []
        self.e_links = []
        self.emails = []
        self.parse_links()
        self.parse_meta()
        self.parse_emails()

    def parse_links(self):
        """Parse all links and add them to either self.e_links or self.i_links
        depending on their host"""
        #Find all a tags
        soup = BeautifulSoup(self.html.lower())
        aTags = soup.findAll('a')
        for tag in aTags:
            #Only look at a tags with the href attribute
            if tag.has_key('href'):
                #Make a complete link
                link = urljoin(self.base_url, tag['href'])
                #Add link to the appropriate list
                if self.isInternalLink(self.base_url, link) and not link in self.i_links:
                    self.i_links.append(link)
                elif link not in self.e_links:
                    self.e_links.append(link)

    def parse_meta(self):
        """Parse the page description and keywords from meta tags if present"""
        soup = BeautifulSoup(self.html.lower())
        #Parse meta descriptions
        metaDescriptions = soup.findAll('meta', {'name': 'description'})
        description = ''
        for tag in metaDescriptions:
            if tag.has_key('content'):
                description += tag['content']
        self.description = description
        #Parse meta keywords
        metaKeywords = soup.findAll('meta', {'name': 'keywords'})
        for tag in metaKeywords:
            if tag.has_key('content'):
                keywords = tag['content'].split(',')
                for keyword in keywords:
                    self.keywords.append(keyword.strip())

    def parse_emails(self):
        """By no means a perfect regex.... Oh well."""
        r = re.compile('[a-zA-Z0-9_\-\.]+@[0-9a-zA-Z]+\.[a-zA-Z]{1,4}')
        results = r.findall(self.html)
        self.emails += results

    def stripHttp(self, url):
        """Gets rid of http:// and www. at the start of a string if present"""
        if url.startswith('http://'):
            url = url[7:]
        elif url.startswith('https://'):
            url = url[8:]
        if url.startswith('www.'):
            url = url[4:]
        elif url.startswith('www2.'):
            url = url[5:]
        return url

    def getDomain(self, url):
        """Returns the domain from a given url"""
        url = self.stripHttp(url)
        if url.find('/') > 0:
            url = url[:url.find('/')]
        return url

    def isInternalLink(self, base_url, url2):
        """Grabs the domain from two urls and returns true if both urls have
        the same domain"""
        domain1 = self.getDomain(base_url)
        domain2 = self.getDomain(url2)
        return domain1 == domain2
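To get a feel for what the parser extracts before wiring it into the crawler, here is a quick sketch of using the Parser class on its own. The HTML snippet is made up purely for illustration; the printed values are whatever the parser finds (note that it lowercases the HTML before parsing).

from Parser import Parser

sample_html = """<html><head>
<meta name="keywords" content="python, crawler, example">
<meta name="description" content="A made-up page for testing the parser.">
</head><body>
<a href="/about">About</a>
<a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>
Contact: someone@example.com
</body></html>"""

parser = Parser('http://www.esux.net', sample_html)
print parser.keywords       #meta keywords as a list of strings
print parser.description    #meta description text
print parser.i_links        #links on the same domain, made absolute with urljoin
print parser.e_links        #links pointing at other domains
print parser.emails         #anything the (admittedly rough) regex recognised as an email address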
# Name:         Crawler.py
# Version:      1.0
# Author:       Matthew Zizzi
# Contact:      mhzizzi AT gmail.com
# Last Updated: Dec 2009
#
# An expandable class for crawling web sites.
#
# Copyright (c) 2009, Matthew Zizzi
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
#
# THIS CODE IS FREE TO USE. HOWEVER, IF IT IS TO BE USED COMMERCIALLY YOU MUST
# GET PERMISSION FROM THE AUTHOR. WHEN REDISTRIBUTING OR USING THIS CODE IN
# YOUR OWN APPS YOU MUST GIVE CREDIT TO THE AUTHOR. USERS ARE FREE TO MODIFY
# AND REDISTRIBUTE THIS CODE SO LONG AS CREDIT IS GIVEN TO THE ORIGINAL AUTHOR.
#
from Parser import Parser
import urllib2


class Crawler():

    def __init__(self, _start_url):
        self.toCrawl = []                    #urls that will be followed
        self.toCrawl.append(_start_url)      #add the starting url to the list of urls that need to be crawled
        self.completed = []                  #urls the crawler has already crawled
        self.errors = []                     #any urls that cause an error during parsing / crawling
        self.craw_external_links = True      #determines if the crawler should crawl outbound links

    def crawl_next(self):
        """This is the heart of the crawler, it crawls and adds new links to the queue
        @return - True if there are more links to be crawled, false otherwise"""
        if len(self.toCrawl) > 0:
            current_url = self.toCrawl[0]    #the current url being crawled
            self.toCrawl.pop(0)              #delete the current url from the links that need to be crawled
            #Attempt to crawl the url. If there is an exception, add the url to self.errors,
            #otherwise add the link to self.completed. Then call the self.page_complete method
            try:
                html = urllib2.urlopen(current_url).read()   #get html from url
                parser = Parser(current_url, html)           #create a new parser object
                self.completed.append(current_url)           #add current_url to self.completed
                #add internal links that have not been crawled to self.toCrawl
                for link in parser.i_links:
                    if not link in self.completed \
                            and not link in self.toCrawl \
                            and not link in self.errors:
                        self.toCrawl.append(link)
                #if craw_external_links == True, add all external links that have not been crawled
                if self.craw_external_links == True:
                    for link in parser.e_links:
                        if not link in self.completed \
                                and not link in self.toCrawl \
                                and not link in self.errors:
                            self.toCrawl.append(link)
                self.page_complete(parser)                   #call page_complete()
            except:
                self.errors.append(current_url)              #error somewhere, add current_url to self.errors
                print 'Error loading: ' + current_url
                self.page_complete(Parser('', ''), error=True)   #call page_complete() with the error flag set
            return True
        else:
            return False

    def page_complete(self, parser, error=False):
        """Meant to be overridden by extending this class. This is where you would
        use any of the data gathered by the crawler."""
        if len(self.toCrawl) > 0:
            print 'Next URL: ' + self.toCrawl[0]


if __name__ == '__main__':
    c = Crawler('http://www.esux.net')
    while c.crawl_next() == True:
        c.crawl_next()
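One knob worth pointing out: the craw_external_links flag set in __init__ decides whether outbound links are followed. A rough sketch of a run that stays on the starting site and then reports how it went might look like this (again, just my own illustration built on the class above):

from Crawler import Crawler

if __name__ == '__main__':
    c = Crawler('http://www.esux.net')
    c.craw_external_links = False            #only follow links on the starting domain
    while c.crawl_next() == True:
        pass
    print 'Crawled %d pages' % len(c.completed)
    print 'Hit errors on %d pages' % len(c.errors)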