Using BeautifulSoup to Scrape Websites

Torrey Betts / Wednesday, March 30, 2016

Introduction

Beautiful Soup is a powerful Python library for extracting data from XML and HTML files. It parses the often-confusing XML/HTML structure into an easily traversed Python object, so with only a few lines of code you can extract information from most websites or files. This blog post will barely scratch the surface of what's possible with BeautifulSoup; be sure to visit the reference links at the bottom of this post to learn more.

Installing BeautifulSoup

If you're using a Debian-based distribution of Linux, BeautifulSoup can be installed by executing the following command.

```
$ apt-get install python-bs4
```

If you're unable to use the Debian system package manager, you can install BeautifulSoup using easy_install or pip.

```
$ easy_install beautifulsoup4
$ pip install beautifulsoup4
```

If you can't install it using any of the preceding methods, it's possible to use the source tarball and install with setup.py.

```
$ python setup.py install
```

To learn more about installing, or about any possible errors that could occur, visit the BeautifulSoup site.

Your First Soup Object

The soup object is the most used object in the BeautifulSoup library, as it houses the entire HTML/XML structure that you'll query information from. Creating this object requires two lines of code.

```python
html = urlopen("http://ko.infragistics.com")
soup = BeautifulSoup(html.read(), 'html.parser')
```

Taking this one step further, we'll use the soup object to print out the page's H1 tag.

```python
from urllib import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://ko.infragistics.com")
soup = BeautifulSoup(html.read(), 'html.parser')
print soup.h1.get_text()
```

Outputs:

```
Experience Matters
```

Querying the Soup Object

BeautifulSoup has multiple ways to navigate or query the document structure.
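The same two-line pattern works on any markup source, not just a live URL. Here's a self-contained sketch that feeds BeautifulSoup an inline HTML string instead of a urlopen response; the markup itself is invented for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in for the markup a real request would return (invented for illustration).
html_doc = """
<html>
  <head><title>Example Page</title></head>
  <body><h1>Experience Matters</h1></body>
</html>
"""

# Same constructor call as with urlopen(...).read(); any string of markup works.
soup = BeautifulSoup(html_doc, 'html.parser')

# Tag attributes like soup.h1 resolve to the first matching tag in the document.
print(soup.h1.get_text())
```

Parsing a saved string like this is also a handy way to experiment with queries without hitting a site repeatedly.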
- find(tag, attributes, recursive, text, keywords)
- findAll(tag, attributes, recursive, text, limit, keywords)
- navigation using tags

find Method

This method looks through the document and retrieves the first single item that matches the provided filters. If the method can't find what you've searched for, None is returned. One example would be searching for the title of the page.

```python
page_title = soup.find("title")
```

The page_title variable now contains the page title wrapped in its title tag. Another example would be searching the page for a specific tag id.

```python
element_result = soup.find(id="theid")
```

The element_result variable now contains the HTML element that matched the query for the id "theid".

findAll Method

This method looks through the tag's descendants and retrieves all descendants that match the provided filters. If the method can't find what you've searched for, an empty list is returned. The simplest example would be searching for all hyperlinks on a page.

```python
results = soup.findAll("a")
```

The variable results now contains a list of all hyperlinks found on the page. Another example might be finding all hyperlinks on a page that use a specific class name.

```python
results = soup.findAll("a", "highlighted")
```

The variable results now contains a list of all hyperlinks found on the page that reference the class name "highlighted". Searching for tags along with their id is very similar and can be done in multiple ways; below I'll demonstrate 2 different ways.

```python
results = soup.findAll("a", id="theid")
results = soup.findAll(id="theid")
```

Navigation Using Tags

To understand how navigation using tags works, imagine that the HTML structure is mapped like a tree.

```
html -> head -> title
                meta
                link
                script
        body -> h1
                div.content
and so on...
```

Using this reference along with a page's source, if we wanted to print the page title, the code would look like this.
```python
print soup.head.title
```

Outputs:

```
<title>Developer Controls and Design Tools - .Net Components & Controls</title>
```

Scraping a Website

Using what was learned in the previous section, we're now going to apply that knowledge to scraping the definition from an Urban Dictionary page. The Python script looks for comma-separated command line arguments naming the words to define. When scraping the definition from the page, we use BeautifulSoup to search the page for a div tag that has the class name "meaning".

```python
import sys, getopt
from urllib import urlopen
from bs4 import BeautifulSoup

def main(argv):
    words = []
    rootUrl = 'http://www.urbandictionary.com/define.php?term='
    usageText = sys.argv[0] + ' -w <word1>,<word2>,<word3>.....'
    try:
        if (len(argv) == 0):
            print usageText
            sys.exit(2)
        opts, args = getopt.getopt(argv, "w:v")
    except getopt.GetoptError:
        print usageText
        sys.exit(2)
    for opt, arg in opts:
        if opt == "-w":
            words = set(arg.split(","))
    for word in words:
        wordUrl = rootUrl + word
        html = urlopen(wordUrl)
        soup = BeautifulSoup(html.read(), 'html.parser')
        meaning = soup.findAll("div", "meaning")
        print word + " -- " + meaning[0].get_text().replace("\n", "")

if __name__ == "__main__":
    main(sys.argv[1:])
```

Outputs:

```
$ python urbandict.py -w programming
programming -- The art of turning caffeine into Error Messages.
```

References

The reference links below are related to this blog post. If you're interested in more information about using BeautifulSoup, a great resource is the Web Scraping with Python book.

- BeautifulSoup: Installing BeautifulSoup, Kinds of Objects, find, findAll
- easy_install: Installing easy_install
- pip: Installing pip
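The fetch-parse-findAll pattern used in the script can be exercised without any network access by parsing a canned HTML fragment. The sketch below mirrors the "meaning" div lookup from the article, but the markup and the "highlighted" class are invented stand-ins; the real page's structure may differ.

```python
from bs4 import BeautifulSoup

# Invented stand-in for a definition page (for illustration only).
page = """
<html><body>
  <div class="meaning">The art of turning caffeine into Error Messages.</div>
  <div class="meaning">A second definition.</div>
  <a class="highlighted" href="#one">one</a>
  <a href="#two">two</a>
</body></html>
"""

soup = BeautifulSoup(page, 'html.parser')

# findAll with a tag plus class name, exactly as in the script above;
# the first match is the definition we'd print.
meaning = soup.findAll("div", "meaning")
print(meaning[0].get_text())

# findAll on a tag alone returns every match; adding a class narrows the list.
all_links = soup.findAll("a")
highlighted = soup.findAll("a", "highlighted")
print(len(all_links), len(highlighted))
```

Swapping the canned string for `urlopen(wordUrl).read()` recovers the live version from the script, so the query logic can be tested before ever touching the network.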