PyQuery Tutorial: Basic HTML Parsing with PyQuery

As Python is my programming language of choice when it comes to getting things done quickly, I need a dead simple XML parser that gets me the data I want and gets the hell out of the way.

Enter PyQuery

PyQuery, as you may have guessed, is a Python port of the extremely popular jQuery JavaScript library. Anyone even remotely experienced with jQuery knows how easy it is to select any element you wish from the DOM. Once you move away from JavaScript, many XML parsers become extremely verbose. PyQuery helps us keep things simple and extract the data we want without wasting any time.

Using PyQuery for Basic Parsing

PyQuery includes many of the jQuery DOM manipulation methods. For this tutorial, we'll just deal with retrieving data from HTML. Once you have loaded the HTML into a PyQuery object, you can instantly apply your knowledge of jQuery and append(), remove(), or whatever you need.

The Setup

This guide uses Python 2.6. If you don't have virtualenv, grab it now. We'll use it in a minute to install the PyQuery package. Now create a project directory for this tutorial. I'll call it pyquery_tutorial:
$ mkdir pyquery_tutorial
$ cd pyquery_tutorial
Now create the virtual environment with your Python version of choice. (I have only tested this with 2.6 and 2.7.)
$ virtualenv env --python=python2.6
Running virtualenv with interpreter /usr/bin/python2.6
New python executable in env/bin/python2.6
Also creating executable in env/bin/python
Installing distribute.................................................................................................................................................................................done.
Now activate the virtualenv. (You should see (env) beside your prompt if done correctly.)
$ . env/bin/activate
Now we install the PyQuery package.
(env) $ pip install pyquery
Successfully installed lxml pyquery
Cleaning up...
Woohoo, PyQuery is now ready for use!

Using PyQuery

Using PyQuery for parsing will feel extremely similar to using jQuery. One of the only differences is how you initialize the jQuery object. First, create this HTML file called "index.html" in the project directory:
<!DOCTYPE html>
<html>
<head>
    <title>PyQuery Test!</title>
</head>
<body>
  <h1>PyQuery is AWESOME!</h1>
  <p><a href="">PyQuery</a> is a Python port of the famous <a href="">jQuery</a> JavaScript library.</p>
  <h2>What is it Good For?</h2>
  <ul id="pitch">
    <li>It makes parsing files a <strong>SNAP</strong>!</li>
    <li>DOM Manipulation is EASY!</li>
    <li>You <em>never</em> have to worry about confusing syntax</li>
  </ul>
</body>
</html>
Now fire up Python. (Make sure your virtualenv is still activated!)
$ python
Python 2.6.6 (r266:84292, Mar 25 2011, 19:36:32) 
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
First we import PyQuery from the pyquery package.
>>> from pyquery import PyQuery
Now let's read in our index.html file and store it to a string.
>>> html = open("index.html", 'r').read()
Now we instantiate a PyQuery object, passing in our html string. To keep things looking familiar, let the instantiating object be named jQuery!
>>> jQuery = PyQuery(html)
Now we can traverse this document using the selectors we've grown to love through CSS and jQuery. It might look strange that we're assigning jQuery to something, but at this point, we use this jQuery variable JUST like we use $ in our JavaScript. For example, let's get the title tag.
>>> jQuery("title").text()
'PyQuery Test!'
jQuery developers who have written their own plugins may already be comfortable using jQuery in place of $ in their JS. Let's mess around with PyQuery some more.
>>> jQuery("li").eq(1).text()
'DOM Manipulation is EASY!'
>>> jQuery("a") # The 'jQuery Object' we're used to is now a list
[<a>, <a>]
>>> for x in jQuery("a"): # We can do for-loops as normal in Python
...     print jQuery(x).text()
...
PyQuery
jQuery
Get the HTML of the first li element.
>>> jQuery("ul").children().eq(0).html()
u'It makes parsing files a <strong>SNAP</strong>!'

Remote Files

Wanna parse a remote file? No problem!
>>> jQuery = PyQuery(url="")
>>> jQuery("title").text()
"Web Design that Doesn't Suck | Vert Studios | Tyler, Texas"


That should give you a nice kickstart with PyQuery. Your knowledge of jQuery, coupled with the PyQuery API, gives you plenty of power to parse XML/HTML documents.

September 20, 2011
About the Author:

Joseph is the lead developer of Vert Studios. Follow Joseph on Twitter: @Joe_Query