November 2015 archive

Parsing XML and HTML using xpath and lxml in Python

For the last few years my life has been full of processing HTML and XML using the lxml library for Python and the xpath query language. xpath is a query language designed specifically to search XML, unlike regular expressions, which should definitely not be used to process XML-related languages. Typically this work has involved a lot of searching through my own code to remind me how to do things. This blog post captures some handy snippets to avoid the inevitable Googling, and to solidify for me exactly what I’ve been doing for the last few years!

But what does it {xml, html} look like?

xml and html are made up of “elements”, delimited by pointy brackets, which can carry attributes with values:

<element1 attribute1="thing">content</element1>

Elements can be nested inside other elements to make a tree structure. A wrinkle to be aware of is the so-called “tail” of an element. This is most often seen with <br/> tags (though I believe it applies to elements in general):

<element1 attribute1="thing">content<br/>tail</element1>

The “content” is accessed using .text, whilst the tail is accessed using .tail.
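As a minimal sketch of how this looks from lxml, using the made-up fragment above:

from lxml import etree

fragment = etree.fromstring('<element1 attribute1="thing">content<br/>tail</element1>')
br = fragment[0]      # the <br/> child element
print(fragment.text)  # prints "content"
print(br.tail)        # prints "tail"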

Web pages are made from HTML which is a “relaxed” XML format. XML is the basis of many other file formats found in the wild (such as GPX and GML). Dealing with XML is very similar to dealing with HTML except for namespaces, which I discuss in more detail at the end of this post.

XPath Helper

Before I get onto xpath I should introduce XPath Helper, a plugin for Google Chrome which helps you develop xpath queries.

You can find XPath Helper in the Chrome Store; it is free. I use it in combination with the Google Chrome developer tools, particularly the “Inspect Element” functionality. XPath Helper allows you to see the results of an xpath query live. You open up the XPath Helper console (Ctrl+Shift+X), type in your xpath, and you see the results both in the console and as highlighting on the page.

XPath Helper can also generate xpath queries for you automatically; however, typically I have used these just as inspiration, since they tend to be rather long and “brittle”.

Loading up the data

My Python scripts nearly always start with the following imports:

import lxml.html
import requests
import requests_cache
requests_cache.install_cache('demo_cache')

requests and requests_cache are used to access data on the web, and lxml.html parses the HTML. Then I can get a webpage using:

r = requests.get(url)
root = lxml.html.fromstring(r.content)

You might want to make any URLs in the page absolute rather than relative (base_url is typically the URL the page was fetched from):

root.make_links_absolute(base_url)

If I’m dealing with XML rather than HTML then I might do:

from lxml import etree

And then, when it comes to loading a local XML file:

with open(input_file, "rb") as f:
	root = etree.XML(f.read())

XPath queries

With your root element in hand you can now get on with querying. Xpath queries are designed to extract a set of elements or attributes from an XML/HTML document by the name of an element, by the value of an attribute on an element, by the relationship an element has with another element, or by the content of an element.

Quite often xpath will return elements or lists of elements which, when printed in Python, don’t show you the content you want to see. To get the text content of an element you need to use .text, text_content(), or .tail, and make sure you ask for an element of the list rather than the whole list.

The following examples show the key features of xpath. I’m using this blog (http://www.ianhopkinson.org.uk/) as an example website so you can play along with xpath:

Specifying a complete path with / as separator

title = root.xpath('/html/body/div/div/div[2]/h1')

is the full path to my blog title. Notice how we request the 2nd element of the third set of div elements using div[2] – xpath arrays are one-based, not zero-based.

Specifying a path with wildcards using //

This expression also finds the title but the preamble of /html/body/div/div is absorbed by the // wildcard match:

title = root.xpath('//div[2]/h1')

To obtain the text of the title in Python, rather than an element object, we would do:

title_text = title[0].text.strip()

or maybe:

title_text = title[0].text_content().strip()

text_content() would pick up any tail content, and any text in child elements. I use strip() here to remove leading and trailing whitespace.
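A small illustration of the difference, using a made-up fragment with a tail:

import lxml.html

h1 = lxml.html.fromstring('<h1>SomeBeans<br/>blog</h1>')
print(h1.text)            # "SomeBeans" - .text stops at the first child element
print(h1.text_content())  # "SomeBeansblog" - picks up the tail of the <br/> too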

Selecting attribute values

We’ve seen that //element selects all of the elements of type “element”. We select attribute values like this:

ids = root.xpath('//li/@id')

which selects the id attribute from the list elements (li) on my blog.
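Note that selecting attributes returns plain strings rather than element objects, so there is no need for .text here. A minimal, made-up illustration:

import lxml.html

snippet = lxml.html.fromstring('<ul><li id="first">one</li><li id="second">two</li></ul>')
print(snippet.xpath('//li/@id'))  # ['first', 'second'] - strings, not elements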

Specifying an element by attribute

We can select elements which have particular attribute values:

tagcloud = root.xpath('//*[@class="tagcloud"]')

this selects the tag cloud on my blog by selecting elements which have the class attribute “tagcloud”.

Select an element containing some specified text

We can do something similar with the text content of an element:

title = root.xpath('//h1[contains(., "SomeBeans")]')

This selects h1 elements which contain the text “SomeBeans”.

Select via a parent or sibling relationship

Sometimes we want to select elements by their relationship to another element, for example:

subtitle = root.xpath('//h1[contains(@class,"header_title")]/../h2')

this selects the h1 title of my blog (SomeBeans) then navigates to the parent with .. and selects the sibling h2 element (the subtitle “the makings of a small casserole”).

The same effect can be achieved with the following-sibling axis:

subtitle = root.xpath('//h1[contains(@class,"header_title")]/following-sibling::h2')

XML Namespaces

When dealing with XML, we need to worry about namespaces. In principle the elements of an XML document are described in a schema which can be looked up and is universally unique. In practice the use of namespaces in XML documents can lead to much banging of head against wall! This is largely because trivial examples of XML wrangling don’t use namespaces, except as a “special” case.

Here is a fragment of XML defining two namespaces:

<foo:Results xmlns:foo="http://www.foo.com" xmlns="http://www.bah.com">

xmlns:foo defines a namespace whose short form is “foo”. We select elements in this namespace by passing a namespaces parameter to the xpath query:

records = root.xpath('//foo:Title', namespaces = {"foo": "http://www.foo.com"})

The “catch” here is that we also define a default namespace, xmlns="http://www.bah.com", which means that elements which don’t have a prefix cannot be selected unless we define the namespace in our xpath:

records = root.xpath('//bah:Title', namespaces = {"bah": "http://www.bah.com"})

Worse than that, we need to include a namespace prefix in the query even though no prefix appears in the file!
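Putting this together, here is a small self-contained sketch; the Title elements are invented for illustration:

from lxml import etree

xml = ('<foo:Results xmlns:foo="http://www.foo.com" xmlns="http://www.bah.com">'
       '<foo:Title>prefixed</foo:Title>'
       '<Title>in the default namespace</Title>'
       '</foo:Results>')
root = etree.fromstring(xml)

print(root.xpath('//foo:Title', namespaces={"foo": "http://www.foo.com"})[0].text)
print(root.xpath('//bah:Title', namespaces={"bah": "http://www.bah.com"})[0].text)
print(root.xpath('//Title'))  # [] - an unprefixed query matches neither element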

Conclusion

These snippets cover the majority of the xpath queries I’ve needed over the past few years; I’ll add any others as I find them. I’ve put all the code used here in a GitHub gist.

Xpath is the right tool for the job of extracting information from XML documents, including HTML – do not accept inferior alternatives!

Book Review: Canals: The Making of a Nation by Liz McIvor

Canals: The Making of a Nation by Liz McIvor is a tie-in with a BBC series of the same name, presented by the author. It is about canals in England from the mid-18th century through to the present day, although most of the action takes place before the end of the 19th century.

The chapters of the book match the episodes of the series which are thematic, rather than chronological. Each chapter introduces a different topic, loosely tied to a particular canal.

The book starts with a discussion of the growth of London, and the Grand Junction canal linking it to Birmingham. The guild system was a factor in limiting the growth of the capital until the mid-18th century. The “Bubble Act” of 1720, enacted in the aftermath of the South Sea Bubble, likely also had an impact: it prevented the formation of any joint stock company without an act of parliament to approve it. It was repealed in 1825, before the railways saw their enormous growth. The Grand Junction canal was built as Birmingham became a manufacturing hub and London a great city with many requirements for daily life, and also a showroom for at least the United Kingdom, if not the world.

I was chastened to discover that the Bridgewater canal, one of the earliest of the “canal boom” projects of the 18th century, is only just up the road from me in Chester. I’d always assumed it was close to the town of a very similar name in Somerset! The canal is named for the Duke of Bridgewater, Francis Egerton, for whom the pub just over the road from me is presumably also named. The Bridgewater canal was built around 1760, linking the Duke’s coal mines at Worsley to Manchester. With this revelation I realise that the Bridgewater canal and the Liverpool to Manchester railway, the first exclusively steam railway, are sited very close to each other.

Support for manufacture was the theme of canal building in the North of England, and also around Birmingham, with canals built to move bulky raw materials to factories placed to benefit from hydraulic power and from climates benevolent for the processing of materials such as cotton. Manufacturers such as Josiah Wedgwood were keen to see their fragile wares safely make the outward journey to the showrooms of London.

The Kennet and Avon canal was built to provide navigable water access from Bristol to London. William Smith, who produced the first geological map of Great Britain, is introduced in this chapter; I read more about him in The Map that Changed the World by Simon Winchester. The digging of canal cuts and tunnels revealed the local geology. Nowadays we see canals as bucolic thoroughfares, but when they were built they were raw cuts indicating industrialisation.

The Manchester Ship Canal was opened in 1894 to bypass the port of Liverpool; these were the dying days of canal building. 154 men died in its construction and 1,404 were seriously injured, from a workforce of 16,361. For comparison, projects such as the 2012 London Olympics and the close-to-completion Crossrail project are of similar scale yet have casualty numbers hovering around zero, although these are best-in-class projects for health and safety. In this chapter McIvor talks more of the Irish “navigators” who built the canals, and something of the early trade union movement.

The families that worked the canals were seen as outsiders; once the long networks were set up they led an itinerant lifestyle, with no fixed church or school for their children. The Victorian moralists arguing for improved conditions for the boat families seem to have done so mainly by pointing out how bloody awful they were!

It’s interesting to see the likes of Thomas Telford and John Rennie cropping up repeatedly in this book. They have the air of rockstar engineers, not a niche found these days. Perhaps this is a result of the work of the Victorian writer, Samuel Smiles, who was very keen on self-improvement and wrote biographies of these men to promote his ideas.

To me the book lacks a little prehistory: the great boom for canal building in the UK came at the end of the 18th century, but the very first “pound lock” in England was built in 1566 on the Exeter canal. What went on between these two dates? And what was happening elsewhere in the world? Perhaps the answer is that the canals in Britain never represented a technological revolution; they were always about the social and commercial climate being right.

Canals: The Making of a Nation is an unchallenging read, well-suited to a holiday. If you’re on a canal boat it won’t tell you much about the particular bridges and tunnels you pass, but it will give you a strong feeling for the lives of the people who built and used the canals, and why they were built in the first place.