Dec 17 2015

Book review: The Invention of Nature by Andrea Wulf

By SomeBeans in Book Reviews

The Invention of Nature by Andrea Wulf is subtitled The Adventures of Alexander von Humboldt – this is his biography.

Alexander von Humboldt was born in Berlin in 1769 and died in 1859, the year in which On the Origin of Species was published. He was a naturalist of a Romantic tendency, born into an aristocratic family, which gave him access to the Prussian court.

He made a four year journey to South America in 1800 which he reported (in part) in his book Personal Narrative, which was highly influential – inspiring Charles Darwin amongst many others. On this South American trip he made a huge number of observations across the natural and social sciences and was sought after by the newly formed US government as the Spanish colonies started to gain independence. Humboldt was a bit of a revolutionary at heart, looking for the liberation of countries, and also of slaves. This was one of his bones of contention with his American friends.

His key scientific insight was to see nature as an interconnected web, a system, rather than a menagerie of animals created somewhat arbitrarily by God. As part of this insight he saw the impact that man made on the environment, and in some ways inspired what was to become the environmentalist movement.

For Humboldt the poetry and art of his observations were as important as the observations themselves. He was a close friend of Goethe who found him a great inspiration, as did Henry David Thoreau. This was at the time when Erasmus Darwin was publishing his “scientific poems”. This is curious to the eye of the modern working scientist, since modern science is not seen as a literary exercise. Perhaps a little more effort is spent on the technical method of presentation for visualisations but in large part scientific presentations are not works of beauty.

Humboldt was to go voyaging again in 1829, conducting a whistle-stop 15,000 mile 25 week journey across Russia sponsored by the government. On this trip he built on his earlier observations in South America as well as carrying out some mineral prospecting observations for his employers.

Despite a paid position in the Prussian court in Berlin he much preferred to spend his time in Paris, only pulled back to Berlin as the climate in Paris became less liberal and his paymaster more keen to see value for money.

Personally he seemed to be a mixed bag. He was generous in his support of other scientists but in conversation seems to have been a force of nature: Darwin came away from a meeting with him rather depressed – he had not managed to get a word in edgewise!

I’m increasingly conscious of how the climate of the time influences the way we write about the past. This seems particularly the case with The Invention of Nature. Humboldt’s work on what we would now call environmentalism and ecology is highly relevant today. He was the first to talk so explicitly about nature as a system, rather than a garden created by God. He pre-figures the study of ecology, and the more radical Gaia Hypothesis of James Lovelock. He was already alert to the damage man could do to the environment, and potentially how he could influence the weather if not the climate. There is a brief discussion of his potential homosexuality which seems to me another theme in keeping with modern times.

The Invention of Nature is sub-subtitled “The Lost Hero of Science”; this type of claim is always a little difficult. Humboldt was not lost, he was famous in his lifetime. His name is captured in the Humboldt Current, the Humboldt Penguin plus many further plants, animals and geographic features. He is not as well-known as he might be for his theories of the interconnectedness of nature; in this area he was eclipsed by Charles Darwin. In the epilogue Wulf suggests that part of his obscurity is due to anti-German sentiment in the aftermath of two World Wars. I suspect the area of the “appropriate renownedness of scientific figures of the past” is ripe for investigation.

The Invention of Nature is very readable. There are seven chapters illustrating Humboldt’s interactions with particular people (Johann Wolfgang von Goethe, Thomas Jefferson, Simon Bolivar, Charles Darwin, Henry David Thoreau, George Perkins Marsh, Ernst Haeckel and John Muir). Marsh was involved in the early environmental movement in the US, Muir in the founding of the Yosemite National Park (and other National Parks). At first I was a little offended by this: I bought a book on Humboldt, not these other chaps! However, then I remembered I actually prefer biographies which drift beyond the core character and this approach is very much in the style of Humboldt himself.

biography, history of science, women writers

Nov 24 2015

Parsing XML and HTML using xpath and lxml in Python

By SomeBeans in Technology

For the last few years my life has been full of the processing of HTML and XML using the lxml library for Python and the xpath query language. xpath is a query language designed specifically to search XML, unlike regular expressions which should definitely not be used to process XML related languages. Typically this has involved a lot of searching my own code to remind me how to do stuff. This blog post captures some handy snippets to avoid the inevitable Googling, and solidify for me exactly what I’ve been doing for the last few years!

But what does it {xml, html} look like?

XML and HTML are made up of “elements”, delimited by pointy brackets, which can carry attributes that are set equal to values:

<element1 attribute1="thing">content</element1>

Elements can be nested inside other elements to make a tree structure. A wrinkle to be aware of is the so-called “tail” of an element. This is most often seen with <br/> tags (although I think it is general):

<element1 attribute1="thing">content<br/>tail</element1>

The “content” is accessed using .text (or text() inside an xpath query), whilst the tail is accessed using .tail on the br element.
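As a minimal sketch of the difference between text and tail, the made-up fragment above can be parsed with lxml. The element and attribute names are just the placeholders used in this post, not anything from a real document.

```python
from lxml import etree

# A made-up fragment: "content" is the text of element1,
# "tail" is the tail of the <br/> element which precedes it.
fragment = '<element1 attribute1="thing">content<br/>tail</element1>'
root = etree.XML(fragment)
br = root[0]  # the first (and only) child element

print(root.text)               # the text before the first child element
print(br.tail)                 # the text which follows the <br/> tag
print(root.get("attribute1"))  # the value of the attribute
print(root.xpath("text()"))    # all the text nodes directly under element1
```

The point to notice is that the tail belongs to the br element, not to its parent, even though it sits inside the parent in the file.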

Web pages are made from HTML which is a “relaxed” XML format. XML is the basis of many other file formats found in the wild (such as GPX and GML). Dealing with XML is very similar to dealing with HTML except for namespaces, which I discuss in more detail at the end of this post.

XPath Helper

Before I get onto xpath I should introduce xpath helper – which is a plugin for Google Chrome which helps you develop xpath queries.

You can find XPath Helper in the Chrome Store, and it is free. I use it in combination with the Google Chrome Developer tools, particularly the “Inspect Element” functionality. XPath Helper allows you to see the results of an xpath query live. You open up the XPath console (Ctrl+Shift+X), type in your xpath and you see the results both in the XPath Helper console and as highlighting on the page.

You can get automatically generated xpath queries, however typically I have used these just as inspiration since they tend to be rather long and “brittle”.

Loading up the data

My Python scripts nearly always start with the following imports:

import lxml.html
import requests
import requests_cache
requests_cache.install_cache('demo_cache')

I use requests and requests_cache to access data on the web, and lxml.html to parse the HTML. Then I can get a webpage using:

r = requests.get(url)
root = lxml.html.fromstring(r.content)

You might want to make any URLs absolute rather than relative:

root.make_links_absolute(base_url)
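To see what this does without touching the network, here is a small sketch on a made-up snippet of HTML. The base URL and the links are hypothetical, chosen only to show one site-relative and one page-relative link being resolved.

```python
import lxml.html

# Hypothetical base URL and links, for illustration only.
base_url = "http://www.example.com/blog/"
snippet = '<div><a href="/about">About</a> <a href="post.html">Post</a></div>'

root = lxml.html.fromstring(snippet)
root.make_links_absolute(base_url)

# Both links are now resolved against base_url
links = root.xpath("//a/@href")
print(links)
```

The resolution follows the usual URL joining rules, so "/about" is resolved against the site root whilst "post.html" is resolved against the blog directory.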

If I’m dealing with XML rather than HTML then I might do:

from lxml import etree

And then when it came to loading in a local XML file:

with open(input_file, "rb") as f:
	root = etree.XML(f.read())
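A self-contained sketch of this, assuming a tiny made-up XML file written to a temporary location. Reading in binary mode matters here: as far as I know lxml refuses a Python string which carries an encoding declaration, but is happy with bytes. The etree.parse route shown second is an alternative to the approach above which lets lxml open the file itself.

```python
import os
import tempfile

from lxml import etree

# Write a small made-up XML file so the example is self-contained.
xml_bytes = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
             b'<gpx><trk><name>Walk</name></trk></gpx>')
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(xml_bytes)
    input_file = tmp.name

# Option 1: read the bytes and hand them to etree.XML, as above.
with open(input_file, "rb") as f:
    root = etree.XML(f.read())

# Option 2: let lxml open the file, then ask for the root element.
root_from_parse = etree.parse(input_file).getroot()

print(root.tag, root_from_parse.tag)
print(root.xpath("//name")[0].text)

os.remove(input_file)  # tidy up the temporary file
```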

XPath queries

With your root element in hand you can now get on with querying. Xpath queries are designed to extract a set of elements or attributes from an XML/HTML document by the name of the element, the value of an attribute on an element, by the relationship an element has with another element or by the content of an element.

Quite often xpath will return elements or lists of elements which, when printed in Python, don’t show you the content you want to see. To get the text content of an element you need to use .text, text_content(), or .tail, and make sure you ask for an array element rather than the whole array.

The following examples show the key features of xpath. I’m using this blog (http://www.ianhopkinson.org.uk/) as an example website so you can play along with xpath:

Specifying a complete path with / as separator

title = root.xpath('/html/body/div/div/div[2]/h1')

is the full path to my blog title. Notice how we request the 2nd element of the third set of div elements using div[2] – xpath arrays are one-based, not zero-based.
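If you want to try this offline, the sketch below uses a made-up page with the same shape of nesting. The page content is an assumption for illustration, it is not the real markup of this blog.

```python
import lxml.html

# A made-up page with the same nesting as the path above.
page = """
<html><body>
  <div><div>
    <div>sidebar</div>
    <div><h1>My blog title</h1></div>
  </div></div>
</body></html>
"""
root = lxml.html.fromstring(page)

# div[2] is the second div at that level: xpath counts from one...
title = root.xpath('/html/body/div/div/div[2]/h1')
# ...whilst the Python list which comes back counts from zero.
print(title[0].text)

first_div = root.xpath('/html/body/div/div/div[1]')
print(first_div[0].text)
```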

Specifying a path with wildcards using //

This expression also finds the title but the preamble of /html/body/div/div is absorbed by the // wildcard match:

title = root.xpath('//div[2]/h1')

To obtain the text of the title in Python, rather than an element object, we would do:

title_text = title[0].text.strip()

or maybe:

title_text = title[0].text_content().strip()

text_content() would pick up any tail content, and any text in child elements. I use strip() here to remove leading and trailing whitespace.
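The difference between the two is easiest to see when the title contains a child element. This is a sketch on a made-up page, where the em element inside the h1 is an assumption added purely to show the effect.

```python
import lxml.html

# A made-up page: the h1 has a child element inside it.
page = ('<html><body><div><div>menu</div>'
        '<div><h1>  Some<em>Beans</em> blog </h1></div>'
        '</div></body></html>')
root = lxml.html.fromstring(page)

title = root.xpath('//div[2]/h1')

# .text stops at the first child element
text_only = title[0].text.strip()
# text_content() gathers the text of children and their tails too
full_text = title[0].text_content().strip()

print(repr(text_only))
print(repr(full_text))
```

Here .text only gives the text up to the em element, whilst text_content() gives the whole of the visible title.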

Selecting attribute values

We’ve seen that //element selects all of the elements of type “element”. We select attribute values like this:

ids = root.xpath('//li/@id')

which selects the id attribute from the list elements (li) on my blog.
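A small sketch of what comes back, using a made-up list where the id values are hypothetical. The query returns plain strings rather than elements, and list elements without an id are simply skipped.

```python
import lxml.html

# A made-up list: two items have an id attribute, one does not.
snippet = ('<ul><li id="menu-item-1">Home</li>'
           '<li id="menu-item-2">About</li>'
           '<li>No id here</li></ul>')
root = lxml.html.fromstring(snippet)

# Attribute queries return strings, not elements
ids = root.xpath('//li/@id')
print(ids)

# The same values can be read from the elements in Python,
# in which case a missing attribute shows up as None
ids_via_get = [li.get("id") for li in root.xpath('//li')]
print(ids_via_get)
```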

Specifying an element by attribute

We can select elements which have particular attribute values:

tagcloud = root.xpath('//*[@class="tagcloud"]')

This selects the tag cloud on my blog by selecting elements which have the class attribute “tagcloud”.
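One thing to watch is that the = test compares the whole attribute value, so an element with several classes will not match. The sketch below uses made-up markup (the second div and its “widget” class are assumptions) to show this alongside the looser contains() test.

```python
import lxml.html

# Made-up markup: one div has exactly the class "tagcloud",
# the other has "tagcloud" as one of two classes.
snippet = ('<div>'
           '<div class="tagcloud"><a href="#">python</a><a href="#">xpath</a></div>'
           '<div class="widget tagcloud">other</div>'
           '</div>')
root = lxml.html.fromstring(snippet)

# Exact match on the whole attribute value
exact = root.xpath('//*[@class="tagcloud"]')
# Substring match, which also catches multi-class elements
loose = root.xpath('//*[contains(@class, "tagcloud")]')

# Queries can be run relative to an element we have already found
tags = [a.text for a in exact[0].xpath('./a')]

print(len(exact), len(loose), tags)
```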

Select an element containing some specified text

We can do something similar with the text content of an element:

title = root.xpath('//h1[contains(., "SomeBeans")]')

This selects h1 elements which contain the text “SomeBeans”.
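The dot in contains(., ...) stands for the whole text of the element including its children. A common variant uses text() instead, which behaves differently when the text is split up by child elements. This sketch, on made-up headings, shows the two side by side.

```python
import lxml.html

# Made-up headings: in the third one the text we want is inside a span.
snippet = ('<div>'
           '<h1>SomeBeans</h1>'
           '<h1>Another heading</h1>'
           '<h1>More <span>SomeBeans</span> here</h1>'
           '</div>')
root = lxml.html.fromstring(snippet)

# "." is the full text of the h1, including child elements
matches = root.xpath('//h1[contains(., "SomeBeans")]')

# text() in contains() only looks at the first text node of the h1
direct = root.xpath('//h1[contains(text(), "SomeBeans")]')

print(len(matches), len(direct))
```

The dot form finds both headings which contain “SomeBeans”, whilst the text() form misses the one where the word is wrapped in a span.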

Select via a parent or sibling relationship

Sometimes we want to select elements by their relationship to another element, for example:

subtitle = root.xpath('//h1[contains(@class,"header_title")]/../h2')

This selects the h1 title of my blog (SomeBeans) then navigates to the parent with .. and selects the sibling h2 element (the subtitle “the makings of a small casserole”).

The same effect can be achieved with the following-sibling keyword:

subtitle = root.xpath('//h1[contains(@class,"header_title")]/following-sibling::h2')
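Both routes can be tried on a made-up header, sketched below. The surrounding div and its id are assumptions, only the h1 and h2 content follow the description above.

```python
import lxml.html

# A made-up header block with a title and a subtitle.
snippet = ('<div id="header">'
           '<h1 class="header_title">SomeBeans</h1>'
           '<h2>the makings of a small casserole</h2>'
           '</div>')
root = lxml.html.fromstring(snippet)

# Up to the parent with "..", then down to the h2
via_parent = root.xpath('//h1[contains(@class,"header_title")]/../h2')

# Sideways to later siblings of the h1 which are h2 elements
via_sibling = root.xpath('//h1[contains(@class,"header_title")]/following-sibling::h2')

print(via_parent[0].text)
print(via_sibling[0].text)
```

The two are not quite equivalent in general: the parent route would also pick up an h2 which came before the h1, whereas following-sibling only looks at elements after it.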

XML Namespaces

When dealing with XML, we need to worry about namespaces. In principle the elements of an XML document are described in a schema which can be looked up and is universally unique. In practice the use of namespaces in XML documents can lead to much banging head against wall! This is largely because trivial examples of XML wrangling don’t use namespaces, except as a “special” example.

Here is a fragment of XML defining two namespaces:

<foo:Results xmlns:foo="http://www.foo.com" xmlns="http://www.bah.com">

xmlns:foo defines a namespace whose short form is “foo”. We select elements in this space using a namespaces parameter to the xpath query:

records = root.xpath('//foo:Title', namespaces = {"foo": "http://www.foo.com"})

The “catch” here is we also define a default namespace xmlns="http://www.bah.com", which means that elements which don’t have a prefix cannot be selected unless we define the namespace in our xpath:

records = root.xpath('//bah:Title', namespaces = {"bah": "http://www.bah.com"})

Worse than that we need to include our namespace prefix in the query, even though it doesn’t appear in the file!
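Putting the pieces together, here is a sketch using the fragment above, filled out with two made-up Title elements so that there is something to select. The “bah” prefix is our own invention, it only has to map to the right URL.

```python
from lxml import etree

# The fragment from above with two made-up child elements:
# one with the foo prefix, one in the default namespace.
xml = b'''<foo:Results xmlns:foo="http://www.foo.com" xmlns="http://www.bah.com">
  <foo:Title>prefixed title</foo:Title>
  <Title>default namespace title</Title>
</foo:Results>'''
root = etree.XML(xml)

# "bah" does not appear in the file, we choose it ourselves
ns = {"foo": "http://www.foo.com", "bah": "http://www.bah.com"}

prefixed = root.xpath('//foo:Title', namespaces=ns)
default = root.xpath('//bah:Title', namespaces=ns)

# Without a prefix nothing is found, because the unprefixed
# Title in the file really lives in the default namespace
no_prefix = root.xpath('//Title')

print(prefixed[0].text)
print(default[0].text)
print(no_prefix)
print(default[0].tag)  # lxml shows the full namespace in the tag
```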

Conclusion

These snippets cover the majority of the xpath queries I’ve needed over the past few years, and I’ll add any others as I find them. I’ve put all the code used here in a GitHub gist.

Xpath is the right tool for the job of extracting information from XML documents, including HTML – do not accept inferior alternatives!

data science, lxml, xpath

Nov 14 2015

Book Review: Canals: The making of a nation by Liz McIvor

By SomeBeans in Book Reviews

Canals: The making of a Nation by Liz McIvor is a tie-in with a BBC series of the same name, presented by the author. It is about canals in England from the mid-18th century through to the present day although most of the action takes place before the end of the 19th century.

The chapters of the book match the episodes of the series which are thematic, rather than chronological. Each chapter introduces a different topic, loosely tied to a particular canal.

The book starts with a discussion of the growth of London, and the Grand Junction canal linking it to Birmingham. The guild system was a factor in limiting the growth of the capital until the mid-18th century. The “Bubble Act” of 1720, enacted in the aftermath of the South Sea Bubble likely also had an impact. It prevented the formation of any joint stock company without an act of parliament to approve. It was repealed in 1825 before the railways saw their enormous growth. The Grand Junction canal was built as Birmingham became a manufacturing hub and London a great city with many requirements for daily life, and also a showroom to at least the United Kingdom, if not the world.

I was chastened to discover that the Bridgewater canal, one of the earliest of the “canal boom” projects of the 18th century, is only just up the road from me in Chester. I’d always assumed it was close to the town of a very similar name in Somerset! Bridgewater is named for the Duke of Bridgewater, Francis Egerton, for whom the pub just over the road from me is presumably named. The Bridgewater canal was built around 1760, linking the Duke’s coal mines at Worsley to Manchester. With this revelation I realise that the Bridgewater canal and the Liverpool to Manchester railway, the first exclusively steam railway, are sited very close to each other.

Support for manufacture was the theme of canal building in the North of England, and also around Birmingham with canals built to move bulky raw materials to factories placed to benefit from hydraulic power, and benevolent climates for the processing of materials such as cotton. Manufacturers such as Josiah Wedgwood were keen to see their fragile wares safely make the outward journey to the showrooms of London.

The Kennet and Avon canal was built to provide navigable water access from Bristol to London. William Smith, who produced the first geological map of Great Britain, is introduced in this chapter. I read more about him in The Map that Changed the World by Simon Winchester. The digging of canal cuts and tunnels reveals the local geology. Nowadays we see canals as bucolic thoroughfares but when they were built they were raw cuts indicating industrialisation.

The Manchester Ship Canal was opened in 1894 to bypass the port of Liverpool, in the dying days of canal building. 154 died in its construction and 1404 were seriously injured from a workforce of 16,361. For comparison, projects such as the 2012 London Olympics and the close-to-completion Crossrail project are of similar scale yet have casualty numbers hovering around zero, although these are best-in-class projects for health and safety. In this chapter McIvor talks more of the Irish “navigators” who built the canals, and something of the early trade union movement.

The families that worked the canals were seen as outsiders. Once the long networks were set up they led an itinerant lifestyle with no fixed church or school for their children. The Victorian moralists arguing for improved conditions for the boat families seem to do so from the point of view of pointing out how bloody awful they were!

It’s interesting to see the likes of Thomas Telford and John Rennie cropping up repeatedly in this book. They have the air of rockstar engineers, not a niche found these days. Perhaps this is a result of the work of the Victorian writer, Samuel Smiles, who was very keen on self-improvement and wrote biographies of these men to promote his ideas.

To me the book lacks a little prehistory. The great boom for canal building in the UK was at the end of the 18th century but the very first “pound lock” in England was built in 1566 on the Exeter canal. What went on between these two times? And what was happening elsewhere in the world? Perhaps the answer here is that the canals in Britain never represented a technological revolution, they were always about the social and commercial climate being right.

Canals: The Making of a Nation is an unchallenging read, well-suited to a holiday. If you’re on a canal boat it won’t tell you much about the particular bridges and tunnels you pass over but it will give you a strong feeling for the lives of the people that built and used the canals, and why they were built in the first place.

canals, industrial history, women writers

Oct 31 2015

Book review: Effective Computation in Physics by Anthony Scopatz & Kathryn D. Huff

By SomeBeans in Book Reviews

This next review, of “Effective Computation in Physics” by Anthony Scopatz & Kathryn D. Huff, arose after a brief discussion on twitter with Mike Croucher after my review of “High Performance Python” by Micha Gorelick and Ian Ozsvald. This was in the context of introducing students, primarily in the sciences, to programming and software development.

I use the term “software development” deliberately. Scientists have been taught programming (badly, in my view*) for many years. Typically they are given a short course in the first year of their undergraduate training, where they are taught the crude mechanics of a programming language (typically FORTRAN, C, Matlab or Python). They are then left to it, perhaps taking up projects requiring significant coding as final year projects or in PhDs. The thing they have lacked is the wider skillset around programming – what you might call “software development”. The value of this is two-fold. Firstly, it is a good training for a scientist to have for careers in science. Secondly, the wider software industry is full of scientists, so providing students with a good grounding in this field is no bad thing for their future employability.

The book covers in at least outline all the things a scientist or engineer needs to know about software development. It is inspired by the Software Carpentry and The Hacker Within programmes.

The restriction to physics in the title seems needless to me. The material presented is mostly applicable to any science, and those working in the digital humanities, undertaking programming work. The examples have a physics basis but not to any great depth, and the decorative historical anecdotes are all physics based. Perhaps the only exception to this is the chapter on HDF5 which is a specialised data storage system, some coverage of SQL databases would make a reasonable substitute for a more general course. The chapter on parallel computing could likewise be dropped for a wider audience.

The book is divided into four broad sections. Included in these are chapters on:

Command line operations;
Programming in Python;
Build systems, version control, debugging and testing;
Documentation, publication, collaboration and licensing;

Command line operations are covered in two chunks, firstly in the basic navigation of the file system and files followed by a second chapter on “Regular Expressions” which covers find, grep, sed and awk – at a very basic level.

The introduction to Python is similarly staged with initial chapters covering the fundamentals of the core language, with sufficient detail and explanation that I learnt some new things**. Further chapters introduce core Python libraries for data analysis including NumPy, Pandas and matplotlib.

Beyond these core chapters on Python those on version control, debugging and testing are a welcome addition. Our dearest wish at ScraperWiki, a small software company where I worked until recently, was that new recruits and interns would come with at least some knowledge and habit for using source control (preferably Git). It is also nice to see some wider discussion of GitHub and the culture of Pull Requests and issue tracking. Systematic testing is also a useful skill to have; in fact my experience has been that formal testing is most useful for those most physics-like functions.

The final section covers documentation, publication and licensing. I found the short chapter on licensing rather useful: I’ve been working on some code to analyse LIDAR data and have made it public on GitHub, which helpfully asks which license I would like to use. As it turns out I chose the MIT license and this seems to be the correct one for the application. On publication the authors are LaTeX evangelists but students can choose to ignore their monomania on this point. LaTeX has a cult-like following in physics which I’ve never understood. I have written papers in LaTeX but much prefer Microsoft Word for creating documents, although Google Docs is nice for collaborative work. The view that a source control repository issue tracker might work for collaboration beyond coding is optimistic unless academics have changed radically in the last few years.

I’d say the only thing lacking was any mention of pair programming, although to be fair that is more a teaching method than course material. I found I learnt most when I had a goal of my own to work towards, and I had the opportunity to pair with people with more knowledge than I had. Actually, pairing with someone equally clueless in a particular technology can work pretty well.

There is a degree to which the book, particularly in this section strays into a fantasy of how the authors wish computational physics was undertaken, rather than describing how it is actually undertaken.

To me this is the ideal “Software development for scientists” undergraduate text. It is opinionated in places and I occasionally found the style grating but nevertheless it covers the right bases.

*I’m happy to say this since I taught programming badly to physics undergraduates some years ago!

**People who know my Python skills will realise this is not an earthshattering claim.

data science, Python, source control, testing, women writers

Oct 30 2015

Analysing LIDAR data for the UK

By SomeBeans in Technology

I’m currently between jobs for a couple of weeks, so I have time to play with data.

The Environment Agency (EA) has recently released its LIDAR data for England, amounting to several terabytes of the stuff. LIDAR is a laser ranging technology which gives you the height profile of the surface under inspection. You can get a feel for the data from this excerpt of central Chester:

The brightness of a pixel shows the height of a feature, so the race course (lower left) appears dark since it is a low flat region close to the River Dee. The CWAC HQ building is tall and appears bright. To the north of the city are a set of three high rise flats, which appear bright. The distinctive cross-shape of the cathedral, with its high, bright central tower is also visible. It’s immediately obvious that LIDAR is an excellent tool for picking out the footprint of buildings.

We can use the image above to make a 3D projection view where the brightness of a pixel is mapped to height:

The orientation for this image is the same as that in the first image, the three tower blocks are visible top right, and the CWAC HQ visible lower left.

The images above used the lowest spatial resolution data, where each pixel is 2mx2m. The data released have spatial resolutions from 2m down to 25cm for selected areas. Looking at the areas with the high resolution data available it becomes very obvious what the primary uses of the data are: flood and coastal defences.

You can find the LIDAR data here. It’s divided up into several datasets. Surface data gives height information including all objects on the land such as buildings, trees, vehicles and so forth whilst Terrain data is processed to remove these artefacts and show the pristine land surface.

Composite data are data compiled to give maximum coverage by combining data from surveys conducted in different years and at different resolutions whilst Tile data are the underlying raw data collected in different years and different resolutions. The coverage sliders show the coverage of each dataset. The data are for England only.

The images of Chester shown above are an excerpt from a 10kmx10km tile, shown below:

Chester is on the left of this image, above the dark bend of the River Dee flood plain. To the right hand side we can see the valley of the River Gowy, and its tributaries – features which are not obvious on the ground or in Google Maps. The large black area is where there is no data, and smaller irregular black areas seem to correlate with water. You might just be able to pick out the line of the Shropshire Union canal cutting through the middle of the image.

I used Chester as an illustration because that’s where I live. I started looking at this data because I was curious, and I’ve spent a happy few days downloading data for lots of different places and playing with it.

It’s great to see data like this being released under permissive conditions. The Environment Agency has been collecting this data for its own purposes, and it’s been available from them commercially for a while – no doubt as a result of a central government edict to maximise revenue from it.

Opening the data like this means the curious can have a rummage, and perhaps others will find a commercial value in it.

I’ve included a few more images below. After them you can see the technical details of how to process these data and make the visualisations for yourself. The code is all in this GitHub repository:

https://github.com/IanHopkinson/defra-lidar-viewer

It is shared under the MIT license.

Liverpool in 3D with the Radio City tower

Liverpool Metropolitan Cathedral at 1m resolution

St Paul’s Cathedral

Technical Details

The code used to make the figures in this blog post can be found here:

https://github.com/IanHopkinson/defra-lidar-viewer

The GitHub repository contains a readme file which describes the code, and provides links to the original data, other useful commentary and the numerous bits of code I borrowed from the internet.

The data start as sets of zipped text file archives. Each archive contains the data for a 10kmx10km OS National Grid square – Chester is in the SJ46 cell. An archive contains a maximum of 100 text files, each one containing data for a single 1kmx1km square, and the size of this file depends on the resolution of the data. I wrote a Python program to read the data for a 10kmx10km cell and convert it into a PNG format image. This program also calculates the bounding box in latitude and longitude for the cell. The processing program works fine for 2m and 1m resolution data. It works just about for 50cm data but is slow and throws memory errors. For 25cm resolution data it doesn’t yet work.
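To give a flavour of the height-to-brightness step, here is a minimal sketch using only the standard library. It is not the code from the repository. It assumes the text files are in the ESRI ASCII grid layout (a few “name value” header lines followed by rows of heights), and the function names, sample heights and grid coordinates are all made up for illustration.

```python
def parse_ascii_grid(text):
    """Split a grid file into a header dict and rows of float heights."""
    header = {}
    rows = []
    for line in text.strip().splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0][0].isalpha():
            # Header lines look like "ncols 3"
            header[parts[0].lower()] = float(parts[1])
        else:
            rows.append([float(p) for p in parts])
    return header, rows


def to_brightness(rows, nodata):
    """Scale heights to 0-255, with missing data set to 0 (black)."""
    valid = [h for row in rows for h in row if h != nodata]
    low, high = min(valid), max(valid)
    span = (high - low) or 1.0  # avoid dividing by zero on flat tiles
    return [[0 if h == nodata else round(255 * (h - low) / span)
             for h in row] for row in rows]


# A tiny made-up tile: 3 columns by 2 rows with one missing value.
sample = """ncols 3
nrows 2
xllcorner 340000
yllcorner 366000
cellsize 2
NODATA_value -9999
10.0 14.0 30.0
-9999 18.0 30.0
"""

header, heights = parse_ascii_grid(sample)
pixels = to_brightness(heights, header["nodata_value"])
print(pixels)
```

The lowest point in the tile comes out black and the highest white, which is why low-lying areas appear dark in the images. Scaling each tile separately like this would make neighbouring tiles inconsistent, so a real program would want a fixed height range across the whole cell.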

I made a visualisation using the leaflet.js library which allows you to overlay the PNG images generated above onto OpenStreetMap maps. The opacity of the image can be varied with a slider so that you can match LIDAR features to map features. The registration between the two data sources is pretty good but there are systematic problems which I believe might be due to different mapping projections being used by the Ordnance Survey and OpenStreetMap.

A second visualisation tool uses the three.js library to make an interactive 3D view. The input data are manual crops of approximately 512×512 from the raw PNGs. I did this using Paint .NET but other image editors would work fine. Larger images work but they are smoothed to 512×512 in the rendering. A gotcha here is that the revision number of the three.js library is important – the code for this visualisation leant heavily on previous work by others, and whilst integrating new functionality it was important to use three.js source files from the same revision. This visualisation allows you to manipulate the view with the mouse. It takes a while to load up but once loaded it is pretty fast. Trying to upload a subsequent image doesn’t work.

I’m still working on the code. I’d like to be able to process the 25cm data and it would be good to select an area from the map and convert it to 3D view automatically.

data science, javascript, LIDAR, Python

I've worked as a scientist for the last 30 years, at various universities, a large home and personal care company, a startup in Liverpool called The Sensible Code Company (formerly ScraperWiki Ltd), GBG and now as a consultant in data science.

I write about:
* the books I have read, typically science and history (or both), partly as a reminder to myself and partly as a review;
* science, things I have done or things I find interesting;
* technology, programming and gadgets;
* politics, and current affairs;
* ...and other stuff as it takes my fancy - holidays, photographs and things I want to remember.

