Dr Administrator

Author's posts

Book Review: Canals: The making of a nation by Liz McIvor

canalsCanals: The making of a Nation by Liz McIvor is a tie-in with a BBC series of the same name, presented by the author. It is about canals in England from the mid-18th century through to the present day although most of the action takes place before the end of the 19th century.

The chapters of the book match the episodes of the series which are thematic, rather than chronological. Each chapter introduces a different topic, loosely tied to a particular canal.

The book starts with a discussion of the growth of London, and the Grand Junction canal linking it to Birmingham. The guild system was a factor in limiting the growth of the capital until the mid-18th century. The “Bubble Act” of 1720, enacted in the aftermath of the South Sea Bubble likely also had an impact. It prevented the formation of any joint stock company without an act of parliament to approve. It was repealed in 1825 before the railways saw their enormous growth. The Grand Junction canal was built as Birmingham became a manufacturing hub and London a great city with many requirements for daily life, and also a showroom to at least the United Kingdom, if not the world.

I was chastened to discover that the Bridgewater canal, one of the earliest of “canal boom” projects of the 18th century is only just up the road from me in Chester. I’d always assumed it was close to the town of a very similar name in Somerset! Bridgewater is named for the Duke of Bridgewater, Francis Egerton, for which the pub just over the road from me is presumably named. The Bridgewater canal was built around 1760, linking the Duke’s coal mines at Worsley to Manchester. With this revelation I realise that the Bridgewater canal and the Liverpool to Manchester railway, the first exclusively steam railway, are sited very close to each other.

Support for manufacture was the theme of canal building in the North of England, and also around Birmingham with canals built to move bulky raw materials to factories placed to benefit from hydraulic power, and benevolent climates for the processing of materials such as cotton. Manufacturers such as Josiah Wedgewood were keen to see their fragile wares safely make the outward journey to the showrooms of London.

The Kennet and Avon canal was built to provide navigable water access from Bristol to London. William Smith, who produced the first geological map of Great Britain is introduced in this chapter. I read more about him in The Map that Changed the World by Simon Winchester. The digging of canal cuts and tunnels reveal the local geology. Nowadays we see canals as bucolic thoroughfares but when they were built they were raw cuts indicating industrialisation.

The Manchester Ship Canal was opened in 1894 to bypass the port of Liverpool, these were the dying days of canal building. 154 died in its construction and 1404 were seriously injured from a workforce of 16,361. For comparison, projects such as the 2012 London Olympics and the close-to-completion Crossrail project are of similar scale yet have casualty numbers hovering around zero although these are best-in-class projects for health and safety. In this chapter McIvor talks more of the Irish “navigators” who built the canals, and something of the early trade union movement.

The families that worked the canals were seen as outsiders, once the long networks were set up they led an itinerant lifestyle with no fixed church or school for their children. The Victorian moralists arguing for improved conditions for the boat families seem to do so from the point of view of pointing out how bloody awful they were!

It’s interesting to see the likes of Thomas Telford and John Rennie cropping up repeatedly in this book. They have the air of rockstar engineers, not a niche found these days. Perhaps this is a result of the work of the Victorian writer, Samuel Smiles, who was very keen on self-improvement and wrote biographies of these men to promote his ideas.

To me the book lacks a little prehistory, the great boom for canal building in the UK was at the end of the 18th century but the very first “pound lock” in England was built in 1566 on the Exeter canal. What went on between these two times? And what was happening elsewhere in the world? Perhaps the answer here is that the canals in Britain never represented a technological revolution, they were always about the social and commercial climate being right.

Canals: The Making of a Nation is an unchallenging read, well-suited to a holiday. If you’re on a canal boat it won’t tell you much about the particular bridges and tunnels you pass over but it will give you a strong feeling for the lives of the people that built and used the canals, and why they were built in the first place.

Book review: Effective Computation in Physics by Anthony Scopatz & Kathryn D. Huff

ecipThis next review, of “Effective Computation in Physics” by Anthony Scopatz & Kathryn D. Huff, arose after a brief discussion on twitter with Mike Croucher after my review of “High Performance Python” by Ian Micha Gorelick and Ian Ozsvald. This in the context of introducing students, primarily in the sciences, to programming and software development.

I use the term “software development” deliberately. Scientists have been taught programming (badly, in my view*) for many years. Typically they are given a short course in the first year of their undergraduate training, where they are taught the crude mechanics of a programming language (typically FORTRAN, C, Matlab or Python). They are then left to it, perhaps taking up projects requiring significant coding as final year projects or in PhDs. The thing they have lacked is the wider skillset around programming – what you might call “software development”. The value of this is two-fold – firstly, it is a good training for a scientist to have for careers in science. Secondly, the wider software industry is full of scientists, providing students with a good grounding in this field is no bad thing for their future employability.

The book covers in at least outline all the things a scientist or engineer needs to know about software development. It is inspired by the Software Carpentry and The Hacker Within programmes.

The restriction to physics in the title seems needless to me. The material presented is mostly applicable to any science, and those working in the digital humanities, undertaking programming work. The examples have a physics basis but not to any great depth, and the decorative historical anecdotes are all physics based. Perhaps the only exception to this is the chapter on HDF5 which is a specialised data storage system, some coverage of SQL databases would make a reasonable substitute for a more general course. The chapter on parallel computing could likewise be dropped for a wider audience.

The book is divided into four broad sections. Including in these are chapters on:

  • Command line operations;
  • Programming in Python;
  • Build systems, version control, debugging and testing;
  • Documentation, publication, collaboration and licensing;

Command line operations are covered in two chunks, firstly in the basic navigation of the file system and files followed by a second chapter on “Regular Expressions” which covers find, grep, sed and awk – at a very basic level.

The introduction to Python is similarly staged with initial chapters covering the fundamentals of the core language, with sufficient detail and explanation that I learnt some new things**. Further chapters introduce core Python libraries for data analysis including NumPy, Pandas and matplotlib.

Beyond these core chapters on Python those on version control, debugging and testing are a welcome addition. Our dearest wish at ScraperWiki, a small software company where I worked until recently, was that new recruits and interns would come with at least some knowledge and habit for using source control (preferably Git). It is also nice to see some wider discussion of GitHub and the culture of Pull Requests and issue tracking. Systematic testing is also a useful skill to have, in fact my experience has been that formal testing is most useful for those most physics-like functions.

The final section covers documentation, publication and licensing. I found the short chapter on licensing rather useful, I’ve been working on some code to analyse LIDAR data and have made it public on GitHub, which helpfully asks which license I would like to use. As it turns out I chose the MIT license and this seems to be the correct one for the application. On publication the authors are Latex evangelists but students can chose to ignore their monomania on this point. Latex has a cult-like following in physics which I’ve never understood. I have written papers in Latex but much prefer Microsoft Word for creating documents, although Google Docs is nice for collaborative work. The view that a source control repository issue tracker might work for collaboration beyond coding is optimistic unless academics have changed radically in the last few years.

I’d say the only thing lacking was any mention of pair programming, although to be fair that is more a teaching method than course material. I found I learnt most when I had a goal of my own to work towards, and I had the opportunity to pair with people with more knowledge than I had. Actually, pairing with someone equally clueless in a particular technology can work pretty well.

There is a degree to which the book, particularly in this section strays into a fantasy of how the authors wish computational physics was undertaken, rather than describing how it is actually undertaken.

To me this is the ideal “Software development for scientists” undergraduate text, it is opinionated in places and I occasionally I found the style grating but nevertheless it covers the right bases.

*I’m happy to say this since I taught programming badly to physics undergraduates some years ago!

**People who know my Python skills will realise this is not an earthshattering claim.

Analysing LIDAR data for the UK

I’m currently between jobs for a couple of weeks, so I have time to play with data.

The Environment Agency (EA) has recently released it’s LIDAR data for England amounting to several terabytes of the stuff. LIDAR is a laser ranging technology which gives you the height profile of the surface under inspection. You can get a feel for the data from this excerpt of central Chester:

SJ46-Chester-512x512

The brightness of a pixel shows the height of a feature, so the race course (lower left) appears dark since it is a low flat region close to the River Dee. The CWAC HQ building is tall and appears bright. To the north of the city are a set of three high rise flats, which appear bright. The distinctive cross-shape of the cathedral, with it’s high, bright central tower is also visible. It’s immediately obvious that LIDAR is an excellent tool for picking out the footprint of buildings.

We can use the image above to make a 3D projection view where the brightness of a pixel is mapped to height:

Chester-3D

The orientation for this image is the same as that in the first image, the three tower blocks are visible top right, and the CWAC HQ visible lower left.

The images above used the lowest spatial resolution data, each pixel is 2mx2m. The data have released have spatial resolutions 2m down to 25cm for selected areas. Looking at the areas with the high resolution data available it becomes very obvious what the primary uses of the data are: flood and coastal defences.

You can find the LIDAR data here. It’s divided up into several datasets. Surface data gives height information including all objects on the land such as buildings, trees, vehicles and so forth whilst Terrain data is processed to remove these artefacts and show the pristine land surface.

Composite data are data compiled to give maximum coverage by combining data from surveys conducted in different years and at different resolutions whilst Tile data are the underlying raw data collected in different years and different resolutions. The coverage sliders show the coverage of each dataset. The data are for England only.

The images of Chester shown above are an excerpt from a 10kmx10km tile, shown below:

SJ46

Chester is on the left of this image, above the dark bend of River Dee flood plain. To the right hand side we can see the valley of the River Gowy, and its tributaries – features which are not obvious on the ground or in Google Maps. The large black area is where there is no data, smaller irregular black seem to correlate with water, you might just be able to pick out the line of the Shropshire Union canal cutting through the middle of the image.

I used Chester as an illustration because that’s where I live. I started looking at this data because I was curious, and I’ve spent a happy few days downloading data for lots of different places and playing with it.

It’s great to see data like this being released under permissive conditions. The Environment Agency has been collecting this data for its own purposes, and it’s been available from them commercially for a while – no doubt as a result of a central government edict to maximise revenue from it.

Opening the data like this means the curious can have a rummage, and perhaps others will find a commercial value in it.

I’ve included a few more images below. After them you can see the technical details of how to process these data and make the visualisations for yourself, the code is all in this GitHub repository:

https://github.com/IanHopkinson/defra-lidar-viewer

It is shared under the MIT license.

Liverpool in 3D with the Radio City tower

Liverpool-3D

Liverpool Metropolitan Cathedral at 1m resolution

Liverpool-Metropolitan-cathedral-3D

St Paul’s Cathedral

StPauls-3D

Technical Details

The code used to make the figures in this blog post can be found here:

https://github.com/IanHopkinson/defra-lidar-viewer

The GitHub repository contains a readme file which describes the code, and provides links to the original data, other useful commentary and the numerous bits of code I borrowed from the internet.

The data start as sets of zipped text file archives, each archive contains the data for a 10kmx10km OS National Grid square – Chester is in the SJ46 cell. An archive contains a maximum of 100 text files, each one containing data for a single 1kmx1km square, the size of this file depends on the resolution of the data. I wrote a Python program to read the data for a 10kmx10km cell and convert it into a PNG format image. This program also calculates the bounding box in latitude and longitude for the cell. The processing program works fine for 2m and 1m resolution data. It works just about for 50cm data but is slow and throws memory errors. For 25cm resolution data it doesn’t yet work.

I made a visualisation using the leaflet.js library which allows you to overlay the PNG images generated above onto OpenStreetMap maps. The opacity of the image can be varied with a slider so that you can match LIDAR features to map features. The registration between the two data sources is pretty good but there are systematic problems which I believe might be due to different mapping projections being used by the Ordnance Survey and OpenStreetMap.

map-overlay

A second visualisation tool uses the three.js library to make an interactive 3D view. The input data are manual crops of approximately 512×512 from the raw PNGs, I did this using Paint .NET but other image editors would work fine. Larger images work but they are smoothed to 512×512 in the rendering. A gotcha here is that the revision number of the three.js library is important – the code for this visualisation leant heavily on previous work by others, and whilst integrating new functionality it was important to use three.js source files from the same revision. This visualisation allows you to manipulate the view with the mouse, it takes while to load up but once loaded it is pretty fast. Trying to upload a subsequent image doesn’t work.

3D-view

I’m still working on the code, I’d like to be able to process the 25cm data and it would be good to select an area from the map and convert it to 3D view automatically.

Book review: High Performance Python by Micha Gorelick & Ian Ozsvald

highperformancepythonHigh Performance Python by Micha Gorelick and Ian Ozsvald is nominally a book about improving the speed and memory performance of your programs. Along the way it provides insight into some more advanced aspects of Python programming, including how the language works under the hood.

The book starts with tools for analysing the speed and memory performance of programs at global, function and line level. The authors emphasis the importance making these measurements, and using unit testing to ease the process of optimisation. Blindly optimising where you *think* the problem lies is never a good idea.

The next set of chapters talk about some core Python data structures a little about their implementation and relative performance. These include lists and tuples, dictionaries and sets, iterators and generators and matrices and vectors.It is here that the numpy library is introduced, and it is treated almost as a core Python library in its importance.

The difference between range and xrange in Python 2 is striking: if you wish to execute a loop some number of times then range builds a list of size that number of elements and xrange makes a generator and therefore xrange uses far less memory.

The next few chapters cover compiling Python to C for speed increases, concurrency, the multiprocessing module and clusters. Typically chapters take an example (Julia sets, diffusion equations, estimating pi, finding primes) and demonstrate the speedups which can be made, from the routine to the ridiculous. The authors point out when further optimisation is a bit pointless.

For compiling to C, there are a number of options of varying coverage and maturity. Cython offers the most mature, widest coverage but at the cost of making breaking changes to code. Other newer solutions include Numba and PyPy, they do not require breaking changes to code but they are less mature and in the case of PyPy do not support the important numpy library.

Concurrency is about making better use of a single processor using asynchronous methods, here there are libraries such as gevent and tornado.

For parallel processing most focus is on the multiprocessing library, most of the book is platform agnostic but this chapter is based on the Linux implementation of Python.I hadn’t realised before that “embarrassingly parallel” had a specific meaning i.e. that there is no need for interprocess communication for the problem at hand.

The coverage of computing clusters is fairly cursory, this isn’t really the focus of this book and as the authors highlight: running clusters of machines can bring a significant administration overhead.

The book finishes with a chapter on reducing RAM usage, either by choice of intrinsic data types or using probabilistic data structures such as the Bloom filter which offer an approximate answer for vastly less memory usage. Also included are the Morris Counter which provides an approximate counter in 1 byte of storage, I must admit to being bemused as to when I would need such a thing.

Finally, there are what I refer to as “war stories” from practitioners in the field. I really liked these, one of the difficulties in working in technology is the constant stream of options to choose from, often with no clear frontrunner, so learning how others have approached problems is really handy. Here Celery, the Distributed Task Queue, and ElasticSearch get multiple mentions.

Overall the book is well-produced and readable. It occasionally lapses into the problem of inviting the reader to admire the colour in a greyscale printed plot. In other places I felt the example code could have presented up front in its entirety rather than being dribbled out in bits. In the chapter on Using less RAM, I felt important things were discussed (tries and directed-acyclic word graphs (DAWGs) before they were introduced which was a bit confusing. Tries and DAWGs are systems for the compact storage of text, and are both tree-like structures – I hadn’t come across these before.

In some ways this book is more about productionisation rather than performance. For the straightforward non-production data analysis work I’ve done I can imagine being a bit smarter about my choice of data structures and using profiling to be aware the slow points are as a result of this book. In the repeated reanalysis cycle it is nice to have something run in a minute rather than 20, but is it worth a day of development time? I would likely only turn to compilation, concurrency and multiprocessing if I were going to use a particular analysis regularly, or my anticipated run time was going to be measured in days without optimization.

I recommend this book to anyone looking to advance their understanding of Python, and speed up their code.

A Docker environment for Windows (October 2015 edition)

This blog post provides an outline method for installing a nice environment for developing in Python using Docker on a Windows 10 machine. Hopefully I have provided sufficient of the error messages I encountered that both myself and others will find this post when in distress!

The past three years I’ve been working for ScraperWiki as a data scientist, this has meant a degree of coding in Python and interacting with my colleagues, and some customers, who use Linux (principally Ubuntu) or OS X. I have continued to use a Windows laptop. You can see my review of it here.

Until recently my setup was based on a core installation of Python and a whole bunch of handy libraries using Python(x,y). I also installed Git for Windows which gave me a shell prompt, and the command-line git commands along with some fraction of the bash environment. I also installed msysgit which provided further Linux style enhancements to my shell. I configured my shell so I could get ssh access to ScraperWiki servers in the cloud. For reasons I can’t recall I also installed ansicon.exe which gives the Windows Command prompt some of the colour highlighting of a modern shell prompt.

With this setup I could do most of what I needed from Windows, and if I had to I could fire up my Ubuntu VM and work in there. Typically I did this when I had some tricky libraries to install, or I wanted to be sure I could deploy onto ScraperWiki’s servers in the cloud. I never really got virtual environments working nicely on Windows – virtualenvwrapper, which makes such things, nicer is challenging to configure on Windows.

Students of this sort of thing will appreciate that the configuration described above is reached with a degree of trial and error, and a lot of Googling of error messages.

Times have changed and this setup was getting a bit long in the tooth, the environment around me was also changing – we started using Docker. I couldn’t get code using the Python requests library to run because of problems with SSL. Also, all the cool kids were talking about Python 3 and how new projects should all be in Python 3. I couldn’t work out how to add Python 3 to my Python(x,y) installation, and furthermore I was currently tied to 32-bit rather than 64-bit Python. ScraperWiki had recently done some work on making an easily deployable Python application and identified that the Anaconda Python distribution by ContinuumIO was the way to go.

Installing Python 3 and 2 using Anaconda

This worked very smoothly, there is an installer here. I had Python 3.4.3 (64-bit) working in the twinkling of an eye, and from my bash prompt I could now run that Python code which was previously broken due to OpenSSL problems. However, all was not rosy since it turns out my latest project was accidently Python 3 compatible, whilst my older projects were not. I therefore needed Python 2 as well. In principle, with Anaconda this is as simple as doing:

conda create -n python2 python=2.7 anaconda

and then

source activate python2

This puts you into a Python 2 virtual environment which will run your old code. However, it doesn’t work from the Git Bash prompt, you need to use a Windows Command prompt, as discussed here. But at least I now have the latest whizzy Python 3 installation and I can also run Python 2, when required. It’s worth noting that installing new libraries on Python under Windows has become rather easier with newer versions of pip, I believe due to the introduction of pip wheel. In the past installing some libraries was a pain because of a need to compile binary components.

Using Docker on Windows with Docker Toolbox and the Git SDK

The next task was to get support for Docker, the container system. You can find out more about Docker in my blog post here. Essentially it is a method for running an application in an isolation unit which is defined by a simple Dockerfile, largely removing problems of dependencies and versions. Docker is intrinsically a Linux technology, it relies on several deeply embedded components of the operating system and so does not run on Windows. However, you can boot up a very lightweight Linux-based VM and run Docker images on that from either Windows or OS X. Until recently this was done using boot2docker. The new way is to use the Docker Toolbox. I held off installing this until it became Windows 10 compatible since as a neophile I have obviously upgraded to Windows 10 at earliest opportunity. Docker Toolbox installs VirtualBox to run a VM to host Docker and Git for Windows to provide a bash shell prompt, as well as the Docker commandline tools.

I found installing Docker Toolbox relatively smooth although I had a problem with it finding ssh key files with an error message “open <filepath\ca.pem : The system cannot find the file specified” which was fixed by regenerating the key files:

docker-machine regenerate-certs default

But this alone does not give me the right workflow since ScraperWiki make heavy use of Make to build and run containers and Git for Windows does not come configured with Make. You can see this in action for the Simple API we made for the NewsReader Project. I used the Git for Windows SDK to provide Make and other build tools. This is designed for use by Git for Windows developers, it’s based on msys2 which I also tried to install but which errored on a couple of steps. The Git SDK is more verbose in its installation appeared to install cleanly.

Once we have Git for Windows SDK we need to use its git-bash to launch Docker Quickstart Terminal (rather than the version provided by the Toolbox), this means changing the command executed by the Docker Quickstart Terminal shortcut from:

"C:\Program Files\Git\git-bash.exe" "C:\Program Files\Docker Toolbox\start.sh"

to

C:\git-sdk-64\git-bash.exe "C:\Program Files\Docker Toolbox\start.sh"

Update 2016-03-21: I modified start.sh to give the docker-machine binary an absolute path, this means I can launch a plain Git Bash shell and run the start.sh script later, if required. This change requires further modification to make sure paths were properly escaped. You can see my version of start.sh here: https://gist.github.com/IanHopkinson/85453a90212eb6627f29

Simply trying to run the Git SDK version of the make tool does not seem to work, you get an error like “unable to make temporary trusted Dockerfile”.

We’re into the final straight now!

My final problem was that when I tried to make my previously working application it failed with an error message:

IOError: [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/home/newsreader-demo']

The problem seems to be the way in which msys2 handles paths in Windows it needs to have two preceding //, rather than one, as described here. So all I need to do is change this line in my Makefile

@docker run -p 8000:8000 --read-only --rm --volume /tmp -e NEWSREADER_PUBLIC_API_KEY ianhopkinson/newsreader_demo

To this:

@docker run -p 8000:8000 --read-only --rm --volume //tmp -e NEWSREADER_PUBLIC_API_KEY ianhopkinson/newsreader_demo

Can you see what I did there?

Update 2015-10-14 – interactive shells into docker

If you try to get an interactive shell on a container then you get an error like:

cannot enable tty mode on non tty input

To avoid this you can use winpty:

winpty docker exec -i -t [CONTAINER_NAME] bash

There’s some discussion of this on the Docker Toolbox issue tracker

Update 2015-10-22 – Which Python are you using?

It turns out I was accidentally using the Python shipped with Git for Windows SDK, rather than the Anaconda version I had so carefully installed. I fixed this by adding this to my .profile file:

export PATH=/c/anaconda3/:$PATH

I didn’t spot it earlier because I checked Python version by running ipython rather than python.

Concluding thoughts

I wrote this partly in frustration at the amount of time I spent getting this all fixed up, and the fact that I couldn’t stop until I had fixed it. The scheme above worked for me but I suspect it is quicker and easier to do on a laptop with no history.

There’s no doubt that the situation is better than I found it 3 years ago but it is still a painful process involving much trial and error. Docker brings great benefits for developers, and once it is working makes sharing your work across multiple users very straightforward.