Category: Technology

Programming, gadgets (reviews thereof) and computers

Analysing LIDAR data for the UK

I’m currently between jobs for a couple of weeks, so I have time to play with data.

The Environment Agency (EA) has recently released its LIDAR data for England, amounting to several terabytes of the stuff. LIDAR is a laser ranging technology which gives you the height profile of the surface under inspection. You can get a feel for the data from this excerpt of central Chester:

SJ46-Chester-512x512

The brightness of a pixel shows the height of a feature, so the race course (lower left) appears dark since it is a low flat region close to the River Dee. The CWAC HQ building is tall and appears bright. To the north of the city a set of three high-rise flats also appears bright. The distinctive cross-shape of the cathedral, with its high, bright central tower, is also visible. It’s immediately obvious that LIDAR is an excellent tool for picking out the footprint of buildings.

We can use the image above to make a 3D projection view where the brightness of a pixel is mapped to height:

Chester-3D

The orientation of this image is the same as in the first image: the three tower blocks are visible top right, and the CWAC HQ lower left.
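The interactive 3D views in this post were made with three.js (see the technical details below), but for a quick static render you can get a similar effect in Python with matplotlib. This is a minimal sketch, assuming the height map has been saved as a greyscale PNG; the filename here is illustrative:

# Quick static 3D render of a greyscale height map. Sketch only - the
# interactive views in this post use three.js, and the filename is illustrative.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: registers the "3d" projection
from matplotlib import cm
from PIL import Image

heights = np.array(Image.open("SJ46-Chester-512x512.png").convert("L"), dtype=float)
heights = heights[::4, ::4]  # downsample so the surface plot stays responsive

x, y = np.meshgrid(np.arange(heights.shape[1]), np.arange(heights.shape[0]))
fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.plot_surface(x, y, heights, cmap=cm.viridis, linewidth=0)
ax.set_axis_off()
plt.show()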

The images above use the lowest spatial resolution data, in which each pixel is 2m x 2m. The data released have spatial resolutions from 2m down to 25cm for selected areas. Looking at the areas where high resolution data are available, it becomes very obvious what the primary uses of the data are: flood and coastal defences.

You can find the LIDAR data here. It’s divided up into several datasets. Surface data gives height information including all objects on the land such as buildings, trees, vehicles and so forth whilst Terrain data is processed to remove these artefacts and show the pristine land surface.

Composite data are data compiled to give maximum coverage by combining data from surveys conducted in different years and at different resolutions whilst Tile data are the underlying raw data collected in different years and different resolutions. The coverage sliders show the coverage of each dataset. The data are for England only.

The images of Chester shown above are an excerpt from a 10kmx10km tile, shown below:

SJ46

Chester is on the left of this image, above the dark bend of the River Dee flood plain. To the right hand side we can see the valley of the River Gowy and its tributaries – features which are not obvious on the ground or in Google Maps. The large black area is where there is no data; smaller irregular black patches seem to correlate with water. You might just be able to pick out the line of the Shropshire Union Canal cutting through the middle of the image.

I used Chester as an illustration because that’s where I live. I started looking at this data because I was curious, and I’ve spent a happy few days downloading data for lots of different places and playing with it.

It’s great to see data like this being released under permissive conditions. The Environment Agency has been collecting this data for its own purposes, and it’s been available from them commercially for a while – no doubt as a result of a central government edict to maximise revenue from it.

Opening the data like this means the curious can have a rummage, and perhaps others will find a commercial value in it.

I’ve included a few more images below. After them you can see the technical details of how to process these data and make the visualisations for yourself; the code is all in this GitHub repository:

https://github.com/IanHopkinson/defra-lidar-viewer

It is shared under the MIT license.

Liverpool in 3D with the Radio City tower

Liverpool-3D

Liverpool Metropolitan Cathedral at 1m resolution

Liverpool-Metropolitan-cathedral-3D

St Paul’s Cathedral

StPauls-3D

Technical Details

The code used to make the figures in this blog post can be found here:

https://github.com/IanHopkinson/defra-lidar-viewer

The GitHub repository contains a readme file which describes the code, and provides links to the original data, other useful commentary and the numerous bits of code I borrowed from the internet.

The data start as sets of zipped text file archives, each archive containing the data for a 10kmx10km OS National Grid square – Chester is in the SJ46 cell. An archive contains a maximum of 100 text files, each one containing data for a single 1kmx1km square; the size of each file depends on the resolution of the data. I wrote a Python program to read the data for a 10kmx10km cell and convert it into a PNG format image. This program also calculates the bounding box in latitude and longitude for the cell. The processing program works fine for 2m and 1m resolution data. It just about works for 50cm data, but it is slow and throws memory errors. For 25cm resolution data it doesn’t yet work.
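The core of the conversion is straightforward. Below is a minimal sketch of reading a single tile and scaling it to an 8-bit greyscale image; it assumes the text files are ESRI ASCII grid (.asc) files with the usual six-line header, and the filenames are illustrative – the repository code additionally stitches the 1kmx1km tiles together into a 10kmx10km cell:

# Read one ESRI ASCII grid (.asc) tile and scale it to an 8-bit greyscale PNG.
# Assumes the standard .asc header (ncols, nrows, xllcorner, yllcorner,
# cellsize, NODATA_value); filenames are illustrative.
import numpy as np
from PIL import Image

def read_asc(filename):
    header = {}
    with open(filename) as f:
        for _ in range(6):
            key, value = f.readline().split()
            header[key.lower()] = float(value)
        data = np.loadtxt(f)
    nodata = header.get("nodata_value", -9999.0)
    data[data == nodata] = np.nan
    return header, data

def to_png(data, filename):
    lo, hi = np.nanmin(data), np.nanmax(data)
    scaled = (data - lo) / (hi - lo) * 255.0
    scaled = np.nan_to_num(scaled)  # NODATA pixels become black
    Image.fromarray(scaled.astype(np.uint8)).save(filename)

header, heights = read_asc("sj4565_DSM_2m.asc")
to_png(heights, "sj4565_DSM_2m.png")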

I made a visualisation using the leaflet.js library which allows you to overlay the PNG images generated above onto OpenStreetMap maps. The opacity of the image can be varied with a slider so that you can match LIDAR features to map features. The registration between the two data sources is pretty good but there are systematic problems which I believe might be due to different mapping projections being used by the Ordnance Survey and OpenStreetMap.

map-overlay
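The bounding box mentioned above amounts to converting OS National Grid eastings and northings (EPSG:27700) into the WGS84 latitude and longitude (EPSG:4326) that leaflet.js expects, and this conversion is a likely source of small registration offsets. A minimal sketch using pyproj, which is not necessarily how the repository code does it, with the corners of the SJ46 cell:

# Convert OS National Grid eastings/northings (EPSG:27700) to WGS84
# latitude/longitude (EPSG:4326) for a leaflet.js image overlay bounding box.
# Sketch only - the repository code may do this differently.
from pyproj import Transformer

transformer = Transformer.from_crs("EPSG:27700", "EPSG:4326")

# South-west and north-east corners of the SJ46 10kmx10km cell
sw = transformer.transform(340000, 360000)
ne = transformer.transform(350000, 370000)
print(sw, ne)  # (latitude, longitude) pairs for the overlay bounds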

A second visualisation tool uses the three.js library to make an interactive 3D view. The input data are manual crops of approximately 512×512 pixels from the raw PNGs; I did this using Paint.NET but other image editors would work fine. Larger images work but they are smoothed to 512×512 in the rendering. A gotcha here is that the revision number of the three.js library is important – the code for this visualisation leant heavily on previous work by others, and whilst integrating new functionality it was important to use three.js source files from the same revision. This visualisation allows you to manipulate the view with the mouse; it takes a while to load up but once loaded it is pretty fast. Trying to upload a subsequent image doesn’t work.

3D-view

I’m still working on the code; I’d like to be able to process the 25cm data, and it would be good to be able to select an area from the map and convert it to a 3D view automatically.

A Docker environment for Windows (October 2015 edition)

This blog post provides an outline method for installing a nice environment for developing in Python using Docker on a Windows 10 machine. Hopefully I have included enough of the error messages I encountered that both I and others will find this post when in distress!

For the past three years I’ve been working for ScraperWiki as a data scientist; this has meant a degree of coding in Python and interacting with my colleagues, and some customers, who use Linux (principally Ubuntu) or OS X. I have continued to use a Windows laptop. You can see my review of it here.

Until recently my setup was based on a core installation of Python and a whole bunch of handy libraries using Python(x,y). I also installed Git for Windows which gave me a shell prompt, and the command-line git commands along with some fraction of the bash environment. I also installed msysgit which provided further Linux style enhancements to my shell. I configured my shell so I could get ssh access to ScraperWiki servers in the cloud. For reasons I can’t recall I also installed ansicon.exe which gives the Windows Command prompt some of the colour highlighting of a modern shell prompt.

With this setup I could do most of what I needed from Windows, and if I had to I could fire up my Ubuntu VM and work in there. Typically I did this when I had some tricky libraries to install, or I wanted to be sure I could deploy onto ScraperWiki’s servers in the cloud. I never really got virtual environments working nicely on Windows – virtualenvwrapper, which makes such things nicer, is challenging to configure on Windows.

Students of this sort of thing will appreciate that the configuration described above is reached with a degree of trial and error, and a lot of Googling of error messages.

Times have changed and this setup was getting a bit long in the tooth; the environment around me was also changing – we started using Docker. I couldn’t get code using the Python requests library to run because of problems with SSL. Also, all the cool kids were talking about Python 3 and how new projects should all be in Python 3. I couldn’t work out how to add Python 3 to my Python(x,y) installation, and furthermore I was tied to 32-bit rather than 64-bit Python. ScraperWiki had recently done some work on making an easily deployable Python application and identified that the Anaconda Python distribution by ContinuumIO was the way to go.

Installing Python 3 and 2 using Anaconda

This worked very smoothly; there is an installer here. I had Python 3.4.3 (64-bit) working in the twinkling of an eye, and from my bash prompt I could now run the Python code which was previously broken due to OpenSSL problems. However, all was not rosy since it turned out my latest project was accidentally Python 3 compatible, whilst my older projects were not. I therefore needed Python 2 as well. In principle, with Anaconda this is as simple as doing:

conda create -n python2 python=2.7 anaconda

and then

source activate python2

This puts you into a Python 2 virtual environment which will run your old code. However, it doesn’t work from the Git Bash prompt; you need to use a Windows Command prompt, as discussed here. But at least I now have the latest whizzy Python 3 installation and I can also run Python 2, when required. It’s worth noting that installing new libraries on Python under Windows has become rather easier with newer versions of pip, I believe due to the introduction of pip wheels. In the past installing some libraries was a pain because of the need to compile binary components.

Using Docker on Windows with Docker Toolbox and the Git SDK

The next task was to get support for Docker, the container system. You can find out more about Docker in my blog post here. Essentially it is a method for running an application in an isolation unit which is defined by a simple Dockerfile, largely removing problems of dependencies and versions. Docker is intrinsically a Linux technology; it relies on several deeply embedded components of the operating system and so does not run on Windows. However, you can boot up a very lightweight Linux-based VM and run Docker images on that from either Windows or OS X. Until recently this was done using boot2docker. The new way is to use the Docker Toolbox. I held off installing this until it became Windows 10 compatible since, as a neophile, I had obviously upgraded to Windows 10 at the earliest opportunity. Docker Toolbox installs VirtualBox to run a VM to host Docker, and Git for Windows to provide a bash shell prompt, as well as the Docker command-line tools.

I found installing Docker Toolbox relatively smooth, although I had a problem with it finding ssh key files, with an error message “open <filepath\ca.pem : The system cannot find the file specified”, which was fixed by regenerating the key files:

docker-machine regenerate-certs default

But this alone does not give me the right workflow since ScraperWiki makes heavy use of Make to build and run containers, and Git for Windows does not come configured with Make. You can see this in action for the Simple API we made for the NewsReader Project. I used the Git for Windows SDK to provide Make and other build tools. This is designed for use by Git for Windows developers; it’s based on msys2, which I also tried to install but which errored on a couple of steps. The Git SDK is more verbose in its installation but appeared to install cleanly.

Once we have the Git for Windows SDK we need to use its git-bash to launch the Docker Quickstart Terminal (rather than the version provided by the Toolbox); this means changing the command executed by the Docker Quickstart Terminal shortcut from:

"C:\Program Files\Git\git-bash.exe" "C:\Program Files\Docker Toolbox\start.sh"

to

C:\git-sdk-64\git-bash.exe "C:\Program Files\Docker Toolbox\start.sh"

Update 2016-03-21: I modified start.sh to give the docker-machine binary an absolute path, which means I can launch a plain Git Bash shell and run the start.sh script later, if required. This change required further modification to make sure paths were properly escaped. You can see my version of start.sh here: https://gist.github.com/IanHopkinson/85453a90212eb6627f29

Simply trying to run the Git SDK version of the make tool does not seem to work; you get an error like “unable to make temporary trusted Dockerfile”.

We’re into the final straight now!

My final problem was that when I tried to make my previously working application it failed with an error message:

IOError: [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/home/newsreader-demo']

The problem seems to be the way in which msys2 handles paths in Windows: it needs two preceding slashes, //, rather than one, as described here. So all I need to do is change this line in my Makefile:

@docker run -p 8000:8000 --read-only --rm --volume /tmp -e NEWSREADER_PUBLIC_API_KEY ianhopkinson/newsreader_demo

To this:

@docker run -p 8000:8000 --read-only --rm --volume //tmp -e NEWSREADER_PUBLIC_API_KEY ianhopkinson/newsreader_demo

Can you see what I did there?

Update 2015-10-14 – interactive shells into docker

If you try to get an interactive shell on a container then you get an error like:

cannot enable tty mode on non tty input

To avoid this you can use winpty:

winpty docker exec -i -t [CONTAINER_NAME] bash

There’s some discussion of this on the Docker Toolbox issue tracker.

Update 2015-10-22 – Which Python are you using?

It turns out I was accidentally using the Python shipped with Git for Windows SDK, rather than the Anaconda version I had so carefully installed. I fixed this by adding this to my .profile file:

export PATH=/c/anaconda3/:$PATH

I didn’t spot it earlier because I checked the Python version by running ipython rather than python.

Concluding thoughts

I wrote this partly in frustration at the amount of time I spent getting this all fixed up, and the fact that I couldn’t stop until I had fixed it. The scheme above worked for me but I suspect it is quicker and easier to do on a laptop with no history.

There’s no doubt that the situation is better than I found it 3 years ago but it is still a painful process involving much trial and error. Docker brings great benefits for developers, and once it is working makes sharing your work across multiple users very straightforward.

The London Underground – Can I walk it?

caniwalkit

There are tube strikes planned for 25th August 2015 and 28th August 2015, with disruption through the week. The nature of the London Underground means that it is not at all obvious that walks between stations can be quite short. This blog post introduces a handy tool to help you work out “Can I walk it?”

You can find the tool here:

http://www.caniwalkit.co.uk/

To use it, start by selecting the station you want to walk from, either by using the “Where am I?” dropdown or by clicking one of the coloured station symbols (or close to it). The map will then refresh: the station you selected is marked by a red disk, the stations within 1.5 miles of the starting station are marked by orange disks and those more than 1.5 miles away are marked by blue disks. 1.5 miles is my “walkable” threshold; it takes me about 25 minutes to walk that far. You can enter your own “walkable” threshold in the “I will walk” box and press refresh, or select a new starting station, to refresh the map.

The station markers will show the station names on mouseover, and the distances to the starting station once it has been selected.

This tool comes with no guarantees: the walking distances are estimated and these estimates may be faulty, particularly for river crossings. Weather conditions may make walking an unpleasant or unwise decision. The tool relies on the user to supply their own reasonable walking threshold. Your mileage may vary.

To give a little background to this project: I originally made this tool using Tableau. It was OK, but it was tied to the Tableau Public platform and I felt it was a little slow and unresponsive. It followed some work I’d done visualising data relating to the London Underground which you can read about here.

As an exercise I thought I’d try to make a “Can I walk it?” web application, re-writing the original visualisation in JavaScript and Python. I’ve been involved with projects like this at ScraperWiki but never done the whole thing for myself. I used the leaflet.js library to provide the mapping, the Flask library in Python to serve the data, Bootstrap to make it look okay, and Docker containers on Digital Ocean to deploy the application.

The underlying data for this tool come from OpenStreetMap, where the locations of all the London Underground stations are encoded as latitude and longitude. With this information in hand it is possible to calculate the distances between stations. Really I want the “walking distance” between stations rather than the crow-flies distance, which is what this data gives me. Ideally, to get the walking distance I’d use the Google Directions API, but unfortunately this has a rate limit of 2500 calls per day and I need to make about 36,000 calls to get all the data I need!
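For reference, the crow-flies distance between two stations can be calculated from their latitudes and longitudes with the haversine formula. This is a sketch of the idea rather than the exact code in the repository; the station coordinates are approximate:

# Crow-flies distance between two points given latitude/longitude,
# using the haversine formula. Coordinates below are approximate.
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    earth_radius_miles = 3959.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * earth_radius_miles * math.asin(math.sqrt(a))

# Oxford Circus to Piccadilly Circus - about half a mile as the crow flies
print(haversine_miles(51.5152, -0.1410, 51.5098, -0.1342))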

The code is open source and available in this BitBucket repository:

https://bitbucket.org/ian_hopkinson/london-underground-app

Comments and feedback are welcome!

Adventures in Kaggle: Forest Cover Type Prediction


forest_cover_thumb
This post was first published at ScraperWiki.

Regular readers of this blog will know I’ve read quite a few machine learning books; now it's time to put this learning into action. We’ve done some machine learning for clients, but I thought it would be good to do something I could share. The Forest Cover Type Prediction challenge on Kaggle seemed to fit the bill. Kaggle is the self-styled home of data science; it hosts a variety of machine learning oriented competitions ranging from introductory, knowledge-building ones (such as this one) to commercial ones with cash prizes for the winners.

In the Forest Cover Type Prediction challenge we are asked to predict the type of tree found on 30x30m squares of the Roosevelt National Forest in northern Colorado. The features we are given include the altitude at which the land is found, its aspect (direction it faces), various distances to features like roads, rivers and fire ignition points, soil types and so forth. We are provided with a training set of around 15,000 entries where the tree types are given (Aspen, Cottonwood, Douglas Fir and so forth) for each 30x30m square, and a test set for which we are to predict the tree type given the “features”. This test set runs to around 500,000 entries. This is a straightforward supervised machine learning “classification” problem.

The first step must be to poke about at the data, and I did a lot of this in Tableau. The feature most obviously providing predictive power is the elevation, or altitude, of the area of interest. This is shown in the figure below for the training set: we see Ponderosa Pine and Cottonwood predominating at lower altitudes, transitioning to Aspen, Spruce/Fir and finally Krummholz at the highest altitudes. Reading Wikipedia we discover that Krummholz is not actually a species of tree, rather something that happens to trees of several species in the cold, windswept conditions found at high altitude.

Figure1

Data inspection over, I used the scikit-learn library in Python to predict tree type from the features. scikit-learn makes it ridiculously easy to jump between classifier types; the interface for each classifier is the same, so once you have one running, swapping in another classifier is a matter of a couple of lines of code. I tried out a couple of variants of Support Vector Machines, decision trees, k-nearest neighbours, AdaBoost and the extremely randomised trees ensemble classifier (ExtraTrees). This last was best at classifying the training set.
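To give a flavour of how little code is involved, here is a minimal sketch of the scikit-learn workflow using ExtraTrees. The file and column names follow the Kaggle data files, but the feature handling is deliberately naive and this is not my competition code:

# Minimal scikit-learn workflow for the Forest Cover Type data. Swapping
# ExtraTreesClassifier for another classifier is a one-line change.
# File and column names follow the Kaggle download; no feature engineering here.
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")
features = train.drop(columns=["Id", "Cover_Type"])
target = train["Cover_Type"]

classifier = ExtraTreesClassifier(n_estimators=300, random_state=42)
print(cross_val_score(classifier, features, target, cv=5).mean())

classifier.fit(features, target)
test = pd.read_csv("test.csv")
predictions = classifier.predict(test.drop(columns=["Id"]))
pd.DataFrame({"Id": test["Id"], "Cover_Type": predictions}).to_csv(
    "submission.csv", index=False)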

The challenge is in mangling the data into the right shape and selecting the features to use; this is the sort of pragmatic knowledge learnt by experience rather than book-learning. As a long-time data analyst I took the opportunity to try something: essentially my analysis programs would only run when the code had been committed to git source control, and the SHA of the commit, its unique identifier, was stored with the analysis. This means that I can return to any analysis output and recreate it from scratch. Perhaps unexceptional for those with a strong software development background, but a small novelty for a scientist.
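A minimal version of that commit-stamping idea calls out to git with subprocess – a sketch of the approach rather than the exact code I used:

# Record the git commit SHA alongside an analysis output so any result can be
# traced back to the exact code that produced it. Refuses to run if there are
# uncommitted changes. Sketch of the idea, not the exact code used.
import subprocess

def current_commit():
    sha = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    dirty = subprocess.call(["git", "diff", "--quiet"]) != 0
    return sha, dirty

sha, dirty = current_commit()
if dirty:
    raise RuntimeError("Uncommitted changes - commit before running the analysis")
with open("analysis_output.txt", "w") as f:
    f.write("commit: {}\n".format(sha))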

Using a portion of the training set to do an evaluation, it looked like I was going to do really well on the Kaggle leaderboard, but on first uploading my competition solution things looked terrible! It turns out this was a common experience and is a result of the relative composition of the training and test sets. Put crudely, the test set is biased to higher altitudes than the training set, so using a classifier which has been trained on the unmodified training set leads to poorer results than expected based on measurements on a held back part of the training set. You can see the distribution of elevation in the test set below, and compare it with the training set above.

figure2

We can fix this problem by biasing the training set to more closely resemble the test set; I did this on the basis of the elevation. This eventually got me to rank 430 on the leaderboard, shown in the figure below. We can see here that I’m somewhere up the long shallow plateau of performance. There is a breakaway group of about 30 participants doing much better, and at the bottom there are people who perhaps made large errors in analysis but got rescued by the robustness of machine learning algorithms (I speak from experience here!).

figure3
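The resampling mentioned above can be as simple as weighting each training row by the ratio of test-set to training-set frequency in its elevation bin, then drawing a new training set with those weights. A rough sketch of the idea – my actual approach may have differed in detail, and the bin edges here are illustrative:

# Bias the training set towards the test set's elevation distribution:
# weight each training row by the ratio of test to training frequency in its
# elevation bin, then resample with those weights. Bin edges are illustrative.
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

bins = np.linspace(1800, 4000, 23)  # roughly 100m elevation bins
train_bins = np.digitize(train["Elevation"], bins)
test_bins = np.digitize(test["Elevation"], bins)

train_freq = pd.Series(train_bins).value_counts(normalize=True)
test_freq = pd.Series(test_bins).value_counts(normalize=True)
weights = pd.Series(train_bins).map(test_freq / train_freq).fillna(0)

resampled = train.sample(n=len(train), replace=True,
                         weights=weights.values, random_state=42)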

There is no doubt some mileage in tuning the parameters of the different classifiers and no doubt winning entries use more sophisticated approaches. scikit-learn does pretty well out of the box, and tuning it provides marginal improvement. We observed this in our earlier machine learning work too.

I have mixed feelings about the Kaggle competitions. The data is nicely laid out, the problems are interesting and it’s always fun to compete. They are a great way to dip your toes in semi-practical machine learning applications. The size of the awards means it doesn’t make much sense to take part on a commercial basis.

However, the data are presented so as to exclude the use of domain knowledge; they are set up very much as machine learning challenges – look down the competitions and see how many of them feature obfuscated data, likely for reasons of commercial confidence or to make a problem more about “machine learning” and less amenable to domain knowledge. To a physicist this is just a bit offensive.

If you are interested in a slightly untidy blow by blow account of my coding then it is available here in a Bitbucket Repo.

Git – notes

logo@2x

I’ve discovered that my blog is actually a good place to put things I need to remember – see, for example, my blog post on running Ubuntu in a VM on Windows 8.

In this spirit here are my notes on using git, the distributed version control system (DVCS). These are things I picked up around the office at ScraperWiki – I wrote something there about the scheme we use for Git. This is more a compendium of useful git commands.

I use Git on both Windows and Ubuntu and I have accounts with both GitHub and Bitbucket. I’ve configured ssh on my Windows and Ubuntu machines and use that for authentication. On Windows I interact with Git using Git Bash.

Installation

On installing Git I do the following setup, obviously using my own name and email:

git config --global user.name "John Doe"
git config --global user.email [email protected]
git config --global core.editor vim

I can list my config settings using:

git config -l

Starting a repo

To start a new repo we do:

git init

These days I feel bereft if I’m not “pushing” my local repository to an online repository like GitHub or BitBucket. To add a remote repository, create one using the service of your choice, which will probably ask you to do:

git remote add origin [url]

Alternatively you can clone an existing repository into a subdirectory of your current directory with the name of the repo:

git clone [url]

This one clones into the current directory, making a mess if that’s not what you intended!

git clone [url] .

A variant, if you are using a repo with submodules in it:

git clone --recursive [url]

If you forgot to do the above on first cloning then you can do:

git submodule update --init

Adding and committing files

If you’ve started a new repository then you need to add some files to track:

git add [filename]

You don’t have to commit all the changes you made since the last commit; you can select them using the -p option:

git add -p

And commit them to the repository with a commit command like:

git commit -m [message]

Alternatively you can write the commit message in your favoured editor, with the difference from the previous commit shown below the message:

git commit -a -v

I tend to use a remote repository as a backup, so I regularly do:

git push origin HEAD

If someone else is working on the same repository as you then things get more complicated, but that’s outside the scope of this post.

Undoing things

If you get your commit message wrong you can edit it with:

git commit --amend

If you change your mind about staging a file for commit:

git reset HEAD [filename]

If you change your mind about the modifications you have made to a file since the last commit then you can revert to the last commit using this **destructive** command:

git checkout -- [filename]

You should be careful doing that since it will obliterate any changes you’ve made to a file, even if you saved them from the editor.

Working out where you are

You can list files in the repo with:

git ls-tree --full-tree -r HEAD

The general command for seeing what is going on is:

git status

This tells you if you have made edits which have not been staged, which branch you are on, and which files are not being tracked. Whilst you are working you can see the difference from the previous commit using:

git diff

If you’ve already added files to commit then you need to do:

git diff --cached

You can see a list of all your changes using:

git log

This variant gives you more information in a more compact form:

git log --oneline --graph --decorate

This is a good way of seeing the status of your branch and the other branches in the repository. I have aliased this set of log options as:

git lg

To do this I added the following to my ~/.gitconfig file:

[alias]
  
        lg = log --oneline --graph --decorate

Once you’ve committed a bunch of changes you might want to push them to a remote server. This pushes to the remote called origin, and HEAD ensures you push to your current branch. HEAD is Git’s shorthand for the latest commit on the current branch:

git push origin HEAD

Branches

The preceding commands are how you’d work using a single master branch, if you were working alone on something simple, for example. If you are working with other people or on something more complicated then you probably want to work on a branch; you can make a new branch by doing:

git checkout -b [branch name]

You can find out what other branches are available by doing:

git branch -v -a

Once you are on a branch you can commit changes, and push them onto your remote server, just as if you were on the master branch.

Merging and rebasing

The excitement comes when you want to merge your changes onto the master branch, or you want to get changes made by someone else and pushed to the remote repository onto your own branch. The quick and dirty way to do this is using:

git pull

This does a fetch and merge all at the same time. The better way is to fetch the changes and then merge them:

git fetch --prune --all
git merge origin/master

If you are working with someone else then you may prefer to merge changes onto the master branch by making a pull request on GitHub or BitBucket.

Accepting Pull Requests from Forks

If someone makes a Pull Request based on their forked copy of a repo then you can download it for testing by doing:

git fetch origin pull/ID/head:BRANCHNAME