Tag: data visualisation

Book review: The Information Capital by James Cheshire and Oliver Uberti

Today I review TheInformationCapitalThe Information Capital by James Cheshire and Oliver Uberti – a birthday present. This is something of a coffee table book containing a range of visualisations pertaining to data about London. The book has a website where you can see what I’m talking about (here) and many of the visualisations can be found on James Cheshire’s mappinglondon.co.uk website.

This type of book is very much after my own heart, see for example my visualisation of the London Underground. The Information Capital isn’t just pretty, the text is sufficient to tell you what’s going on and find out more.

The book is divided into five broad themes “Where We Are”, “Who We Are”, “Where We Go”, “How We’re Doing” and “What We Like”. Inevitably the majority of the visualisations are variants on a coloured map but that’s no issue to my mind (I like maps!).

Aesthetically I liked the pointillist plots of the trees in Southwark, each tree gets a dot, coloured by species and the collection of points marks out the roads and green spaces of the borough. The twitter map of the city with the dots coloured by the country of origin of the tweeter is in similar style with a great horde evident around the heart of London in Soho.

The visualisations of commuting look like thistledown, white on a dark blue background, and as a bonus you can see all of southern England, not just London. You can see it on the website (here). A Voroni tessellation showing the capital divided up by the area of influence (or at least the distance to) different brands of supermarket is very striking. To the non-scientist this visualisation probably has a Cubist feel to it.

Some of the charts are a bit bewildering, for instance a tree diagram linking wards by the prevalent profession is confusing and the colouring doesn’t help. The mood of Londoners is shown using Chernoff faces, this is based on data from the ONS who have been asking questions on life satisfaction, purpose, happiness and anxiety since 2011. On first glance this chart is difficult to read but the legend clarifies for us to discover that people are stressed, anxious and unhappy in Islington but perky in Bromley. You can see this visualisation on the web site of the book (here).

The London Guilds as app icons is rather nice, there’s not a huge amount of data in the chart but I was intrigued to learn that guilds are still being created, the most recent being the Art Scholars created in February 2014. Similarly the protected views of London chart is simply a collection of water-colour vistas.

I have mixed feelings about London, it is packed with interesting things and has a long and rich history. There are even islands of tranquillity, I enjoyed glorious breakfasts on the terrace of Somerset House last summer and lunches in Lincoln’s Inn Fields.  But I’ve no desire to live there. London sucks everything in from the rest of the country, government sits there and siting civic projects outside London seems a great and special effort for them. There is an assumption that you will come to London to serve. The inhabitants seem to live miserable lives with overpriced property and hideous commutes, these things are reflected in some of the visualisations in this book. My second London Underground visualisation measured the walking time between Tube station stops, mainly to help me avoid that hellish place at rush hour. There is a version of such a map in The Information Capital.

For those living outside London, The Information Capital is something we can think about implementing in our own area. For some charts this is quite feasible based, as they are, on government data which covers the nation such as the census or GP prescribing data. Visualisations based on social media are likely also doable although will lack weight of numbers. The visualisations harking back to classics such as John Snow’s cholera map or Charles Booth’s poverty maps of are more difficult since there is no comparison to be made in other parts of the country. And other regions of the UK don’t have Boris Bikes (or Boris, for that matter) or the Millennium Wheel.

It’s completely unsurprising to see Tufte credited in the end papers of The Information Capital. There are also some good references there for the history of London, places to get data and data visualisation.

I loved this book, its full of interesting and creative visualisations, an inspiration!

Inordinately fond of beetles… reloaded!


This post was first published at ScraperWiki.

Some time ago, in the era before I joined ScraperWiki I had a play with the Science Museums object catalogue. You can see my previous blog post here. It was at a time when I was relatively inexperienced with the Python programming language and had no access to Tableau, the visualisation software. It’s a piece of work I like to talk about when meeting customers since it’s interesting and I don’t need to worry about commercial confidentiality.

The title comes from a quote by J.B.S. Haldane, who was asked what his studies in biology had told him about the Creator. His response was that, if He existed then he was “inordinately fond of beetles”.

The Science Museum catalogue comprises three CSV files containing information on objects, media and events. I’m going to focus on the object catalogue since it’s the biggest one by a large margin – 255,000 objects in a 137MB file. Each object has an ID number which often encodes the year in which the object was added to the collection; a title, some description, it often has an “item name” which is a description of the type of object, there is sometimes information on the date made, the maker, measurements and whether it represents part or all of an object. Finally, the objects are labelled according to which collection they come from and which broad group in that collection, the catalogue contains objects from the Science Museum, Nation Railway Museum and National Media Museum collections.

The problem with most of these fields is that they don’t appear to come from a controlled vocabulary.

Dusting off my 3 year old code I was pleased to discover that the SQL I had written to upload the CSV files into a database worked almost first time, bar a little character encoding. The Python code I’d used to clean the data, do some geocoding, analysis and visualisation was not in such a happy state. Or rather, having looked at it I was not in such a happy state. I appeared to have paid no attention to PEP-8, the Python style guide, no source control, no testing and I was clearly confused as to how to save a dictionary (I pickled it).

In the first iteration I eyeballed the data as a table and identified a whole bunch of stuff I thought I needed to tidy up. This time around I loaded everything into Tableau and visualised everything I could – typically as bar charts. This revealed that my previous clean up efforts were probably not necessary since the things I was tidying impacted a relatively small number of items. I needed to repeat the geocoding I had done. I used geocoding to clean up the place of manufacture field, which was encoded inconsistently. Using the Google API via a Python library I could normalise the place names and get their locations as latitude – longitude pairs to plot on a map. I also made sure I had a link back to the original place name description.

The first time around I was excited to discover the Many Eyes implementation of bubble charts, this time I now realise bubble charts are not so useful. As you can see below in these charts showing the number of items in each subgroup. In a sorted bar chart it is very obvious which subgroup is most common and the relative sizes of the subgroup. I’ve coloured the bars by the major collection to which they belong. Red is the Science Museum, Green is the National Rail Museum and Orange is the National Media Museum.


Less discerning members of ScraperWiki still liked the bubble charts.


We can see what’s in all these collections from the item name field. This is where we discover that the Science Museum is inordinately fond of bottles. The most common items in the collection are posters, mainly from the National Rail Museum but after that there are bottles, specimen bottles, specimen jars, shops rounds (also bottles), bottle, drug jars, and albarellos (also bottles). This is no doubt because bottles are typically made of durable materials like glass and ceramics, and they have been ubiquitous in many milieu, and they may contain many and various interesting things.


Finally I plotted the place made for objects in the collection, this works by grouping objects by location and then finding latitude and longitude for those group location. I then plot a disk sized by the number of items originating at that location. I filtered out items whose place made was simply “England” or “London” since these made enormous blobs that dominated the map.




You can see a live version of these visualisation, and more on Tableau Public.

It’s an interesting pattern that my first action on uploading any data like this to Tableau is to do bar chart frequency plots for each column in the data, this could probably be automated.

In summary, the Science Museum is full of bottles and posters, Tableau wins for initial visualisations of a large and complex dataset.

Visualising the London Underground with Tableau

This post was first published at ScraperWiki.

I’ve always thought of the London Underground as a sort of teleportation system. You enter a portal in one place, and with relatively little effort appeared at a portal in another place. Although in Star Trek our heroes entered a special room and stood well-separated on platforms, rather than packing themselves into metal tubes.

I read Christian Wolmar’s book, The Subterranean Railway about the history of the London Underground a while ago. At the time I wished for a visualisation for the growth of the network since the text description was a bit confusing. Fast forward a few months, and I find myself repeatedly in London wondering at the horror of the rush hour underground. How do I avoid being forced into some sort of human compression experiment?

Both of these questions can be answered with a little judicious visualisation!

First up, the history question. It turns out that other obsessives have already made a table containing a list of the opening dates for the London Underground. You can find it here, on wikipedia. These sortable tables are a little tricky to scrape, they can be copy-pasted into Excel but random blank rows appear. And the data used to control the sorting of the columns did confuse our Table Xtract tool, until I fixed it – just to solve my little problem! You can see the number of stations opened in each year in the chart below. It all started in 1863, electric trains were introduced in the very final years of the 19th century – leading to a burst of activity. Then things went quiet after the Second World War, when the car came to dominate transport.


Originally I had this chart coloured by underground line but this is rather misleading since the wikipedia table gives the line a station is currently on rather than the one it was originally built for. For example, Stanmore station opened in 1932 as part of the Metropolitan line, it was transferred to the Bakerloo line in 1939 and then to the Jubilee line in 1979. You can see the years in which lines opened here on wikipedia, where it becomes apparent that the name of an underground line is fluid.

So I have my station opening date data. How about station locations? Well, they too are available thanks to the work of folk at Openstreetmap, you can find that data here. Latitude-longitude coordinates are all very well but really we also need the connectivity, and what about Harry Beck’s iconic “circuit diagram” tube map? It turns out both of these issues can be addressed by digitizing station locations from the modern version of Beck’s map. I have to admit this was a slightly laborious process, I used ImageJ to manually extract coordinates.

I’ve shown the underground map coloured by the age of stations below.

Age map2

Deep reds for the oldest stations, on the Metropolitan and District lines built in the second half of the 19th century. Pale blue for middle aged stations, the Central line heading out to Epping and West Ruislip. And finally the most recent stations on the Jubilee line towards Canary Wharf and North Greenwich are a darker blue.

Next up is traffic, or how many people use the underground. The wikipedia page contains information on usage, in terms of millions of passengers per year in 2012 covering both entries and exits. I’ve shown this data below with traffic shown at individual stations by the thickness of the line.


I rather like a “fat lines” presentation of the number of people using a station, the fatter the line at the station the more people going in and out. Of course some stations have multiple lines so get an unfair advantage. Correcting for this it turns out Canary Wharf is the busiest station on the underground, thankfully it’s built for it. Small above ground beneath it is a massive, cathedral-like space.

More data is available as a result of a Freedom of Information request (here) which gives data broken down by passenger action (boarding or alighting), underground line, direction of travel and time of day – broken down into fairly coarse chunks of the day. I use this data in the chart below to measure the “commuteriness” of each station. To do this I take the ratio of people boarding trains in the 7am-10am time slot with those boarding 4pm-7pm. For locations with lots of commuters, this will be a big number because lots of people get on the train to go to work in the morning but not many get on the train in the evening, that’s when everyone is getting off the train to go home.


By this measure the top five locations for “commuteriness” are:

  1. Pinner
  2. Ruislip Manor
  3. Elm Park
  4. Upminster Bridge
  5. Burnt Oak

It was difficult not to get sidetracked during this project, someone used the Freedom of Information Act to get the depths of all of the underground stations, so obviously I had to include that data too! The deepest underground station is Hampstead, in part because the station itself is at the top of a steep hill.

I’ve made all of this data into a Tableau visualisation which you can play with here. The interactive version shows you details of the stations as your cursor floats over them, allows you to select individual lines and change the data overlaid on the map including the depth and altitude data that.

Messier and messier

Regular readers with a good memory will recall I bought a telescope about 18 months ago. I bemoaned the fact that I bought it in late Spring, since it meant it got dark rather late. I will note here that astronomy is generally incompatible with a small child who might wake you up in the middle of the night, requiring attention and early nights.

Since then I’ve taken pictures of the sun, the moon, Jupiter, Saturn and as a side project I also took wide angle photos of the Milky Way and star trails (telescope not required). Each of these bought their own challenges, and awe. The sun because it’s surprisingly difficult to find the thing in you view finder with the serious filter required to stop you blinding yourself when you do find it. The moon because it’s just beautiful and fills the field of view, rippling through the “seeing” or thermal turbulence of the atmosphere. Jupiter because of it’s Galilean moons, first observed by Galileo in 1610. Saturn because of it’s tiny ears, I saw Saturn on my first night of proper viewing. As the tiny image of Saturn floated across my field of view I was hopping up and down with excitement like a child.

I’ve had a bit of a hiatus in the astrophotography over the past year but I’m ready to get back into it.

My next targets for astrophotography are the Deep Sky Objects (DSOs), these are largish faint things as opposed to planets which are smallish bright things. My accidental wide-angle photos clued me into the possibilities here. I’d been trying to photograph constellations, which turn out to be a bit dull, at the end of the session I put the sensitivity of my camera right up and increased the exposure time and suddenly the Milky Way appeared! Even in rural Wales it was only just visible to the naked eye.

Now I’m keen to explore more of these faint objects. The place to start is the Messier Catolog of objects. This was compiled by Charles Messier and Pierre Méchain in the latter half of the 18th century. You may recognise the name Méchain, he was one of the two French men who surveyed France on the cusp of the Revolution to define a value for the meter. Ken Alder’s book The Measure of All Things, describes their adventures.

Messier and Mechain weren’t interested in the deep sky objects, they were interested in comets and compiled the list in order not to be distracted from their studies by other non-comety objects. The list is comprised of star clusters, nebula and galaxies. I must admit to being a bit dismissive of star clusters. The Messier list is by no means exhaustive, observations were all made in France with a small telescope so there are no objects from the Southern skies. But they are ideal for amateur astronomers in the Northern hemisphere since the high tech, professional telescope of the 18th century is matched by the consumer telescope of the 21st.

I’ve know of the Messier objects since I was a child but I have no intuition as to where they are, how bright and how big they are. So to get me started I found some numbers and made some plots.

The first plot shows where the objects are in the sky. They are labelled, somewhat fitfully with their Messier number and common name. Their locations are shown by declination, how far away from the celestial equator an object is, towards the North Pole and right ascension, how far around it is along a line of celestial latitude. I’ve added the moon to the plot in a fixed position close to the top left. As you can see the majority of the objects are North of the celestial equator. The size of the symbols indicates the relative size of the objects. The moon is shown to the same scale and we can see that a number of the objects are larger than the moon, these are often star clusters but galaxies such as Andromeda – the big purple blob on the right and the Triangulum Galaxy are also bigger than the moon. As is the Orion nebula.


So why aren’t we as familiar with these objects as we are with the moon. The second plot shows how bright the Messier objects are and their size. The horizontal axis shows their apparent size – it’s a linear scale so that an object twice as far from the vertical axis is twice as big. Note that these are apparent sizes, some things appear larger than others because they are closer. The Messier The vertical axis shows the apparent brightness, in astronomy brightness is measured in units of “magnitude” which is a logarithmic scale. This means that although the sun is roughly magnitude –26 and the moon is roughly magnitude –13, the sun is 10,000 times bright than the moon. The Messier objects are all much dimmer than Venus, Jupiter and Mercury and generally dimmer than Saturn.



So the Messier objects are often bigger but dimmer than things I have already photographed. But wait, the moon fills the field of view of my telescope. And not only that my telescope has an aperture of f/10 – a measure of it’s light gathering power. This is actually rather “slow” for a camera lens, my “fastest” lens is f/1.4 which represents a 50 fold larger light gathering power.

For these two reasons I have ordered a new lens for my camera, a Samyang 500mm f/6.3 this is going to give me a bigger field of view than my telescope which has a focal length of 1250mm. And also more light gathering power – my new lens should have more than double the light gathering power!

Watch this space for the results of my new purchase!

The BIG Lottery Data


This post was originally published at ScraperWiki.

The UK’s BIG Lottery Fund recently released its grant data since 2004 as a set of lovely CSV files: You can get it yourself here or here. I found it a great opportunity to try out some new tricks with Tableau, and have a bit of a poke around another largish dataset from government. The data runs to a little under 120,000 lines.

The first question to ask is: where is all the money going?

The total awarded is £5,277,058,180 over nearly 10 years. It’s going to 81,386 different organisations. The sizes of grants vary enormously; the biggest, £214,340,846, going to the Big Local Trust, which is an umbrella organisation. Other big recipients include the Royal Society of Wildlife Trusts, who received £59,842,400 for the Local Food programme. The top 10 grants are listed below:

01/03/2012, Big Local Trust  £        214,340,846
15/08/2007, Royal Society of Wildlife Trusts  £          59,842,400
04/10/2007, The Federation of Groundwork Trusts  £          58,306,400
13/05/2008, Sustrans Limited  £          49,980,908
11/10/2012, Life Changes (Trustee) Limited  £          49,338,186
13/12/2011, Forces In Mind Trustee Limited  £          34,808,423
19/10/2007, Natural England  £          30,113,200
01/05/2007, Legacy Trust UK Limited  £          28,850,000
31/07/2007, Sustrans Limited  £          25,023,084
09/04/2008, Falkirk Council  £          25,000,000

Awards like this make determining the true geographic distribution of grants a bit tricky, since they are registered as being awarded to a particular local area – apparently the head office of the applicant – but they are used nationally. There is a regional breakdown of where the money is spent but this classification is to large areas i.e. “England” or “North West”. The Big Local Trust, Life Changes and Forces in Mind are all very recently established – less than a couple of years old. The Legacy Trust was established in 2007 to fund programmes to promote an Olympic legacy.

These are really big grants, but what does the overall distribution of awards look like?

This is shown in the chart below:

Award distribution

It’s a bit complicated because the spread of award sizes is from about £1000 to over £100,000,000 so what I’ve done is taken the logarithm of the award to create the bins. This means that the column marked “3” contains the sum of all awards from £1000 to £9999 and that marked “4” contains the sum of all awards from £10,000 to £99,999. The chart shows that most money is distributed in the column marked “5”, i.e. £100,000 to £999,999. The columns are coloured by the year in which money was awarded, so we can see that there were large grants awarded in 2007 as well as 2011 and 2013.

Everybody loves a word cloud, even though we know it’s not good in terms of data visualisation, a simple bar chart shows the relative frequency of words more clearly. The word cloud below shows the frequency of words appearing in the applicant name field of the data, lots of money going to Communities, Schools, Clubs and councils.


The data also include the founding date for the organisations to which money is awarded, most of them were founded since the beginning of the 20th century. There are quite a few schools and local councils in the list and, particularly for councils we can see the effect of legislation on the foundation of these organisations, there are big peaks in founding dates for councils in 1894 and in 1972-1974, coinciding with a couple of local government acts. There’s a dip in the foundation of bodies funded by the BIG lottery for both the First and Second World Wars, I guess people’s energies were directed elsewhere. The National Lottery started in the UK in late 1994.

Founding year

As a final piece of analysis I thought I’d look at sport; I’m not particularly interested in sport so I let natural language processing find sports for me in applicant names – they are often of the form “Somewhere Cricket/Rugby/Tennis/etc Club”. One way of picking out all the sports awards would be to come up with a list of sports names and compare against that list but I applied a little more cunning: the nltk library will tell you how closely related two words are using the WordNet lexicon which it contains. So I identified sports by measuring how closely related a target word was from the word “sport”. This got off to a shaky start  since I decided to use “cricket” as a test word; “cricket” is as closely related to “sport” as “hamster” – a puzzling result until I realised that the first definition of “cricket” in WordNet relates to the insect! This confusion dispensed with finding all the sports mentioned in the applicant names was an easy task. The list of sports I ended up with was unexceptional.

You can find participation levels in various sports here, I plotted them together with numbers of awards. Sports near the top left have relatively few awards given the number of participants, whilst those bottom right have more awards than would be expected from the number of participants.


Number of clubs vs number of participants

You can see interactive versions of these plots, plus a view more here on Tableau Public.

That’s what I found in the data – what would interest you?


I uploaded the CSV files to a MySQL database before loading into Tableau, I also did a bit of work in Python using the pandas library. In addition to the BIG lottery data I pulled in census data from the ONS and geographic boundary data from Tableau Mapping. You can see all this unfolding on the bitbucket repo I set up to store the analysis. Since Tableau workbook files are XML format they can be usefully stored in source control.