Dec 31 2013

The UK’s BIG Lottery Fund recently released its grant data since 2004 as a set of lovely CSV files: You can get it yourself here or here. I found it a great opportunity to try out some new tricks with Tableau, and have a bit of a poke around another largish dataset from government. The data runs to a little under 120,000 lines.

The first question to ask is: where is all the money going?

The total awarded is £5,277,058,180 over nearly 10 years. It’s going to 81,386 different organisations. The sizes of grants vary enormously; the biggest, £214,340,846, going to the Big Local Trust, which is an umbrella organisation. Other big recipients include the Royal Society of Wildlife Trusts, who received £59,842,400 for the Local Food programme. The top 10 grants are listed below:

01/03/2012, Big Local Trust	£ 214,340,846
15/08/2007, Royal Society of Wildlife Trusts	£ 59,842,400
04/10/2007, The Federation of Groundwork Trusts	£ 58,306,400
13/05/2008, Sustrans Limited	£ 49,980,908
11/10/2012, Life Changes (Trustee) Limited	£ 49,338,186
13/12/2011, Forces In Mind Trustee Limited	£ 34,808,423
19/10/2007, Natural England	£ 30,113,200
01/05/2007, Legacy Trust UK Limited	£ 28,850,000
31/07/2007, Sustrans Limited	£ 25,023,084
09/04/2008, Falkirk Council	£ 25,000,000

Awards like this make determining the true geographic distribution of grants a bit tricky, since they are registered as being awarded to a particular local area – apparently the head office of the applicant – but they are used nationally. There is a regional breakdown of where the money is spent but this classification is to large areas i.e. “England” or “North West”. The Big Local Trust, Life Changes and Forces in Mind are all very recently established – less than a couple of years old. The Legacy Trust was established in 2007 to fund programmes to promote an Olympic legacy.

These are really big grants, but what does the overall distribution of awards look like?

This is shown in the chart below:

It’s a bit complicated because the spread of award sizes is from about £1000 to over £100,000,000 so what I’ve done is taken the logarithm of the award to create the bins. This means that the column marked “3” contains the sum of all awards from £1000 to £9999 and that marked “4” contains the sum of all awards from £10,000 to £99,999. The chart shows that most money is distributed in the column marked “5”, i.e. £100,000 to £999,999. The columns are coloured by the year in which money was awarded, so we can see that there were large grants awarded in 2007 as well as 2011 and 2013.

Everybody loves a word cloud, even though we know it’s not good in terms of data visualisation, a simple bar chart shows the relative frequency of words more clearly. The word cloud below shows the frequency of words appearing in the applicant name field of the data, lots of money going to Communities, Schools, Clubs and councils.

The data also include the founding date for the organisations to which money is awarded, most of them were founded since the beginning of the 20th century. There are quite a few schools and local councils in the list and, particularly for councils we can see the effect of legislation on the foundation of these organisations, there are big peaks in founding dates for councils in 1894 and in 1972-1974, coinciding with a couple of local government acts. There’s a dip in the foundation of bodies funded by the BIG lottery for both the First and Second World Wars, I guess people’s energies were directed elsewhere. The National Lottery started in the UK in late 1994.

As a final piece of analysis I thought I’d look at sport; I’m not particularly interested in sport so I let natural language processing find sports for me in applicant names – they are often of the form “Somewhere Cricket/Rugby/Tennis/etc Club”. One way of picking out all the sports awards would be to come up with a list of sports names and compare against that list but I applied a little more cunning: the nltk library will tell you how closely related two words are using the WordNet lexicon which it contains. So I identified sports by measuring how closely related a target word was from the word “sport”. This got off to a shaky start since I decided to use “cricket” as a test word; “cricket” is as closely related to “sport” as “hamster” – a puzzling result until I realised that the first definition of “cricket” in WordNet relates to the insect! This confusion dispensed with finding all the sports mentioned in the applicant names was an easy task. The list of sports I ended up with was unexceptional.

You can find participation levels in various sports here, I plotted them together with numbers of awards. Sports near the top left have relatively few awards given the number of participants, whilst those bottom right have more awards than would be expected from the number of participants.

You can see interactive versions of these plots, plus a view more here on Tableau Public.

That’s what I found in the data – what would interest you?

Footnotes

I uploaded the CSV files to a MySQL database before loading into Tableau, I also did a bit of work in Python using the pandas library. In addition to the BIG lottery data I pulled in census data from the ONS and geographic boundary data from Tableau Mapping. You can see all this unfolding on the bitbucket repo I set up to store the analysis. Since Tableau workbook files are XML format they can be usefully stored in source control.

data science, data visualisation, scraperwiki

Dec 29 2013

Review of the year: 2013

By SomeBeans in Miscellaneous

My blogging is much reduced this year, at least on my own blog. This is a result of my new job with ScraperWiki and child care, Thomas is now nearly two years old.

I started the year with a couple of posts on my shiny new laptop; working for a startup I’ve escaped from the corporate Dell. One post was on the beast itself – a Sony VAIO, and Windows 8 – Microsoft’s somewhat confusing new operating system offering. The other post was on running Ubuntu on the VAIO. In the past this was a case of setting up dual boot but various innovations make this difficult and there is, in my view, a better solution: a virtual machine.

There wasn’t much ranting this year: I only managed one little one about higher education, and the reluctance amongst lecturers to take any teaching qualifications. The only other marginally opinion piece was on electronic books, where I muttered about DRM limiting the functionality of ebooks.

I managed to read a few books which ended up on my own blog: The Eighth Day of Creation, about the unravelling of the genetic code was a dense, heroic read. The Dinosaur Hunters was light and fluffy. Empire of the Clouds and The Backroom Boys were largely wistful rememberings of Britain’s former greatness in jet aeroplanes and in technology more generally. Chasing Venus and a History of the World in 12 Maps returned to the themes of geodesy and mapping which I’ve explored in the past. Finally, a bit of London history with The Subterranean Railway and Lucy Inglis’ Georgian London. I’ve been following Lucy on twitter since Georgian London was a twinkle in her eye. It’s difficult to choose a favourite amongst these, it’s either History of the World in 12 Maps or Georgian London.

Over on ScraperWiki’s blog I’ve been knocking out blog posts at a great rate, you can see them all here. I did a good deal of book reviewing over there too, my commute into work on the train means I get an hour or so of reading every day – which quickly adds up to a lot of reading! I read about machine learning, data visualisation (this and this), Tableau (this and this), natural language processing, R, Javascript and software engineering. I’m currently ploughing my way through Data Mining: Practical Machine Learning Tools and Techniques. I think my favourite of these was Natural Language Processing with Python. I’m beginning to see the value of the more expensive, better established publishing houses in terms of book quality.

Alongside this I did a few blog posts on new tools for my trade. I’ve long programmed to do scientific analysis but ScraperWiki is a company which sells software, and the discipline of writing software for others to use is different from writing software for yourself, particularly important are testing and source control.

I spoke at a couple of events: Data Science London, and Strata London where I gave an Ignite talk. Ignite talks follow a special format, they are five minutes long and you get 20 slides which advance automatically at the rate of one every 15 seconds – a somewhat frantic experience. My talk is captured on video.

I also did some bits of data analysis; #InspiringWomen was a look at a response to the online bullying and abuse of women. A place in the country was about data on house prices which we had collected for a campaign by Shelter.

Back on my own blog I managed to do a couple of photographic posts, one on Liverpool. The rail loop under Liverpool was closed which meant I had to walk across town to work, and I suddenly realised that Liverpool is rather spectacular architecturally. This led me on to getting the Pevsner Guide to Liverpool. The ScraperWiki office might be a bit unusual in that a quarter of the company owns this book! I also went on a business trip to Trento, which turns out to be a very attractive city, unfortunately I only had my phone with me to take photos.

The last year has highlighted to me what a privilege it was to have so much time to spend on my blogging, photography and garden shed fiddling in the past. It’s what got me my new job but for many, equally able, people this investment of time simply isn’t possible with the other responsibilities they have. Something to consider the next time you’re recruiting, and so highly rating that extra-curricular activity.

Also I realise I have a great deal of theoretical knowledge about a whole pile of technologies but I have spent rather less time on actually doing anything with them, so maybe this coming year there’ll be less reading and more coding on the train.

Happy New Year to you all!

review

Nov 01 2013

Book review: Tableau 8 – the official guide by George Peck

By SomeBeans in Book Reviews

tableau 8 guide This review was first published at ScraperWiki.

A while back I reviewed Larry Keller’s book The Tableau 8.0 Training Manual, at the same time I ordered George Peck’s book Tableau 8: the official guide. It’s just arrived. The book comes with a DVD containing bonus videos featuring George Peck’s warm, friendly tones and example workbooks. I must admit to being mildly nonplussed at receiving optical media, my ultrabook lacking an appropriate drive, but I dug out the USB optical drive to load them up. Providing an online link would have allowed the inclusion of up to date material, perhaps covering the version 8.1 announcement.

Tableau is a data visualisation application, aimed at the business intelligence area and optimised to look at database shaped data. I’m using Tableau on a lot of the larger datasets we get at ScraperWiki for sense checking and analysis.

Colleagues have noted that analysis in Tableau looks like me randomly poking buttons in the interface. From Peck’s book I learn that the order in which I carry out random clicking is important since Tableau will make a decision on what you want to see based both on what you have clicked and also its current state.

To my mind the heavy reliance on the graphical interface is one of the drawbacks of Tableau, but clearly, to business intelligence users and journalists, it’s the program’s greatest benefit. It’s a drawback because capturing what you’ve done in a GUI is tricky. Some of the scripting/version control capability is retained since most Tableau files are in plain XML format with which a little fiddling is tacitly approved by Tableau – although you won’t find such info in The Official Guide. I’ve been experimenting with using git source control on workbook files, and it works.

If you’re interested in these more advanced techniques then the Tableau Knowledgebase is worth a look. See this article, for example, on making a custom colour palette. I also like the Information Lab blog, 5 things I wish I knew about Tableau when I started and UK Area Polygon Mapping in Tableau. The second post covers one of the bug-bears for non-US users of Tableau: the mapping functionality is quite US-centric.

Peck covers most of the functionality of Tableau, including data connections, making visualisations, a detailed look at mapping, dashboards and so forth. I was somewhat bemused to see the scatter plot described as “esoteric”. This highlights the background of those typically using Tableau: business people not physical scientists, and not necessarily business people who understand database query languages. Hence the heavy reliance on a graphical user interface.

I particularly liked the chapters on data connections which also described the various set, group and combine operations. Finally I understand the difference between data blending and data joining: joining is done at source between tables on the same database whilst blending is done on data from different sources by Tableau, after it has been loaded. The end result is not really different.

I now understand the point of table calculations – they’re for the times when you can’t work out your SQL query. Peck uses different language from Tableau in describing table calculations. He uses “direction” to refer to the order in which cells are processed and “scope” to refer to the groups over which cell calculations are performed. Tableau uses the terms “addressing” and “partitioning” for these two concepts, respectively.

Peck isn’t very explicit about the deep connections between SQL and Tableau but makes sufficient mention of the underlying processes to be useful.

It was nice to see a brief, clear description of the options for publishing Tableau workbooks. Public is handy and free if you want to publish to all. Tableau Online presents a useful halfway house for internal publication whilst Tableau Server gives full flexibility in scheduling updates to data and publishing to a range of audiences with different permission levels. This is something we’re interested in at ScraperWiki.

The book ends with an Appendix of functions available for field calculations.

In some ways Larry Keller and George Peck’s books complement each other, Larry’s book (which I reviewed here) contains the examples that George’s lacks and George’s some of the more in depth discussion missing from Larry’s book.

Overall: a nicely produced book with high production values, good but not encyclopedic coverage.

data science, data visualisation, scraperwiki

Oct 29 2013

Git!

By SomeBeans in Technology

This post was first published at ScraperWiki.

As software company, use of some sort of software source control system is inevitable, indeed our CEO wrote TortoiseCVS – a file system overlay for the early CVS source control system. For those uninitiated in the joys of software engineering: source control is a system for recording the history of file revisions allowing programmers to edit their code, safe in the knowledge that they can always revert to a previous good state of code if it all goes horribly wrong. We use Git for source control, hosted either on Github or on Bitbucket. The differing needs of our platform and data services teams fit the payment plans of the two different sites.

Git is a distributed source control system created by Linus Torvalds, to support the development of Linux. Git is an incredibly flexible system which allows you to do pretty much anything. But what should you do? What should be your strategy for collective code development? It’s easy to look up a particular command to do a particular thing, but less is written on how you should string your git commands together. Here we hope to address this lack.

We use the “No Switch Yard” methodology, this involves creating branches from the master branch on which to develop new features and regularly rebasing against the master branch so that when the time comes the feature branch can be merged into the master branch via a pull request with little fuss. We should not be producing a byzantine system by branching feature branches from other feature branches. The aim of “No Switch Yard” is to make the history as simple as possible and make merging branches back onto master as easy as possible.

How do I start?

Assuming that you already have some code in a repository, create a local clone of that repository:

git clone git@github.com:scraperwiki/myproject.git

Create a branch:

git checkout -b my-new-stuff

Start coding…adding files and committing changes as you go:

git add -u

git commit -m "everything is great"

The -u switch to git add simply checks in all the tracked, uncommitted files. Depending on your levels of paranoia you can push your branch back to the remote repository:

git push

How do I understand what’s going on?

For me the key revelation for workflow was to be able to find out my current state and feel pleasure when it was good! To do this, fetch any changes that may have been made on your repository:

git fetch

and then run:

git log --oneline --graph --decorate --all

To see an ASCII art history diagram for your repository. What you are looking for here is a relatively simple branching structure without too many parallel tracks and with the tips of each branch lined up between your local and the remote copy.
You can make an alias to simplify this inspection:

git config --global alias.lg 'log --oneline --graph --decorate'

Then you can just do:

git lg --all

I know someone else has pushed to the master branch from which I branched – what should I do?

If stuff is going on on your master branch, perhaps because your changes are taking a while to complete, you should rebase. You should also do this just before submitting a pull request to merge your work with the master branch.

git rebase -i

Allows you to rebase interactively, this means you can combine multiple commits into a single larger commit. You might want to do this if you made lots of little commits whilst achieving a single goal. Rebasing brings you up to date with another branch, without actually merging your changes into that branch.

I’m done, how do I give my colleagues the opportunity to work on my great new features?

You need to rebase against the remote branch onto which you wish to merge your code and then submit a pull request for your changes. You can submit a pull request from the web interface at Github or Bitbucket. Or you can use a command line tool such as hub. The idea of using a pull request is that it makes your changes visible to your colleagues, and keeps a clear record of those changes. If you’ve been rebasing regularly you should be able to merge your code automatically.

An important principle here is “ownership”, in social terms you own your local branch on which you are developing a feature, so you can do what you like with it. The master branch from which you started work is in collective ownership so you should only merge changes onto it with the permission of your colleagues and ideally you want others to look at your changes and approve the pull themselves.

I started doing some fiddling around with my code and now I realise it’s serious and I want to put it on a branch, what did I do?

You need to stash your code, using:

git stash

Then create a branch, as described above, and then retrieve the contents of the stash:

git stash pop

That’s how we use git – what do you do?

data science, git, scraperwiki

Oct 23 2013

A place in the country

By SomeBeans in Technology

This post was first published at ScraperWiki.

Recently Shelter came to us asking for data on house prices across the UK to help them with some research in support of campaign on housing affordability.

This is a challenge we’re well suited to address, in fact a large fraction of the ScraperWiki team have scraped property price data for our own purposes. Usually though we just scrape a local area, using the Zoopla API, but Shelter wanted the whole country. It would be possible to do the whole country by this route but rate-limiting would mean it took a few days. So we spoke nicely to Zoopla who generously lifted the rate-limiting for us, they were also very helpfully in responding to our questions about their API.

The raw data for this job amounted to 2 gigabytes, 34 pieces of information for each of 500,000 properties for sale in the UK in August 2013. The data tell us about the location, the sale price, the property details, the estate agent details and the price history of each property.

As usual in these situations we fired up Tableau to get a look at the data, Tableau is well-suited to this type of database-table shaped data and is responsive for this number of lines of data.

What sort of properties are we looking at here?

We can find out this information from the “property type” field, shown in the chart below which counts the number of properties for sale in each property type category. The most common category is “Detached”, followed by “Flat”.

We can also look at the number of bedrooms. Unsurprisingly the number of bedrooms peaks at about 3 but with significant numbers of properties with 4, 5 and 6 bedrooms. Beyond that there are various guest houses, investment properties, parcels of land for sale with nominal numbers of bedrooms culminating in a 150 bedroom “property” which actually sounds like a village.

What about prices?

This is where things get really interesting. Below is a chart of the number of properties for sale in each price £25k price “bin”, for example the bin marked 475k contains all of the houses priced between £475k and £499,950 – the next bin being labelled 500k containing houses priced from £500k to £525k. We can see that the plot here is jagged, the numbers of properties for sale in each bin does not vary smoothly as the price increases, it jumps up and down. In fact this effect is quite regular, for houses priced over £500k there are fewest for sale at the round numbers £500k, £600k etc most for sale at £575k, £675k and so forth.

But this doesn’t just effect the super-wealthy – if we zoom into the lower priced region, making our price bins only £1k there is a similar effect with prices ending 4,5 and 9,0 more frequent than those ending 1, 2, 3 or 6, 7, 8. This is all the psychology of pricing.

Distribution of prices around the country?

We can get a biased view of the population distribution simply by plotting all the property for sale locations. ‘Biased’ because, at the very least, varying economic conditions around the country will bias the number of properties for sale.

This looks about right, there are voids in the areas of the country which are sparsely populated such as Scotland, Wales, the Peak District and the Lake District.

Finally, we can look at how prices vary around the country – the map below shows the average house price in a region defined by the “outcode” – the first group of letters in a UK postcode. The colour of the points indicates the average price – darkest blue for the lowest average price (£40k) and darkest red for the highest average price (£500k). The size of the dots shows how many properties are for sale in that area.

I’m grateful to be living in the relatively inexpensive North West of England!

There’s plenty more things to look at in this data, for example – the frequency of street names around the UK and the words used by estate agents to describe properties but that is for another day.

That’s what we found – what would you do?

scraperwiki

I've worked as a scientist for the last 30 years, at various universities, a large home and personal care company, a startup in Liverpool called The Sensible Code Company (formerly ScraperWiki Ltd), GBG and now as a consultant in data science.

I write about:
* the books I have read, typically science and history (or both), partly as a reminder to myself and partly as a review;
* science, things I have done or things I find interesting;
* technology, programming and gadgets;
politics, and current affairs;
* ...and other stuff as it takes my fancy - holidays, photographs and things I want to remember.

The BIG Lottery Data

Review of the year: 2013

Book review: Tableau 8 – the official guide by George Peck

Git!

How do I start?

How do I understand what’s going on?

I know someone else has pushed to the master branch from which I branched – what should I do?

I’m done, how do I give my colleagues the opportunity to work on my great new features?

I started doing some fiddling around with my code and now I realise it’s serious and I want to put it on a branch, what did I do?

A place in the country

What sort of properties are we looking at here?

What about prices?

Distribution of prices around the country?

About

Recent Posts

Categories

Blog Archive

Goodreads

Gardening

History

Politics

Science

Writers

The BIG Lottery Data

Review of the year: 2013

Book review: Tableau 8 – the official guide by George Peck

Git!

How do I start?

How do I understand what’s going on?

I know someone else has pushed to the master branch from which I branched – what should I do?

I’m done, how do I give my colleagues the opportunity to work on my great new features?

I started doing some fiddling around with my code and now I realise it’s serious and I want to put it on a branch, what did I do?

A place in the country

What sort of properties are we looking at here?

What about prices?

Distribution of prices around the country?

About

Recent Posts

Tags

Categories

Blog Archive

Goodreads

Gardening

History

Politics

Science

Writers