Category: Technology

Programming, gadgets (reviews thereof) and computers

The BIG Lottery Data

uklogo

This post was originally published at ScraperWiki.

The UK’s BIG Lottery Fund recently released its grant data since 2004 as a set of lovely CSV files: You can get it yourself here or here. I found it a great opportunity to try out some new tricks with Tableau, and have a bit of a poke around another largish dataset from government. The data runs to a little under 120,000 lines.

The first question to ask is: where is all the money going?

The total awarded is £5,277,058,180 over nearly 10 years. It’s going to 81,386 different organisations. The sizes of grants vary enormously; the biggest, £214,340,846, going to the Big Local Trust, which is an umbrella organisation. Other big recipients include the Royal Society of Wildlife Trusts, who received £59,842,400 for the Local Food programme. The top 10 grants are listed below:

01/03/2012, Big Local Trust  £        214,340,846
15/08/2007, Royal Society of Wildlife Trusts  £          59,842,400
04/10/2007, The Federation of Groundwork Trusts  £          58,306,400
13/05/2008, Sustrans Limited  £          49,980,908
11/10/2012, Life Changes (Trustee) Limited  £          49,338,186
13/12/2011, Forces In Mind Trustee Limited  £          34,808,423
19/10/2007, Natural England  £          30,113,200
01/05/2007, Legacy Trust UK Limited  £          28,850,000
31/07/2007, Sustrans Limited  £          25,023,084
09/04/2008, Falkirk Council  £          25,000,000

Awards like this make determining the true geographic distribution of grants a bit tricky, since they are registered as being awarded to a particular local area – apparently the head office of the applicant – but they are used nationally. There is a regional breakdown of where the money is spent but this classification is to large areas i.e. “England” or “North West”. The Big Local Trust, Life Changes and Forces in Mind are all very recently established – less than a couple of years old. The Legacy Trust was established in 2007 to fund programmes to promote an Olympic legacy.

These are really big grants, but what does the overall distribution of awards look like?

This is shown in the chart below:

Award distribution

It’s a bit complicated because the spread of award sizes is from about £1000 to over £100,000,000 so what I’ve done is taken the logarithm of the award to create the bins. This means that the column marked “3” contains the sum of all awards from £1000 to £9999 and that marked “4” contains the sum of all awards from £10,000 to £99,999. The chart shows that most money is distributed in the column marked “5”, i.e. £100,000 to £999,999. The columns are coloured by the year in which money was awarded, so we can see that there were large grants awarded in 2007 as well as 2011 and 2013.

Everybody loves a word cloud, even though we know it’s not good in terms of data visualisation, a simple bar chart shows the relative frequency of words more clearly. The word cloud below shows the frequency of words appearing in the applicant name field of the data, lots of money going to Communities, Schools, Clubs and councils.

image

The data also include the founding date for the organisations to which money is awarded, most of them were founded since the beginning of the 20th century. There are quite a few schools and local councils in the list and, particularly for councils we can see the effect of legislation on the foundation of these organisations, there are big peaks in founding dates for councils in 1894 and in 1972-1974, coinciding with a couple of local government acts. There’s a dip in the foundation of bodies funded by the BIG lottery for both the First and Second World Wars, I guess people’s energies were directed elsewhere. The National Lottery started in the UK in late 1994.

Founding year

As a final piece of analysis I thought I’d look at sport; I’m not particularly interested in sport so I let natural language processing find sports for me in applicant names – they are often of the form “Somewhere Cricket/Rugby/Tennis/etc Club”. One way of picking out all the sports awards would be to come up with a list of sports names and compare against that list but I applied a little more cunning: the nltk library will tell you how closely related two words are using the WordNet lexicon which it contains. So I identified sports by measuring how closely related a target word was from the word “sport”. This got off to a shaky start  since I decided to use “cricket” as a test word; “cricket” is as closely related to “sport” as “hamster” – a puzzling result until I realised that the first definition of “cricket” in WordNet relates to the insect! This confusion dispensed with finding all the sports mentioned in the applicant names was an easy task. The list of sports I ended up with was unexceptional.

You can find participation levels in various sports here, I plotted them together with numbers of awards. Sports near the top left have relatively few awards given the number of participants, whilst those bottom right have more awards than would be expected from the number of participants.

 

Number of clubs vs number of participants

You can see interactive versions of these plots, plus a view more here on Tableau Public.

That’s what I found in the data – what would interest you?

Footnotes

I uploaded the CSV files to a MySQL database before loading into Tableau, I also did a bit of work in Python using the pandas library. In addition to the BIG lottery data I pulled in census data from the ONS and geographic boundary data from Tableau Mapping. You can see all this unfolding on the bitbucket repo I set up to store the analysis. Since Tableau workbook files are XML format they can be usefully stored in source control.

Git!

logo@2x

This post was first published at ScraperWiki.

As software company, use of some sort of software source control system is inevitable, indeed our CEO wrote TortoiseCVS – a file system overlay for the early CVS source control system. For those uninitiated in the joys of software engineering: source control is a system for recording the history of file revisions allowing programmers to edit their code, safe in the knowledge that they can always revert to a previous good state of code if it all goes horribly wrong. We use Git for source control, hosted either on Github or on Bitbucket. The differing needs of our platform and data services teams fit the payment plans of the two different sites.

Git is a distributed source control system created by Linus Torvalds, to support the development of Linux. Git is an incredibly flexible system which allows you to do pretty much anything. But what should you do? What should be your strategy for collective code development? It’s easy to look up a particular command to do a particular thing, but less is written on how you should string your git commands together. Here we hope to address this lack.

We use the “No Switch Yard” methodology, this involves creating branches from the master branch on which to develop new features and regularly rebasing against the master branch so that when the time comes the feature branch can be merged into the master branch via a pull request with little fuss. We should not be producing a byzantine system by branching feature branches from other feature branches. The aim of “No Switch Yard” is to make the history as simple as possible and make merging branches back onto master as easy as possible.

How do I start?

Assuming that you already have some code in a repository, create a local clone of that repository:

git clone [email protected]:scraperwiki/myproject.git

Create a branch:

git checkout -b my-new-stuff

Start coding…adding files and committing changes as you go:

git add -u
git commit -m "everything is great"

The -u switch to git add simply checks in all the tracked, uncommitted files. Depending on your levels of paranoia you can push your branch back to the remote repository:

git push

How do I understand what’s going on?

For me the key revelation for workflow was to be able to find out my current state and feel pleasure when it was good! To do this, fetch any changes that may have been made on your repository:

git fetch

and then run:

git log --oneline --graph --decorate --all

To see an ASCII art history diagram for your repository. What you are looking for here is a relatively simple branching structure without too many parallel tracks and with the tips of each branch lined up between your local and the remote copy.
You can make an alias to simplify this inspection:

git config --global alias.lg 'log --oneline --graph --decorate'

Then you can just do:

git lg --all

I know someone else has pushed to the master branch from which I branched – what should I do?

If stuff is going on on your master branch, perhaps because your changes are taking a while to complete, you should rebase. You should also do this just before submitting a pull request to merge your work with the master branch.

git rebase -i

Allows you to rebase interactively, this means you can combine multiple commits into a single larger commit. You might want to do this if you made lots of little commits whilst achieving a single goal. Rebasing brings you up to date with another branch, without actually merging your changes into that branch.

I’m done, how do I give my colleagues the opportunity to work on my great new features?

You need to rebase against the remote branch onto which you wish to merge your code and then submit a pull request for your changes. You can submit a pull request from the web interface at Github or Bitbucket. Or you can use a command line tool such as hub.  The idea of using a pull request is that it makes your changes visible to your colleagues, and keeps a clear record of those changes. If you’ve been rebasing regularly you should be able to merge your code automatically.

An important principle here is “ownership”, in social terms you own your local branch on which you are developing a feature, so you can do what you like with it. The master branch from which you started work is in collective ownership so you should only merge changes onto it with the permission of your colleagues and ideally you want others to look at your changes and approve the pull themselves.

I started doing some fiddling around with my code and now I realise it’s serious and I want to put it on a branch, what did I do?

You need to stash your code, using:

git stash

Then create a branch, as described above, and then retrieve the contents of the stash:

git stash pop

That’s how we use git – what do you do?

A place in the country

This post was first published at ScraperWiki.

Recently Shelter came to us asking for data on house prices across the UK to help them with some research in support of campaign on housing affordability.

This is a challenge we’re well suited to address, in fact a large fraction of the ScraperWiki team have scraped property price data for our own purposes. Usually though we just scrape a local area, using the Zoopla API, but Shelter wanted the whole country. It would be possible to do the whole country by this route but rate-limiting would mean it took a few days. So we spoke nicely to Zoopla who generously lifted the rate-limiting for us, they were also very helpfully in responding to our questions about their API.

The raw data for this job amounted to 2 gigabytes, 34 pieces of information for each of 500,000 properties for sale in the UK in August 2013. The data tell us about the location, the sale price, the property details, the estate agent details and the price history of each property.

As usual in these situations we fired up Tableau to get a look at the data, Tableau is well-suited to this type of database-table shaped data and is responsive for this number of lines of data.

What sort of properties are we looking at here?

We can find out this information from the “property type” field, shown in the chart below which counts the number of properties for sale in each property type category. The most common category is “Detached”, followed by “Flat”.

Property_Type

We can also look at the number of bedrooms. Unsurprisingly the number of bedrooms peaks at about 3 but with significant numbers of properties with 4, 5 and 6 bedrooms. Beyond that there are various guest houses, investment properties, parcels of land for sale with nominal numbers of bedrooms culminating in a 150 bedroom “property” which actually sounds like a village.

What about prices?

This is where things get really interesting. Below is a chart of the number of properties for sale in each price £25k price “bin”, for example the bin marked 475k contains all of the houses priced between £475k and £499,950 – the next bin being labelled 500k containing houses priced from £500k to £525k. We can see that the plot here is jagged, the numbers of properties for sale in each bin does not vary smoothly as the price increases, it jumps up and down. In fact this effect is quite regular, for houses priced over £500k there are fewest for sale at the round numbers £500k, £600k etc most for sale at £575k, £675k and so forth.

PriceHistogram_25k

But this doesn’t just effect the super-wealthy – if we zoom into the lower priced region, making our price bins only £1k there is a similar effect with prices ending 4,5 and 9,0 more frequent than those ending 1, 2, 3 or 6, 7, 8. This is all the psychology of pricing.

Distribution of prices around the country?

We can get a biased view of the population distribution simply by plotting all the property for sale locations. ‘Biased’ because, at the very least, varying economic conditions around the country will bias the number of properties for sale.

Density_map_crop

This looks about right, there are voids in the areas of the country which are sparsely populated such as Scotland, Wales, the Peak District and the Lake District.

Finally, we can look at how prices vary around the country – the map below shows the average house price in a region defined by the “outcode” – the first group of letters in a UK postcode. The colour of the points indicates the average price – darkest blue for the lowest average price (£40k) and darkest red for the highest average price (£500k). The size of the dots shows how many properties are for sale in that area.

House Prices by UK Outcode

I’m grateful to be living in the relatively inexpensive North West of England!

There’s plenty more things to look at in this data, for example – the frequency of street names around the UK and the words used by estate agents to describe properties but that is for another day.

That’s what we found – what would you do?

Property information powered by Zoopla

Making a ScraperWiki view with R

 

This post was first published at ScraperWiki.

In a recent post I showed how to use the ScraperWiki Twitter Search Tool to capture tweets for analysis. I demonstrated this using a search on the #InspiringWomen hashtag, using Tableau to generate a visualisation.

Here I’m going to show a tool made using the R statistical programming language which can be used to view any Twitter Search dataset. R is very widely used in both academia and industry to carry out statistical analysis. It is open source and has a large community of users who are actively developing new libraries with new functionality.

Although this viewer is a trivial example, it can be used as a template for any other R-based viewer. To break the suspense this is what the output of the tool looks like:

R-view

The tool updates when the underlying data is updated, the Twitter Search tool checks for new tweets on an hourly basis. The tool shows the number of tweets found and a histogram of the times at which they were tweeted. To limit the time taken to generate a view the number of tweets is limited to 40,000. The histogram uses bins of one minute, so the vertical axis shows tweets per minute.

The code can all be found in this BitBucket repository.

The viewer is based on the knitr package for R, which generates reports in specified formats (HTML, PDF etc) from a source template file which contains R commands which are executed to generate content. In this case we use Rhtml, rather than the alternative Markdown, which enables us to specify custom CSS and JavaScript to integrate with the ScraperWiki platform.

ScraperWiki tools live in their own UNIX accounts called “boxes”, the code for the tool lives in a subdirectory, ~/tool, and web content in the ~/http directory is displayed. In this project the http directory contains a short JavaScript file, code.js, which by the magic of jQuery and some messy bash shell commands, puts the URL of the SQL endpoint into a file in the box. It also runs a package installation script once after the tool is first installed, the only package not already installed is the ggplot2 package.


function save_api_stub(){
scraperwiki.exec('echo "' + scraperwiki.readSettings().target.url + '" > ~/tool/dataset_url.txt; ')
}
function run_once_install_packages(){
scraperwiki.exec('run-one tool/runonce.R &> tool/log.txt &')
}
$(function(){
save_api_stub();
run_once_install_packages();
});

view raw

code.js

hosted with ❤ by GitHub

The ScraperWiki platform has an update hook, simply an executable file called update in the ~/tool/hooks/ directory which is executed when the underlying dataset changes.

This brings us to the meat of the viewer: the knitrview.R file calls the knitr package to take the view.Rhtml file and convert it into an index.html file in the http directory. The view.Rhtml file contains calls to some functions in R which are used to create the dynamic content.


#!/usr/bin/Rscript
# Script to knit a file 2013-08-08
# Ian Hopkinson
library(knitr)
.libPaths('/home/tool/R/libraries')
render_html()
knit("/home/tool/view.Rhtml",output="/home/tool/http/index.html")

view raw

knitrview.R

hosted with ❤ by GitHub

Code for interacting with the ScraperWiki platform is in the scraperwiki_utils.R file, this contains:

  • a function to read the SQL endpoint URL which is dumped into the box by some JavaScript used in the Rhtml template.
  • a function to read the JSON output from the SQL endpoint – this is a little convoluted since R cannot natively use https, and solutions to read https are different on Windows and Linux platforms.
  • a function to convert imported JSON dataframes to a clean dataframe. The data structure returned by the rjson package is comprised of lists of lists and requires reprocessing to the preferred vector based dataframe format.

Functions for generating the view elements are in view-source.R, this means that the R code embedded in the Rhtml template are simple function calls. The main plot is generated using the ggplot2 library. 


#!/usr/bin/Rscript
# Script to create r-view 2013-08-14
# Ian Hopkinson
source('scraperwiki_utils.R')
NumberOfTweets<-function(){
query = 'select count(*) from tweets'
number = ScraperWikiSQL(query)
return(number)
}
TweetsHistogram<-function(){
library("ggplot2")
library("scales")
#threshold = 20
bin = 60 # Size of the time bins in seconds
query = 'select created_at from tweets order by created_at limit 40000'
dates_raw = ScraperWikiSQL(query)
posix = strptime(dates_raw$created_at, "%Y-%m-%d %H:%M:%S+00:00")
num = as.POSIXct(posix)
Dates = data.frame(num)
p = qplot(num, data = Dates, binwidth = bin)
# This gets us out the histogram count values
counts = ggplot_build(p)$data[[1]]$count
timeticks = ggplot_build(p)$data[[1]]$x
# Calculate limits, method 1 – simple min and max of range
start = min(num)
finish = max(num)
minor = waiver() # Default breaks
major = waiver()
p = p+scale_x_datetime(limits = c(start, finish ),
breaks = major, minor_breaks = minor)
p = p + theme_bw() + xlab(NULL) + theme(axis.text.x = element_text(angle=45,
hjust = 1,
vjust = 1))
p = p + xlab('Date') + ylab('Tweets per minute') + ggtitle('Tweets per minute (Limited to 40000 tweets in total)')
return(p)
}

view raw

view-source.R

hosted with ❤ by GitHub

So there you go – not the world’s most exciting tool but it shows the way to make live reports on the ScraperWiki platform using R. Extensions to this would be to allow some user interaction, for example by allowing them to adjust the axis limits. This could be done either using JavaScript and vanilla R or using Shiny.

What would you do with R in ScraperWiki? Let me know in the comments below or by email: [email protected]

pdftables – a Python library for getting tables out of PDF files

This post was first published at ScraperWiki.

One of the top searches bringing people to the ScraperWiki blog is “how do I scrape PDFs?” The answer typically being “with difficulty”, but things are getting better all the time.

PDF is a page description format, it has no knowledge of the logical structure of a document such as where titles are, or paragraphs, or whether it’s two column format or one column. It just knows where characters are on the page. The plot below shows how characters are laid out for a large table in a PDF file.

AlmondBoard7_LTChar

This makes extracting structured data from PDF a little challenging.

Don’t get me wrong, PDF is a useful format in the right place, if someone sends me a CV – I expect to get it in PDF because it’s a read only format. Send it in Microsoft Word format and the implication is that I can edit it – which makes no sense.

I’ve been parsing PDF files for a few years now, to start with using simple online PDF to text converters, then with pdftohtml which gave me better location data for text and now using the Python pdfminer library which extracts non-text elements and as well as bonding words into sentences and coherent blocks. This classification is shown in the plot below, the blue boxes show where pdfminer has joined characters together to make text boxes (which may be words or sentences). The red boxes show lines and rectangles (i.e. non-text elements).

AlmondBoard7

More widely at ScraperWiki we’ve been processing PDF since our inception with the tools I’ve described above and also the commercial, Abbyy software.

As well as processing text documents such as parliamentary proceedings, we’re also interested in tables of numbers. This is where the pdftables library comes in, we’re working towards making scrapers which are indifferent to the format in which a table is stored, receiving them via the OKFN messytables library which takes adapters to different file types. We’ve already added support to messytables for HTML, now its time for PDF support using our new, version-much-less-than-one pdftables library.

Amongst the alternatives to our own efforts are Mozilla’s Tabula, written in Ruby and requiring the user to draw around the target table, and Abbyy’s software which is commercial rather than open source.

pdftables can take a file handle and tell you which pages have tables on them, it can extract the contents of a specified page as a single table and by extension it can return all of the tables of a document (at the rate of one per page). It’s possible, for simple tables to do this with no parameters but for more difficult layouts it currently takes hints in the form of words found on the top and bottom rows of the table you are looking for. The tables are returned as a list of list of lists of strings, along with a diagnostic object which you can use to make plots. If you’re using the messytables library you just get back a tableset object.

It turns out the defining characteristic of a data scientist is that I plot things at the drop of a hat, I want to see the data I’m handling. And so it is with the development of the pdftables algorithms. The method used is inspired by image analysis algorithms, similar to the Hough transforms used in Tabula. A Hough transform will find arbitrarily oriented lines in an image but our problem is a little simpler – we’re interested in vertical and horizontal rows.

To find these vertical rows and columns we project the bounding boxes of the text on a page onto the horizontal axis ( to find the columns) and the vertical axis to find the rows. By projection we mean counting up the number of text elements along a given horizontal or vertical line. The row and column boundaries are marked by low values, gullies, in the plot of the projection. The rows and columns of the table form high mountains, you can see this clearly in the plot below. Here we are looking at the PDF page at the level of individual characters, the plots at the top and left show the projections. The black dots show where pdftables has placed the row and column boundaries.

AlmondBoard8_projection

pdftables is currently useful for supervised use but not so good if you want to just throw PDF files at it. You can find pdftables on Github and you can see the functionality we are still working on in the issue tracker. Top priorities are finding more than one table on a page and identifying multi-column text layouts to help with this process.

You’re invited to have a play and tell us what you think – [email protected]