Book review: Python for Data Analysis by Wes McKinney


This review was first published at ScraperWiki.

As well as developing scrapers and a data platform, at ScraperWiki we also do data analysis. Some of this is just because we’re interested; other times it’s because clients don’t have the tools or the time to do the analysis they want themselves. Often the problem is the size of the data. Excel is the universal solvent for data analysis problems – go look at any survey of data scientists. But Excel has its limitations. There are technical limits, such as a maximum of roughly a million rows, but well before that size Excel becomes a pain to use.

There is another path – the programming route. As a physical scientist of moderate age I’ve followed these two data analysis paths in parallel. Excel for the quick look-see and some presentation. Programming for bigger tasks, tasks I want to do repeatedly and types of data Excel simply can’t handle – like image data. For me the programming path started with FORTRAN and the NAG libraries, from which I moved into Matlab. FORTRAN is pure, traditional programming born in the days when you had to light your own computing fire. Matlab and competitors like Mathematica, R and IDL follow a slightly different path. At their core they are specialist programming languages but they come embedded in graphical environments which can be used interactively. You type code at a prompt and stuff happens: plots pop up and so forth. You can capture this interaction and put it into scripts/programs, or simply write programs from scratch.

Outside the physical sciences, data analysis often means databases. Physical scientists are largely interested in numbers; other sciences and business analysts are often interested in a mixture of numbers and categorical things. For example, in analysing the performance of a drug you may be interested in the dose (i.e. a number) but also in categorical features of the patient such as gender and their symptoms. Databases, and analysis packages such as R and SAS, are better suited to this type of data. Business analysts appear to move from Excel to Tableau as their data get bigger and more complex. Tableau gives easy visualisation of database-shaped data and provides connectors to many different databases. My workflow at ScraperWiki is often Python to SQL database to Tableau.

Python for Data Analysis by Wes McKinney draws these threads together. The book is partly about the range of tools which make Python an alternative to systems like R, Matlab and their ilk, and partly a guide to McKinney’s own contribution to this area: the pandas library. Pandas brings R-like dataframes and database-like operations to Python. It helps keep all your data analysis needs in one big Python-y tent. Dataframes are 2-dimensional tables of data whose rows and columns have indexes, which can be numeric but are typically text, and which can be hierarchical. The pandas library provides a great deal of functionality for processing dataframes, in particular filtering and grouping calculations which are reminiscent of the SQL database workflow. As well as the 2-dimensional dataframe, pandas also provides a 1-dimensional Series and a 3-dimensional Panel data structure.
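To give a flavour of what that looks like, here’s a minimal sketch of filtering and grouping a dataframe; the data and column names are invented for illustration, not taken from the book.

```python
import pandas as pd

# A toy dataframe of drug trial results; the columns and values are made up.
df = pd.DataFrame({
    "patient": ["a", "b", "c", "d", "e"],
    "gender":  ["f", "m", "f", "f", "m"],
    "dose":    [10, 20, 10, 20, 10],
    "outcome": [0.3, 0.7, 0.4, 0.9, 0.2],
})

# SQL-like filtering and grouping: mean outcome by gender and dose.
high_dose = df[df["dose"] >= 20]
summary = df.groupby(["gender", "dose"])["outcome"].mean()
print(high_dose)
print(summary)   # the group keys form a hierarchical index
```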

I’ve already been using pandas in the Python part of my workflow. It’s excellent for importing data, and simplifies the process of reshaping data for upload to a SQL database and onwards to visualisation in Tableau. I’m also finding it can be used to help replace some of the more exploratory analysis I do in Tableau and SQL.
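A sketch of that import-reshape-upload step, assuming a wide CSV file and a local SQLite database; the file and column names here are hypothetical:

```python
import sqlite3
import pandas as pd

# Hypothetical wide file: one row per organisation, one column per year.
wide = pd.read_csv("grants_by_year.csv")

# Reshape into the long, database-friendly form: one row per organisation-year.
long_form = wide.melt(id_vars=["organisation"], var_name="year", value_name="amount")

# Push the result into SQLite, ready for Tableau or further SQL work.
with sqlite3.connect("grants.db") as conn:
    long_form.to_sql("grants", conn, if_exists="replace", index=False)
```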

Outside of pandas, the key technologies McKinney introduces are the IPython interactive console and the NumPy library. I mentioned the IPython notebook in my previous book review. IPython gives Python the interactive analysis capabilities of systems like Matlab. NumPy is a high-performance library providing simple multi-dimensional arrays, comforting those who grew up with a FORTRAN background.
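A small sketch of the whole-array arithmetic NumPy provides, the kind of thing a FORTRAN programmer would recognise; the numbers are arbitrary:

```python
import numpy as np

# Whole-array arithmetic without explicit loops.
x = np.linspace(0.0, 2.0 * np.pi, 100)   # 100 evenly spaced points
y = np.sin(x) ** 2

# Multi-dimensional arrays with vectorised reductions.
grid = np.arange(12).reshape(3, 4)
print(grid.mean(axis=0))   # column means
print(y.max(), y.sum())
```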

Why switch from commercial offerings like Matlab to the Python ecosystem? Partly it’s cost: the pricing model for Matlab has a moderately expensive core (of the order of $1000) with further functionality in moderately expensive toolboxes (more $1000s). Furthermore, the most painful and complex thing I did at my previous (very large) employer was represent users in the contractual interactions between my company and Mathworks to license Matlab and its associated toolboxes for hundreds of employees spread across the globe. These days Python offers me a wider range of high quality toolboxes, and at its core it’s a respectable programming language with all the features and tooling that brings. If my code doesn’t run it’s because I wrote it wrong, not because my colleague in Shanghai has grabbed the last remaining network licence for a key toolbox. R still offers statistical analysis with greater gravitas and some really nice, publication quality plotting, but it does not have the air of a general purpose programming language.

The parts of Python for Data Analysis which I found most interesting, and engaging, were the examples of pandas code in "live" usage. Early in the book this includes analysis of first names for babies in the US over time, with later examples from the financial sector – in which the author worked. Much of the rest is very heavy on code snippets, which distracts from a straightforward reading of the book. In some senses Mining the Social Web has really spoiled me – I now expect a book like this to come with an IPython Notebook!

Trainspotting

I am something of a trainspotter.

That’s not to say I have ever stood at the end of a platform writing down the numbers of the trains that go by, rather that I have an interest in things of a railway nature. So obviously I was very excited to get the opportunity to go to Paris on the train.

I’ve been to Paris on the train before. Fifteen years or so ago HappyMouffetard and I travelled from Cambridge to Paris for the odd weekend. In fact that’s where HappyMouffetard picked up her twitter handle. In those days the terminus for the Eurostar was at Waterloo, so the trip meant crossing London from Kings Cross where the Cambridge train came in. Once on the Eurostar you pottered through Kent to the Dover end of the tunnel at what seemed like barely more than walking pace. After passing through the tunnel to France the train accelerated for a while before the guard told us we were travelling at some unimaginable velocity. He sounded a bit smug. The Eurostar would then whine rapidly through northern France to arrive at Gare du Nord.

Things have changed. Now the Eurostar terminus is at St Pancras, which is next door to Kings Cross and a short step down the road from Euston, the station I arrive at from Chester. St Pancras International is a rather fine station, particularly when compared to the competition: airports. Not only does it offer a long bank of charging points but also free WiFi! The trip to Dover is transformed: the train plunges underground for the first few miles but then whizzes along at positively unBritish speeds to the Channel. A little over two hours after leaving London, you are in Paris. Pick the right trains and there are just two scheduled stops between Chester and Paris (at Crewe and in London)!

This makes the whole journey rather more of a practical proposition, even if you are travelling from northern England. Chester to London is currently a little over two hours’ travel time; it would take me an hour and a bit to reach Manchester airport. Check-in for Eurostar is an hour or so, then a couple of hours to Paris, and you end up at Gare du Nord in the centre of Paris rather than at Charles de Gaulle Airport – some distance away. Once at Gare du Nord you walk straight off the train onto the street. Similarly on my return trip, I walk straight off the train and I’m on the platform at Euston in 15 minutes.

I’ve rarely found airports relaxing; they seem hellholes of "duty-free" shopping, stressed travellers, over-crowding, bad food and building works to insert more shopping opportunities, suffused with the baseline low-level dread that the implausibility of powered flight invokes. The only exceptions I’ve found are when I’ve been able to travel business class and take refuge in the business lounge. In fact, I’m not bothered about the business class flying experience – it’s the lounge I’d pay for myself! And once you’re on the plane it’s cheek by jowl with your fellow man, air hostesses trying to force plastic food upon you, and hand luggage woes, since there is insufficient space for the hand luggage everyone now carries now that you get gouged for hold luggage.

Cost-wise things aren’t so happy: a train to London is expensive unless you travel "off peak", a small window in the middle of the day and later in the evening.

In summary, from Chester to Paris:

Flying: 1 hour 30 minutes + 2 hours check-in + 1 hour 30 minutes flight + 1 hour to Paris + airport hell = nasty 6 hours

Train: 2 hours + 1 hour check-in + 2 hours + shiny nice things = nice 5 hours

Book review: Mining the Social Web by Matthew A. Russell


This review was first published at ScraperWiki.

The Twitter search and follower tools are amongst the most popular on the ScraperWiki platform, so we are looking to provide more value in this area. To this end I’ve been reading "Mining the Social Web" by Matthew A. Russell.

In the first instance the book looks like a run through the APIs for various social media services (Twitter, Facebook, LinkedIn, Google+, GitHub etc) but after the first couple of chapters, on Twitter and Facebook, it becomes obvious that it is more subtle than that. Each chapter also includes material on a data mining technique; for Twitter it is simply counting things. The Facebook chapter introduces graph analysis, a theme extended in the chapter on GitHub. Google+ is used as a framework to introduce term frequency-inverse document frequency (TF-IDF), an information retrieval technique and a basic, but effective, way to process natural language. Web page scraping is used as a means to introduce some more ideas about natural language processing and summarisation. Mining mailboxes uses a subset of the Enron mail corpus to introduce MongoDB as a document storage system. The final chapter is a Twitter cookbook which includes lots of short recipes for simple Twitter-related activities but no further analysis. The coverage of each topic isn’t deep but it is practical, introducing the key libraries for the tasks. And it’s alive with suggestions for further work, and references to help with that.
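To make TF-IDF concrete, here is a minimal hand-rolled sketch on toy documents; the book works through it with real Google+ data and library support, so treat this as illustration only:

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs make good pets",
]
tokenised = [d.split() for d in docs]

def tf_idf(term, doc_tokens, corpus):
    # Term frequency: how common the word is within this document.
    tf = doc_tokens.count(term) / len(doc_tokens)
    # Inverse document frequency: how rare the word is across the corpus.
    containing = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / containing)
    return tf * idf

print(tf_idf("cat", tokenised[0], tokenised))  # distinctive word, higher score
print(tf_idf("sat", tokenised[0], tokenised))  # shared between documents, lower score
```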

The examples in the book are provided as IPython Notebooks which are supplied, along with a Notebook server on a virtual machine, from a GitHub repository. IPython Notebooks are interactive Python sessions run through a browser interface. Content is divided into cells which can be either code or simple descriptive text. A code cell can be executed and the output from the code appears in an output cell. These notebooks are a really nice way to present example code since the code has some context. The virtual machine approach is also a great innovation since configuring Python libraries and the IPython server itself, in a platform agnostic manner, is really difficult and this solution bypasses most of those problems. The system makes it incredibly easy to run the example code for yourself – almost too easy, in fact: I found myself clicking blindly through some of the example code. Potentially the book could have been presented simply as an IPython Notebook; this is likely not economically practical, but it would be nice to collect the links to further reading there, where they would be more usable. The GitHub repository also provides a great place for interaction with the author: I filed a couple of issues regarding setting the system up and he responded unerringly quickly – as he did for many other readers. Also I discovered incidentally, through being subscribed to the repository, that one of the people I follow on Twitter (and a guest blogger here) was also reading the book. An interesting example of the social web in action!

Mining the Social Web covers some material I had not come across in my earlier machine learning/data mining reading. There are a couple of chapters containing material on graph theory, using data from Facebook and GitHub. In the way of benefitting from reading about the same material in different places, Russell highlights that clustering and de-duplication are of course facets of the same subject.

I read with interest the section on using a MongoDB database as a store for tweets and other data in the form of JSON objects. Currently I am bemused by MongoDB. The ScraperWiki platform uses it to store user profile information, and I have occasional recourse to try to look things up there. I’ve struggled to see the benefit of MongoDB over a SQL database, particularly having watched two of my colleagues spend a morning working out how to do what would be a simple SQL join in MongoDB. Mining the Social Web has made me wonder about giving MongoDB another chance.
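As a minimal sketch of what that tweet storage looks like in practice, assuming a local MongoDB server and the pymongo library (the database and collection names are invented):

```python
from pymongo import MongoClient

# Connect to a local MongoDB server; "social" and "tweets" are made-up names.
client = MongoClient("localhost", 27017)
tweets = client["social"]["tweets"]

# Tweets arrive from the API as JSON-like dicts and can be stored as-is,
# with no schema to define up front.
tweets.insert_one({
    "id_str": "123456",
    "text": "Reading Mining the Social Web",
    "user": {"screen_name": "example_user", "followers_count": 42},
})

# Query by a nested field, something a SQL table would need flattening for.
for doc in tweets.find({"user.followers_count": {"$gt": 10}}):
    print(doc["text"])
```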

The penultimate chapter is a discussion of the semantic web, introducing both microformats and RDF technology, although the discussion is much less concrete than in earlier chapters. Microformats are HTML elements which hold semantic information about a page using an agreed schema; to give an example, the geo microformat encodes geographic information. In the absence of such a microformat, geographic information such as latitude and longitude could be encoded in pretty much any way, making it necessary to either use custom scrapers on a page-by-page basis or complex heuristics to infer the presence of such information. RDF is one of the underpinning technologies for the semantic web: shorthand for a worldwide web marked up such that machines can understand the meaning of webpages. This touches on the EU Newsreader project, on which we are collaborators, and which seeks to generate this type of semantic markup for news articles using natural language processing.
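A minimal sketch of pulling the geo microformat out of a page with BeautifulSoup; the HTML fragment is invented and this is not the book’s own code:

```python
from bs4 import BeautifulSoup

# An invented page fragment using the geo microformat's agreed class names.
html = """
<div class="geo">
  <span class="latitude">53.19</span>
  <span class="longitude">-2.89</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
geo = soup.find(class_="geo")
latitude = float(geo.find(class_="latitude").get_text())
longitude = float(geo.find(class_="longitude").get_text())
print(latitude, longitude)   # no per-site heuristics needed; the schema is agreed
```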

Overall, definitely worth reading. We’re interested in extending our tools for social media and with this book in hand I’m confident we can do it and be aware of more possibilities.

Book review: Data Mining – Practical Machine Learning Tools and Techniques by Witten, Frank and Hall


This review was first published at ScraperWiki.

I’ve been doing more reading on machine learning, this time in the form of Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten, Eibe Frank and Mark A. Hall. This comes by recommendation of my academic colleagues on the Newsreader project, who rely heavily on machine learning techniques to do natural language processing.

Data mining is about finding structure in data, and the algorithms for doing this are found in the field of machine learning. The classic example is the Iris flower dataset. This dataset contains measurements of parts of a flower for three different species of Iris; the challenge is to build a system which classifies a flower to its species from its measurements. More practical examples are in the diagnosis of machine faults, credit assessment, detection of oil slicks, customer support analysis, marketing and sales.
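The book works examples like this in Weka, but as a rough sketch of the same task in Python (using scikit-learn, which is not covered in the book), it looks something like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# The classic Iris measurements and species labels ship with scikit-learn.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Fit a decision tree and check how well it classifies unseen flowers.
model = DecisionTreeClassifier().fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
```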

Previously I’ve reviewed Machine Learning in Action by Peter Harrington. Data Mining is a somewhat different book. The core contents are quite similar: background to machine learning, evaluating your results and a run through the core algorithms. Machine Learning in Action is a pretty quick run through the field touching on many subjects, with toy demonstrations built from scratch in Python. Data Mining, running to almost 600 pages, is a much more thorough reference. There is a place for both types of book, even on the same bookshelf.

Data Mining is written by three members of the University of Waikato’s Computer Science Department and is based around the Weka machine learning system developed there. Weka is a complete framework, written in Java, which implements the algorithms described in the book as well as some others. Weka can be accessed via the command line or using a GUI. As well as the machine learning algorithms there are systems for preparing data, and for evaluating and visualising results. A collection of well-known demonstration datasets is included. I’ve no reason to doubt the quality of the implementations in Weka; the GUI is functional, occasionally puzzling and not particularly slick. The book stands alone from the Weka framework, but the framework provides a good playground to try out the techniques discussed in the book, and Weka seems entirely suitable for conducting serious analysis. This is in contrast to the approach of Harrington in Machine Learning in Action, who provides toy implementations of algorithms in Python.

The first two parts of the book provide an overview of machine learning, followed by a more detailed look at how the key algorithms are implemented. The third section is dedicated to Weka, whilst the first two sections refer to it but do not rely on it. The third section is divided into a discussion of Weka, covering all its key features and then a tutorial. I found this a bit confusing since the first part has the air of a tutorial, but isn’t, and the tutorial part keeps referring back to the overview section for its screenshots.

Coming to the book with some knowledge of machine learning already, the things I learned from it were:

  • better methods for, and subtleties in, measuring the performance of machine learning algorithms;
  • the success of the one-rule algorithm, essentially a decision tree which gets the maximum benefit from a single rule. It turns out such an approach is surprisingly effective and only bettered a little, if at all, by more sophisticated algorithms (see the sketch after this list);
  • getting enough, clean data to do machine learning is often a problem;
  • where to learn more!
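The sketch promised above: a minimal paraphrase of the one-rule idea on a made-up fragment of the weather-style demonstration data that ships with Weka; this is my own illustration, not code from the book.

```python
from collections import Counter, defaultdict

def one_rule(rows, features, label):
    """Pick the single feature whose per-value majority class makes the fewest errors."""
    best = None
    for feature in features:
        # For each value of the feature, predict the most common label.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[feature]][row[label]] += 1
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        if best is None or errors < best[1]:
            best = (feature, errors, rule)
    return best  # (feature, error count, value -> predicted class)

# A made-up fragment in the style of Weka's "weather" demonstration data.
rows = [
    {"outlook": "sunny",    "windy": "no",  "play": "no"},
    {"outlook": "sunny",    "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no",  "play": "yes"},
    {"outlook": "rainy",    "windy": "no",  "play": "yes"},
    {"outlook": "rainy",    "windy": "yes", "play": "no"},
]
print(one_rule(rows, ["outlook", "windy"], "play"))
```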

The first edition of this book was published in 1999; my review is of the third edition. The book does show some signs of age. Machine Learning in Action was written as a response to a poll published at the International Conference on Data Mining 2006 on the 10 most important machine learning algorithms (see the paper here). Whilst Data Mining mentions this survey, it does so as something of an afterthought, and the authors seem bemused by the inclusion of the PageRank algorithm used by Google to rank web pages in search results. They mention the MOA framework for data stream mining, although they do not discuss it in any detail; MOA focuses on techniques for large datasets.

In summary: a well-written, well-structured and readable book on machine learning algorithms with demonstrations based on an extensive machine learning framework. Definitely one to read and come back to for reference.

The BIG Lottery Data


This post was originally published at ScraperWiki.

The UK’s BIG Lottery Fund recently released its grant data since 2004 as a set of lovely CSV files: you can get it yourself here or here. I found it a great opportunity to try out some new tricks with Tableau, and to have a bit of a poke around another largish dataset from government. The data runs to a little under 120,000 lines.

The first question to ask is: where is all the money going?

The total awarded is £5,277,058,180 over nearly 10 years, going to 81,386 different organisations. The sizes of grants vary enormously; the biggest, £214,340,846, went to the Big Local Trust, which is an umbrella organisation. Other big recipients include the Royal Society of Wildlife Trusts, which received £59,842,400 for the Local Food programme. The top 10 grants are listed below:

01/03/2012 – Big Local Trust – £214,340,846
15/08/2007 – Royal Society of Wildlife Trusts – £59,842,400
04/10/2007 – The Federation of Groundwork Trusts – £58,306,400
13/05/2008 – Sustrans Limited – £49,980,908
11/10/2012 – Life Changes (Trustee) Limited – £49,338,186
13/12/2011 – Forces In Mind Trustee Limited – £34,808,423
19/10/2007 – Natural England – £30,113,200
01/05/2007 – Legacy Trust UK Limited – £28,850,000
31/07/2007 – Sustrans Limited – £25,023,084
09/04/2008 – Falkirk Council – £25,000,000

Awards like this make determining the true geographic distribution of grants a bit tricky, since they are registered as being awarded to a particular local area – apparently the head office of the applicant – but they are used nationally. There is a regional breakdown of where the money is spent, but this classification is to large areas, i.e. "England" or "North West". The Big Local Trust, Life Changes and Forces in Mind are all very recently established – less than a couple of years old. The Legacy Trust was established in 2007 to fund programmes to promote an Olympic legacy.

These are really big grants, but what does the overall distribution of awards look like?

This is shown in the chart below:

Award distribution

It’s a bit complicated because the spread of award sizes runs from about £1000 to over £100,000,000, so what I’ve done is take the logarithm of the award size to create the bins. This means that the column marked "3" contains the sum of all awards from £1000 to £9999 and that marked "4" contains the sum of all awards from £10,000 to £99,999. The chart shows that most money is distributed in the column marked "5", i.e. £100,000 to £999,999. The columns are coloured by the year in which the money was awarded, so we can see that there were large grants awarded in 2007 as well as in 2011 and 2013.
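A minimal sketch of that binning with pandas; the column headers ("Award Amount", "Award Date") and file name are my assumptions, not necessarily what the BIG Lottery CSV files actually use:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("big_lottery_grants.csv")   # hypothetical file name

# Order-of-magnitude bins: 3 means £1000-£9999, 4 means £10,000-£99,999, etc.
df["magnitude"] = np.floor(np.log10(df["Award Amount"])).astype(int)
df["year"] = pd.to_datetime(df["Award Date"], dayfirst=True).dt.year

# Total money awarded in each bin, split by the year of award.
binned = df.groupby(["magnitude", "year"])["Award Amount"].sum().unstack("year")
print(binned)
```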

Everybody loves a word cloud, even though we know it’s not good in terms of data visualisation: a simple bar chart shows the relative frequency of words more clearly. The word cloud below shows the frequency of words appearing in the applicant name field of the data; lots of money is going to Communities, Schools, Clubs and Councils.

Word cloud of applicant names
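Counting the words behind the cloud is straightforward; a sketch assuming the applicant name column is called "Organisation Name" (the real header may differ):

```python
from collections import Counter
import pandas as pd

df = pd.read_csv("big_lottery_grants.csv")   # hypothetical file name

# Split applicant names into words and count them across the whole dataset.
words = Counter(
    word.lower()
    for name in df["Organisation Name"].dropna()   # assumed column name
    for word in name.split()
)
print(words.most_common(20))   # the raw numbers behind the word cloud
```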

The data also include the founding date for the organisations to which money is awarded; most of them were founded since the beginning of the 20th century. There are quite a few schools and local councils in the list and, particularly for councils, we can see the effect of legislation on the foundation of these organisations: there are big peaks in founding dates for councils in 1894 and in 1972-1974, coinciding with a couple of local government acts. There’s a dip in the foundation of bodies funded by the BIG lottery for both the First and Second World Wars; I guess people’s energies were directed elsewhere. The National Lottery started in the UK in late 1994.

Founding year

As a final piece of analysis I thought I’d look at sport; I’m not particularly interested in sport, so I let natural language processing find sports for me in applicant names – they are often of the form "Somewhere Cricket/Rugby/Tennis/etc Club". One way of picking out all the sports awards would be to come up with a list of sports names and compare against that list, but I applied a little more cunning: the nltk library will tell you how closely related two words are using the WordNet lexicon which it contains. So I identified sports by measuring how closely related a target word was to the word "sport". This got off to a shaky start since I decided to use "cricket" as a test word; "cricket" is as closely related to "sport" as "hamster" – a puzzling result until I realised that the first definition of "cricket" in WordNet relates to the insect! With this confusion dispensed with, finding all the sports mentioned in the applicant names was an easy task. The list of sports I ended up with was unexceptional.
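A minimal sketch of that WordNet trick, which is my reconstruction of the approach rather than the actual analysis code; it needs the WordNet corpus installed (nltk.download("wordnet")):

```python
from nltk.corpus import wordnet as wn

sport = wn.synset("sport.n.01")

def sportiness(word):
    """Best Wu-Palmer similarity between any noun sense of `word` and 'sport'."""
    scores = [sport.wup_similarity(s) for s in wn.synsets(word, pos=wn.NOUN)]
    scores = [s for s in scores if s is not None]
    return max(scores) if scores else 0.0

# The first noun sense of "cricket" in WordNet is the insect, which is why a
# naive first-sense comparison scores it no higher than "hamster"; taking the
# best score over all senses picks up the sport.
for word in ["cricket", "rugby", "tennis", "hamster"]:
    print(word, round(sportiness(word), 2))
```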

You can find participation levels in various sports here; I plotted them together with numbers of awards. Sports near the top left have relatively few awards given the number of participants, whilst those at the bottom right have more awards than would be expected from the number of participants.

 

Number of clubs vs number of participants

You can see interactive versions of these plots, plus a few more, here on Tableau Public.

That’s what I found in the data – what would interest you?

Footnotes

I uploaded the CSV files to a MySQL database before loading them into Tableau; I also did a bit of work in Python using the pandas library. In addition to the BIG Lottery data I pulled in census data from the ONS and geographic boundary data from Tableau Mapping. You can see all this unfolding in the bitbucket repo I set up to store the analysis. Since Tableau workbook files are XML format they can usefully be stored in source control.