Dr Administrator

Author's posts

Book review: Data Mining – Practical Machine Learning Tools and Techniques by Witten, Frank and Hall

datamining

This review was first published at ScraperWiki.

I’ve been doing more reading on machine learning, this time in the form of Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten, Eibe Frank and Mark A. Hall. This comes by recommendation of my academic colleagues on the Newsreader project, who rely heavily on machine learning techniques to do natural language processing.

Data mining is about finding structure in data, the algorithms for doing this are found in the field of machine learning. The classic example is Iris flower dataset. This dataset contains measurements of parts of a flower for three different species of Iris, the challenge is to build a system which classifies a flower to its species by its measurements. More practical examples are in the diagnosis of machine faults, credit assessment, detection of oil slicks, customer support analysis, marketing and sales.

Previously I’ve reviewed Machine Learning in Action by Peter Harrington. Data Mining is a somewhat different book. The core contents are quite similar: background to machine learning, evaluating your results and a run through the core algorithms. Machine Learning in Action is a pretty quick run through the field touching on many subjects, with toy demonstrations built from scratch in Python. Data Mining, running to almost 600 pages, is a much more thorough reference. There is a place for both types of book, even on the same bookshelf.

Data Mining is written by three members of the University of Waikato’s Computer Science Department and is based around the Weka machine learning system developed there. Weka is a complete framework, written in Java, which implements the algorithms described in the book as well as some others. Weka can be accessed via the command-line or using a GUI. As well as the machine learning algorithms there are systems for preparing data, evaluating and visualising results. A collection of well-known demonstration data sets are included. I’ve no reason to doubt the quality of the implementations in Weka, the GUI interface is functional, occasionally puzzling and not particularly slick. The book stands alone from the Weka framework but the framework provides a good playground to try out the techniques discussed in the book. Weka seems to be entirely suitable for conducting serious analysis. This approach is in contrast to the approach of Harrington in Machine Learning in Action who provides toy implementations of algorithms in Python.

The first two parts of the book provide an overview of machine learning, followed by a more detailed look at how the key algorithms are implemented. The third section is dedicated to Weka, whilst the first two sections refer to it but do not rely on it. The third section is divided into a discussion of Weka, covering all its key features and then a tutorial. I found this a bit confusing since the first part has the air of a tutorial, but isn’t, and the tutorial part keeps referring back to the overview section for its screenshots.

With some knowledge already in machine learning, the things I learned from this book:

  • better methods, and subtleties in measuring the performance of machine learning algorithms;
  • the success of the one-rule algorithm, essentially a decision tree which gets the maximum benefit from a single rule. It turns such an approach is surprisingly effective and only bettered a little, if at all, by more sophisticated algorithms;
  • getting enough, clean data to do machine learning is often a problem;
  • where to learn more!

The first edition of this book was published in 1999; my review is of the third edition. The book does show some signs of age, Machine Learning in Action  was written as a response to a poll published at the International Conference on Data Mining 2006 on the 10 most important machine learning algorithms (see the paper here). Whilst Data Mining mentions this survey, it is as something of an afterthought and the authors seem bemused by the inclusion of the PageRank algorithm used by Google to rank web pages in search results. They mention the Moa framework for data stream mining although do not discuss it in any detail. Moa focuses on techniques for large datasets.

In summary: a well-written, well-structured and readable book on machine learning algorithms with demonstrations based on an extensive machine learning framework. Definitely one to read and come back to for reference.

The BIG Lottery Data

uklogo

This post was originally published at ScraperWiki.

The UK’s BIG Lottery Fund recently released its grant data since 2004 as a set of lovely CSV files: You can get it yourself here or here. I found it a great opportunity to try out some new tricks with Tableau, and have a bit of a poke around another largish dataset from government. The data runs to a little under 120,000 lines.

The first question to ask is: where is all the money going?

The total awarded is £5,277,058,180 over nearly 10 years. It’s going to 81,386 different organisations. The sizes of grants vary enormously; the biggest, £214,340,846, going to the Big Local Trust, which is an umbrella organisation. Other big recipients include the Royal Society of Wildlife Trusts, who received £59,842,400 for the Local Food programme. The top 10 grants are listed below:

01/03/2012, Big Local Trust  £        214,340,846
15/08/2007, Royal Society of Wildlife Trusts  £          59,842,400
04/10/2007, The Federation of Groundwork Trusts  £          58,306,400
13/05/2008, Sustrans Limited  £          49,980,908
11/10/2012, Life Changes (Trustee) Limited  £          49,338,186
13/12/2011, Forces In Mind Trustee Limited  £          34,808,423
19/10/2007, Natural England  £          30,113,200
01/05/2007, Legacy Trust UK Limited  £          28,850,000
31/07/2007, Sustrans Limited  £          25,023,084
09/04/2008, Falkirk Council  £          25,000,000

Awards like this make determining the true geographic distribution of grants a bit tricky, since they are registered as being awarded to a particular local area – apparently the head office of the applicant – but they are used nationally. There is a regional breakdown of where the money is spent but this classification is to large areas i.e. “England” or “North West”. The Big Local Trust, Life Changes and Forces in Mind are all very recently established – less than a couple of years old. The Legacy Trust was established in 2007 to fund programmes to promote an Olympic legacy.

These are really big grants, but what does the overall distribution of awards look like?

This is shown in the chart below:

Award distribution

It’s a bit complicated because the spread of award sizes is from about £1000 to over £100,000,000 so what I’ve done is taken the logarithm of the award to create the bins. This means that the column marked “3” contains the sum of all awards from £1000 to £9999 and that marked “4” contains the sum of all awards from £10,000 to £99,999. The chart shows that most money is distributed in the column marked “5”, i.e. £100,000 to £999,999. The columns are coloured by the year in which money was awarded, so we can see that there were large grants awarded in 2007 as well as 2011 and 2013.

Everybody loves a word cloud, even though we know it’s not good in terms of data visualisation, a simple bar chart shows the relative frequency of words more clearly. The word cloud below shows the frequency of words appearing in the applicant name field of the data, lots of money going to Communities, Schools, Clubs and councils.

image

The data also include the founding date for the organisations to which money is awarded, most of them were founded since the beginning of the 20th century. There are quite a few schools and local councils in the list and, particularly for councils we can see the effect of legislation on the foundation of these organisations, there are big peaks in founding dates for councils in 1894 and in 1972-1974, coinciding with a couple of local government acts. There’s a dip in the foundation of bodies funded by the BIG lottery for both the First and Second World Wars, I guess people’s energies were directed elsewhere. The National Lottery started in the UK in late 1994.

Founding year

As a final piece of analysis I thought I’d look at sport; I’m not particularly interested in sport so I let natural language processing find sports for me in applicant names – they are often of the form “Somewhere Cricket/Rugby/Tennis/etc Club”. One way of picking out all the sports awards would be to come up with a list of sports names and compare against that list but I applied a little more cunning: the nltk library will tell you how closely related two words are using the WordNet lexicon which it contains. So I identified sports by measuring how closely related a target word was from the word “sport”. This got off to a shaky start  since I decided to use “cricket” as a test word; “cricket” is as closely related to “sport” as “hamster” – a puzzling result until I realised that the first definition of “cricket” in WordNet relates to the insect! This confusion dispensed with finding all the sports mentioned in the applicant names was an easy task. The list of sports I ended up with was unexceptional.

You can find participation levels in various sports here, I plotted them together with numbers of awards. Sports near the top left have relatively few awards given the number of participants, whilst those bottom right have more awards than would be expected from the number of participants.

 

Number of clubs vs number of participants

You can see interactive versions of these plots, plus a view more here on Tableau Public.

That’s what I found in the data – what would interest you?

Footnotes

I uploaded the CSV files to a MySQL database before loading into Tableau, I also did a bit of work in Python using the pandas library. In addition to the BIG lottery data I pulled in census data from the ONS and geographic boundary data from Tableau Mapping. You can see all this unfolding on the bitbucket repo I set up to store the analysis. Since Tableau workbook files are XML format they can be usefully stored in source control.

Review of the year: 2013

Liverpool Metropolitan CathedralMy blogging is much reduced this year, at least on my own blog. This is a result of my new job with ScraperWiki and child care, Thomas is now nearly two years old.

I started the year with a couple of posts on my shiny new laptop; working for a startup I’ve escaped from the corporate Dell. One post was on the beast itself – a Sony VAIO, and Windows 8 – Microsoft’s somewhat confusing new operating system offering. The other post was on running Ubuntu on the VAIO. In the past this was a case of setting up dual boot but various innovations make this difficult and there is, in my view, a better solution: a virtual machine.

There wasn’t much ranting this year: I only managed one little one about higher education, and the reluctance amongst lecturers to take any teaching qualifications. The only other marginally opinion piece was on electronic books, where I muttered about DRM limiting the functionality of ebooks.

I managed to read a few books which ended up on my own blog: The Eighth Day of Creation, about the unravelling of the genetic code was a dense, heroic read. The Dinosaur Hunters was light and fluffy. Empire of the Clouds and The Backroom Boys were largely wistful rememberings of Britain’s former greatness in jet aeroplanes and in technology more generally. Chasing Venus and a History of the World in 12 Maps returned to the themes of geodesy and mapping which I’ve explored in the past. Finally, a bit of London history with The Subterranean Railway and Lucy Inglis’ Georgian London. I’ve been following Lucy on twitter since Georgian London was a twinkle in her eye. It’s difficult to choose a favourite amongst these, it’s either History of the World in 12 Maps or Georgian London.

Over on ScraperWiki’s blog I’ve been knocking out blog posts at a great rate, you can see them all here. I did a good deal of book reviewing over there too, my commute into work on the train means I get an hour or so of reading every day – which quickly adds up to a lot of reading! I read about machine learning, data visualisation (this and this), Tableau (this and this), natural language processing, R, Javascript and software engineering. I’m currently ploughing my way through Data Mining: Practical Machine Learning Tools and Techniques. I think my favourite of these was Natural Language Processing with Python. I’m beginning to see the value of the more expensive, better established publishing houses in terms of book quality.

Alongside this I did a few blog posts on new tools for my trade. I’ve long programmed to do scientific analysis but ScraperWiki is a company which sells software, and the discipline of writing software for others to use is different from writing software for yourself, particularly important are testing and source control.

I spoke at a couple of events: Data Science London, and Strata London where I gave an Ignite talk. Ignite talks follow a special format, they are five minutes long and you get 20 slides which advance automatically at the rate of one every 15 seconds – a somewhat frantic experience. My talk is captured on video.

I also did some bits of data analysis; #InspiringWomen was a look at a response to the online bullying and abuse of women. A place in the country was about data on house prices which we had collected for a campaign by Shelter.

Back on my own blog I managed to do a couple of photographic posts, one on Liverpool. The rail loop under Liverpool was closed which meant I had to walk across town to work, and I suddenly realised that Liverpool is rather spectacular architecturally. This led me on to getting the Pevsner Guide to Liverpool. The ScraperWiki office might be a bit unusual in that a quarter of the company owns this book! I also went on a business trip to Trento, which turns out to be a very attractive city, unfortunately I only had my phone with me to take photos.

The last year has highlighted to me what a privilege it was to have so much time to spend on my blogging, photography and garden shed fiddling in the past. It’s what got me my new job but for many, equally able, people this investment of time simply isn’t possible with the other responsibilities they have. Something to consider the next time you’re recruiting, and so highly rating that extra-curricular activity.

Also I realise I have a great deal of theoretical knowledge about a whole pile of technologies but I have spent rather less time on actually doing anything with them, so maybe this coming year there’ll be less reading and more coding on the train.

Happy New Year to you all!

Book review: Tableau 8 – the official guide by George Peck

tableau 8 guideThis review was first published at ScraperWiki.

A while back I reviewed Larry Keller’s book The Tableau 8.0 Training Manual, at the same time I ordered George Peck’s book Tableau 8: the official guide. It’s just arrived. The book comes with a DVD containing bonus videos featuring George Peck’s warm, friendly tones and example workbooks. I must admit to being mildly nonplussed at receiving optical media, my ultrabook lacking an appropriate drive, but I dug out the USB optical drive to load them up. Providing an online link would have allowed the inclusion of up to date material, perhaps covering the version 8.1 announcement.

Tableau is a data visualisation application, aimed at the business intelligence area and optimised to look at database shaped data. I’m using Tableau on a lot of the larger datasets we get at ScraperWiki for sense checking and analysis.

Colleagues have noted that analysis in Tableau looks like me randomly poking buttons in the interface. From Peck’s book I learn that the order in which I carry out random clicking is important since Tableau will make a decision on what you want to see based both on what you have clicked and also its current state.

To my mind the heavy reliance on the graphical interface is one of the drawbacks of Tableau, but clearly, to business intelligence users and journalists, it’s the program’s greatest benefit. It’s a drawback because capturing what you’ve done in a GUI is tricky. Some of the scripting/version control capability is retained since most Tableau files are in plain XML format with which a little fiddling is tacitly approved by Tableau – although you won’t find such info in The Official Guide. I’ve been experimenting with using git source control on workbook files, and it works.

If you’re interested in these more advanced techniques then the Tableau Knowledgebase is worth a look. See this article, for example, on making a custom colour palette. I also like the Information Lab blog, 5 things I wish I knew about Tableau when I started and UK Area Polygon Mapping in TableauThe second post covers one of the bug-bears for non-US users of Tableau: the mapping functionality is quite US-centric.

Peck covers most of the functionality of Tableau, including data connections, making visualisations, a detailed look at mapping, dashboards and so forth. I was somewhat bemused to see the scatter plot described as “esoteric”. This highlights the background of those typically using Tableau: business people not physical scientists, and not necessarily business people who understand database query languages. Hence the heavy reliance on a graphical user interface.

I particularly liked the chapters on data connections which also described the various set, group and combine operations. Finally I understand the difference between data blending and data joining: joining is done at source between tables on the same database whilst blending is done on data from different sources by Tableau, after it has been loaded. The end result is not really different.

I now understand the point of table calculations – they’re for the times when you can’t work out your SQL query. Peck uses different language from Tableau in describing table calculations. He uses “direction” to refer to the order in which cells are processed and “scope” to refer to the groups over which cell calculations are performed. Tableau uses the terms “addressing” and “partitioning” for these two concepts, respectively.

Peck isn’t very explicit about the deep connections between SQL and Tableau but makes sufficient mention of the underlying processes to be useful.

It was nice to see a brief, clear description of the options for publishing Tableau workbooks. Public is handy and free if you want to publish to all. Tableau Online presents a useful halfway house for internal publication whilst Tableau Server gives full flexibility in scheduling updates to data and publishing to a range of audiences with different permission levels. This is something we’re interested in at ScraperWiki.

The book ends with an Appendix of functions available for field calculations.

In some ways Larry Keller and George Peck’s books complement each other, Larry’s book (which I reviewed here) contains the examples that George’s lacks and George’s some of the more in depth discussion missing from Larry’s book.

Overall: a nicely produced book with high production values, good but not encyclopedic coverage.

Git!

logo@2x

This post was first published at ScraperWiki.

As software company, use of some sort of software source control system is inevitable, indeed our CEO wrote TortoiseCVS – a file system overlay for the early CVS source control system. For those uninitiated in the joys of software engineering: source control is a system for recording the history of file revisions allowing programmers to edit their code, safe in the knowledge that they can always revert to a previous good state of code if it all goes horribly wrong. We use Git for source control, hosted either on Github or on Bitbucket. The differing needs of our platform and data services teams fit the payment plans of the two different sites.

Git is a distributed source control system created by Linus Torvalds, to support the development of Linux. Git is an incredibly flexible system which allows you to do pretty much anything. But what should you do? What should be your strategy for collective code development? It’s easy to look up a particular command to do a particular thing, but less is written on how you should string your git commands together. Here we hope to address this lack.

We use the “No Switch Yard” methodology, this involves creating branches from the master branch on which to develop new features and regularly rebasing against the master branch so that when the time comes the feature branch can be merged into the master branch via a pull request with little fuss. We should not be producing a byzantine system by branching feature branches from other feature branches. The aim of “No Switch Yard” is to make the history as simple as possible and make merging branches back onto master as easy as possible.

How do I start?

Assuming that you already have some code in a repository, create a local clone of that repository:

git clone [email protected]:scraperwiki/myproject.git

Create a branch:

git checkout -b my-new-stuff

Start coding…adding files and committing changes as you go:

git add -u
git commit -m "everything is great"

The -u switch to git add simply checks in all the tracked, uncommitted files. Depending on your levels of paranoia you can push your branch back to the remote repository:

git push

How do I understand what’s going on?

For me the key revelation for workflow was to be able to find out my current state and feel pleasure when it was good! To do this, fetch any changes that may have been made on your repository:

git fetch

and then run:

git log --oneline --graph --decorate --all

To see an ASCII art history diagram for your repository. What you are looking for here is a relatively simple branching structure without too many parallel tracks and with the tips of each branch lined up between your local and the remote copy.
You can make an alias to simplify this inspection:

git config --global alias.lg 'log --oneline --graph --decorate'

Then you can just do:

git lg --all

I know someone else has pushed to the master branch from which I branched – what should I do?

If stuff is going on on your master branch, perhaps because your changes are taking a while to complete, you should rebase. You should also do this just before submitting a pull request to merge your work with the master branch.

git rebase -i

Allows you to rebase interactively, this means you can combine multiple commits into a single larger commit. You might want to do this if you made lots of little commits whilst achieving a single goal. Rebasing brings you up to date with another branch, without actually merging your changes into that branch.

I’m done, how do I give my colleagues the opportunity to work on my great new features?

You need to rebase against the remote branch onto which you wish to merge your code and then submit a pull request for your changes. You can submit a pull request from the web interface at Github or Bitbucket. Or you can use a command line tool such as hub.  The idea of using a pull request is that it makes your changes visible to your colleagues, and keeps a clear record of those changes. If you’ve been rebasing regularly you should be able to merge your code automatically.

An important principle here is “ownership”, in social terms you own your local branch on which you are developing a feature, so you can do what you like with it. The master branch from which you started work is in collective ownership so you should only merge changes onto it with the permission of your colleagues and ideally you want others to look at your changes and approve the pull themselves.

I started doing some fiddling around with my code and now I realise it’s serious and I want to put it on a branch, what did I do?

You need to stash your code, using:

git stash

Then create a branch, as described above, and then retrieve the contents of the stash:

git stash pop

That’s how we use git – what do you do?