Tag: scraperwiki

Inordinately fond of beetles… reloaded!

[Image: Science Museum logo]

This post was first published at ScraperWiki.

Some time ago, in the era before I joined ScraperWiki, I had a play with the Science Museum's object catalogue. You can see my previous blog post here. At the time I was relatively inexperienced with the Python programming language and had no access to Tableau, the visualisation software. It's a piece of work I like to talk about when meeting customers since it's interesting and I don't need to worry about commercial confidentiality.

The title comes from a quote by J.B.S. Haldane, who was asked what his studies in biology had told him about the Creator. His response was that, if He existed, then He was "inordinately fond of beetles".

The Science Museum catalogue comprises three CSV files containing information on objects, media and events. I'm going to focus on the object catalogue since it's the biggest one by a large margin – 255,000 objects in a 137MB file. Each object has an ID number, which often encodes the year in which the object was added to the collection, together with a title and some description. It often has an "item name", which describes the type of object, and there is sometimes information on the date made, the maker, measurements and whether the record represents part or all of an object. Finally, each object is labelled with the collection it comes from and a broad group within that collection; the catalogue contains objects from the Science Museum, National Railway Museum and National Media Museum collections.

The problem with most of these fields is that they don’t appear to come from a controlled vocabulary.

Dusting off my three-year-old code, I was pleased to discover that the SQL I had written to upload the CSV files into a database worked almost first time, bar a little character encoding. The Python code I'd used to clean the data, do some geocoding, analysis and visualisation was not in such a happy state. Or rather, having looked at it, I was not in such a happy state. I appeared to have paid no attention to PEP 8 (the Python style guide), used no source control, written no tests, and I was clearly confused as to how to save a dictionary (I pickled it).
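For anyone wanting to reproduce that load step, a minimal sketch with Python's built-in csv and sqlite3 modules might look like the following – the file name and column names are placeholders for illustration, not the museum's actual schema:

```python
import csv
import sqlite3

# Placeholder database, table and column names -- the real object catalogue has many more fields.
conn = sqlite3.connect("science_museum.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS objects (id TEXT, title TEXT, item_name TEXT, place_made TEXT)"
)

# Explicit encoding matters here, as I discovered with the original files.
with open("objects.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    rows = ((r["id"], r["title"], r["item_name"], r["place_made"]) for r in reader)
    conn.executemany("INSERT INTO objects VALUES (?, ?, ?, ?)", rows)

conn.commit()
conn.close()
```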

In the first iteration I eyeballed the data as a table and identified a whole bunch of stuff I thought I needed to tidy up. This time around I loaded everything into Tableau and visualised everything I could – typically as bar charts. This revealed that my previous clean-up efforts were probably not necessary, since the things I was tidying affected a relatively small number of items. I did need to repeat the geocoding, which I had used to clean up the place of manufacture field because it was encoded inconsistently. Using the Google API via a Python library I could normalise the place names and get their locations as latitude–longitude pairs to plot on a map. I also made sure I had a link back to the original place name description.
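As an illustration of that geocoding step, here is a minimal sketch using the geopy library's Google geocoder – the library choice, API key and place names are placeholders for illustration rather than exactly what my code did:

```python
from geopy.geocoders import GoogleV3

# Placeholder API key and place names; keep the original description alongside the result
# so there is always a link back to the source field.
geocoder = GoogleV3(api_key="YOUR_API_KEY")

places = ["Birmingham, England", "Manchester, England"]
normalised = {}
for place in places:
    location = geocoder.geocode(place)
    if location is not None:
        normalised[place] = {
            "canonical_name": location.address,
            "latitude": location.latitude,
            "longitude": location.longitude,
        }

print(normalised)
```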

The first time around I was excited to discover the Many Eyes implementation of bubble charts; this time I realise bubble charts are not so useful, as you can see below in these charts showing the number of items in each subgroup. In a sorted bar chart it is very obvious which subgroup is most common and what the relative sizes of the subgroups are. I've coloured the bars by the major collection to which they belong: red is the Science Museum, green is the National Railway Museum and orange is the National Media Museum.

[Image: sorted bar chart of items per subgroup, coloured by collection]

Less discerning members of ScraperWiki still liked the bubble charts.

[Image: bubble chart of items per subgroup]

We can see what's in all these collections from the item name field. This is where we discover that the Science Museum is inordinately fond of bottles. The most common items in the collection are posters, mainly from the National Railway Museum, but after that there are bottles, specimen bottles, specimen jars, shops rounds (also bottles), bottle, drug jars and albarellos (also bottles). This is no doubt because bottles are typically made of durable materials like glass and ceramics, have been ubiquitous in many milieux, and may contain many and various interesting things.

[Image: bar chart of the most common item names]

Finally, I plotted the place made for objects in the collection. This works by grouping objects by location, finding the latitude and longitude for each grouped location, and then plotting a disk sized by the number of items originating at that location. I filtered out items whose place made was simply "England" or "London", since these made enormous blobs that dominated the map.
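The grouping step might look something like this in pandas – the file and column names are placeholders rather than the catalogue's actual field names, and the plotting itself I did in Tableau:

```python
import pandas as pd

# Placeholder column names for the geocoded objects table.
objects = pd.read_csv("objects_geocoded.csv")

# Drop the very generic locations that swamp the map.
objects = objects[~objects["place_made"].isin(["England", "London"])]

# One row per location: its coordinates and the number of objects made there.
by_place = (
    objects.groupby(["place_made", "latitude", "longitude"])
    .size()
    .reset_index(name="item_count")
    .sort_values("item_count", ascending=False)
)
print(by_place.head())
```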

[Image: map of places of manufacture, disks sized by number of items]

You can see a live version of these visualisations, and more, on Tableau Public.

It's an interesting pattern that my first action on uploading any data like this to Tableau is to make bar chart frequency plots for each column in the data; this could probably be automated.
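A minimal sketch of that automation, using pandas and matplotlib and assuming a generic CSV file with filesystem-friendly column names, might look like this:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("objects.csv")  # any tabular dataset

# One frequency bar chart per text column, showing the 20 most common values.
for column in df.select_dtypes(include="object").columns:
    counts = df[column].value_counts().head(20)
    counts.plot(kind="bar", title=f"Most common values of '{column}'")
    plt.tight_layout()
    plt.savefig(f"freq_{column}.png")  # assumes column names are safe as file names
    plt.clf()
```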

In summary: the Science Museum is full of bottles and posters, and Tableau wins for initial visualisations of a large and complex dataset.

Book review: Big data by Viktor Mayer-Schönberger and Kenneth Cukier

[Image: Big Data book cover]

This review was first published at ScraperWiki.

We hear a lot about "Big Data" at ScraperWiki. We've always been a bit bemused by the tag, since it seems to be used indiscriminately. Just what is big data, and is there something special I should do with it? Is it even a uniform thing?

I’m giving a workshop on data science next week and one of the topics of interest for the attendees is “Big Data”, so I thought I should investigate in a little more depth what people mean by “Big Data”. Hence I have read Big Data by Viktor Mayer-Schönberger and Kenneth Cukier, subtitled “A Revolution That Will Transform How We Live, Work and Think” – chosen for the large number of reviews it has attracted on Amazon. The subtitle is a guide to their style and exuberance.

Their thesis is that we can define big data, in contrast to earlier “little data”, by three things:

  • It's big but not necessarily that big; their definition of big is that n = all. That is to say, in some domain you take all of the data you can get hold of. They use as one example a study on bout fixing in sumo wrestling, based on data on 64,000 bouts – which would fit comfortably into a spreadsheet. Other data sets discussed are larger, such as credit card transaction data, mobile telephony data, Google's search query data…;
  • Big data is messy, it is perhaps incomplete or poorly encoded. We may not have all the data we want from every event, it may be encoded using free text rather than strict categories and so forth;
  • Working with big data, we must discard an enthusiasm for causality and replace it with correlation; we shouldn't mind too much if our results are just correlations rather than explanations (causation);
  • An implicit fourth element is that the analysis you are going to apply to your big data is some form of machine learning.

I have issues with each of these “novel” features:

Scientists have long collected datasets and done calculations at the limit of (or beyond) their ability to process the data produced: think protein X-ray crystallography, astronomical data for navigation, the CERN detectors and so on. You can think of the decadal censuses run by countries such as the US and UK as n = all. Or the data fed to the early LEO computer to calculate the deliveries required for each of Lyons' hundreds of teashops. The difference today is that people and companies are able to collect a larger quantity of data than ever before, effortlessly and without thinking about it first. The idea of n = all is not really a help. The straw man against which it is placed is the selection of a subset of data by sampling.

They say that big data is messy, implying that what went before was not. One of the failings of the book is its disregard for the researchers who have gone before. According to the authors, the new big data analysts are comfortable with messiness and uncertainty, unlike those fuddy-duddy statisticians! Small data is messy too; scientists and statisticians have long dealt with messy and incomplete data.

The third of their features is that we must be comfortable with correlation rather than demanding causation. There are many circumstances where correlation is fine – such as when Amazon uses my previous browsing and purchase history to suggest new purchases – but the field of machine learning / data mining has long grappled with messiness and causality.

This is not to say nothing has happened in the last 20 or so years regarding data. The ubiquity of computing devices, cheap storage and processing power, and the introduction of frameworks like Hadoop are all significant innovations. But they grow out of what went before; they are not a paradigm shift. Labelling something so ill-defined as "big data" provides no helpful insight as to how to deal with it.

The book could be described as the "What Google Did Next…" playbook. It opens with Google's work on flu trends, and passes through Google's translation work and the Google Books project. It includes examples from many other players, but one gets the impression that it is Google they really like. They are patronising about Amazon for not making full use of the data it gleans from its Kindle ebook ecosystem. They pay somewhat cursory attention to issues of data privacy and consent, and have the unusual idea of creating a cadre of "algorithmists" who would vet the probity of algorithms and applications in the manner of accountants doing audit or data protection officers.

So what is this book good for? It provides a nice range of examples of data analysis and some interesting stories regarding the use to which it has been put. It gives a fair overview of the value of data analysis and some of the risks it presents. It highlights that the term “big data” is used so broadly that it conveys little meaning. This confusion over what is meant by “Big Data” is reflected on the datascience@Berkeley blog which lists definitions of big data from 30 people in the field (here). Finally, it provides me with sufficient cover to make a supportable claim that I am a “Big Data scientist”!

To my mind, the best definition of big data that I’ve seen is that it is like teenage sex…

  • Everyone talks about it,
  • nobody really knows how to do it,
  • everyone thinks everyone else is doing it,
  • so everyone claims they are doing it too!

Book review: Learning SPARQL by Bob DuCharme

[Image: Learning SPARQL book cover]

This review was first published at ScraperWiki.

The NewsReader project on which we are working at ScraperWiki uses semantic web technology and natural language processing to derive meaning from the news. We are building a simple API to give access to the NewsReader datastore, whose native interface is SPARQL. SPARQL is a SQL-like query language used to access data stored in the Resource Description Framework format (RDF).

I reached Bob DuCharme's book, Learning SPARQL, through an idle tweet mentioning SPARQL, to which his book's account replied. The book covers the fundamentals of the semantic web and linked data, the RDF standard, the SPARQL query language, performance, and building applications on SPARQL. It also talks about ontologies and inferencing, which are built on top of RDF.

As someone with a slight background in SQL and table-based databases, my previous forays into the semantic web have been fraught, since I typically start by asking what the schema for an RDF store is. The answer to this question is "That's the wrong question". The triplestore is the basis of all RDF applications; as the name implies, each row contains a triple (i.e. three columns), traditionally labelled subject, predicate and object. I found it easier to think in terms of resource, property name and property value. To give a concrete example, "David Beckham" is a resource, his height is the name of a property of David Beckham and, according to dbpedia, the value of this property is 1.8288 (metres, we must assume). The resource and property names must be provided in the form of URIs (uniform resource identifiers); the property value can be a URI or a plain typed value such as a string or an integer.
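That David Beckham example can be written down as a single triple; here is a minimal sketch using the Python rdflib library (my choice for illustration – the book itself is not tied to Python, and the exact dbpedia property name is an assumption):

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

DBR = Namespace("http://dbpedia.org/resource/")   # dbpedia resources
DBO = Namespace("http://dbpedia.org/ontology/")   # dbpedia ontology properties

g = Graph()
# Subject (resource), predicate (property name), object (property value).
g.add((
    DBR["David_Beckham"],                        # resource: a URI
    DBO["height"],                               # property name: also a URI
    Literal("1.8288", datatype=XSD.decimal),     # property value: a typed literal
))

for subject, predicate, obj in g:
    print(subject, predicate, obj)
```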

The triples describe a network of nodes (the resources and property values), with the property names being the links between them; with this infrastructure any network can be described by a set of triples. SPARQL is a query language that superficially looks much like SQL. It can extract arbitrary sets of properties from the network using the SELECT command, get a valid sub-network described by a set of triples using the CONSTRUCT command, and answer a Yes/No question using the ASK command. It can also tell you "everything" it knows about a particular URI using the DESCRIBE command, where "everything" is subject to the whim of the implementor. It supports a bunch of other commands which feel familiar to SQListas, such as LIMIT, OFFSET, FROM, WHERE, UNION, ORDER BY, GROUP BY, and AS. In addition there are BIND, which allows the transformation of variables by functions, and VALUES, which allows you to make little data structures for use within queries. PREFIX provides shortcuts for domains of URIs; for example, http://dbpedia.org/resource/David_Beckham can be written dbpedia:David_Beckham, where dbpedia: is the prefix. SERVICE allows you to make queries across the internet to other SPARQL providers, and OPTIONAL allows the addition of a variable which is not always present.

The core of a SPARQL query is a list of triple patterns, which act as selectors for the triples required, and FILTERs, which further restrict the results by carrying out calculations on the individual members of the triple. Each triple pattern is terminated with " ." or " ;"; the latter indicates that the next pattern is written as a pair, with its first element (the subject) the same as that of the current pattern. I mention this because Googling for the meaning of punctuation is rarely successful.
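Putting a few of those pieces together, here is a small illustrative query against dbpedia run from Python with the SPARQLWrapper library; the dbo:height property and the threshold are just assumptions chosen to show the " ;" shorthand and a FILTER in action:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?property ?value
WHERE {
  # Two triple patterns sharing the same subject, joined with ";".
  dbr:David_Beckham dbo:height ?height ;
                    ?property  ?value .
  # FILTER restricts the matches by a calculation on one of the variables.
  FILTER(?height > 1.8)
}
LIMIT 10
"""

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for row in results["results"]["bindings"]:
    print(row["property"]["value"], "->", row["value"]["value"])
```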

Whilst reading this book I've moved from SPARQL querying by search, to queries written by slight modification of existing queries, to celebrating writing my own queries, to writing successful queries no longer being a cause for celebration!

There are some features in SPARQL that I haven't yet used in anger: "paths", which expand queries so that they select not just a triple defining a node with a single link but longer chains of links, and inferencing. Inferencing allows the creation of virtual triples. For example, if we know that Brian is the patient of a doctor called Jane, and we have an inferencing engine which also knows that "patient" is the inverse of "doctor", then we don't need to state explicitly that Jane has a patient called Brian.

The book ends with a cookbook of queries for exploring a new data source, which is useful but needs to be used with a little caution when querying large databases. Most of the book is oriented around running a SPARQL client against files stored locally. I skipped this step, mainly using YASGUI to query the NewsReader data and the SNORQL interface to dbpedia.

Overall summary: a readable introduction to the semantic web and the SPARQL query language.

If you want to see the fruits of my reading then there are still places available on the NewsReader Hack Day in London on 10th June.

Sign up here!

The London Underground: Should I walk it?

[Image: London Underground logo]

This post was first published at ScraperWiki.

With a second tube strike scheduled for Tuesday, I thought I should provide a useful little tool to help travellers cope! It is not obvious from the tube map, but London Underground stations can be surprisingly close together – well within walking distance.

Using this tool, you can select a tube station and the map will show you those stations which are within a mile and a half of it. 1.5 miles is my definition of a reasonable walking distance. If you don’t like it you can change it!

The tool is built using Tableau. The tricky part was allowing the selection of one station and measuring distances to all the others. Fortunately it’s a problem which has been solved, and documented, by Jonathan Drummey over on the Drawing with Numbers blog.

I used Euston as an origin station to demonstrate in the image below. I've been working at the Government Digital Service (GDS), sited opposite Holborn underground station, for the last couple of months. Euston is my mainline arrival station, and I walk down the road to Holborn. Euston is coloured red in the map, and stations within a mile and a half are coloured orange. The label for Holborn does not appear by default, but it's the station between Chancery Lane and Tottenham Court Road. In the bottom right is a table which lists the walking distance to each station; Holborn appears just off the bottom and indicates a 17-minute walk – which is about right.

[Image: the "Should I walk it?" map, with Euston selected as the origin station]

The map can be controlled by moving to the top left and using the controls that should appear there. Shift+left mouse button allows panning of the map view. A little glitch which I haven't sorted out is that when you change the origin station, the table of stations does not re-sort automatically; the user must click around the distance label to re-sort. Any advice on how to make this happen automatically would be most welcome.

Distances and timings are approximate. I have the latitude and longitude of all the stations from my earlier London Underground project, which you can see here. I calculate the distances by taking the Euclidean distance between stations in angular units and multiplying by a factor which gives distances approximately the same as those in Google Maps. So it isn't a true "as the crow flies" distance, but it is proportional to it. The walking times are calculated by assuming a walking speed of 3 miles an hour. If you put your cursor over a station you'll see its name along with the walking time and distance from your origin station.
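A minimal sketch of that calculation – the degrees-to-miles factor below is a placeholder standing in for the value I tuned against Google Maps, and the station coordinates are approximate:

```python
import math

# Placeholder factor converting Euclidean distance in degrees to miles around London;
# in practice this was tuned so that results roughly match Google Maps.
DEGREES_TO_MILES = 60.0
WALKING_SPEED_MPH = 3.0

def walking_estimate(lat1, lon1, lat2, lon2):
    """Approximate straight-line distance (miles) and walking time (minutes)."""
    distance = DEGREES_TO_MILES * math.hypot(lat2 - lat1, lon2 - lon1)
    minutes = 60.0 * distance / WALKING_SPEED_MPH
    return distance, minutes

# Example: Euston to Holborn (approximate coordinates).
print(walking_estimate(51.5282, -0.1337, 51.5174, -0.1201))
```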

A more sophisticated approach would be to extract actual walking routes from Google Maps and use those to calculate distances and times. This would be rather more complicated to do and most likely not worth the effort, except if you are going south of the river.

Mine is not the only effort in this area; you can see a static map of walking distances here.

Book review: Data Science for Business by Provost and Fawcett

[Image: Data Science for Business book cover]

This review was first published at ScraperWiki.

Marginalia are an insight into the mind of another reader. This struck me as I read Data Science for Business by Foster Provost and Tom Fawcett. My copy of the book had previously been read by two of my colleagues, one of whom had clearly read the introductory and concluding chapters but not the bit in between – and who would probably not be described as a capitalist "red in tooth and claw"! My own marginalia are generally hidden, since I have an almost religious aversion to defacing a book in any way. I do use Evernote to take notes as I go, though, so for this review I'll reveal them here.

Data Science for Business is the book I wasn't going to read, since I've already read Machine Learning in Action, Data Mining: Practical Machine Learning Tools and Techniques, and Mining the Social Web. However, I gave in to peer pressure. The pitch for the book is that it is for people who will manage data scientists rather than necessarily be data scientists themselves. The implication is that you're paying these data scientists to increase your profits, so you had better make sure that's what they'll do. You need to be able to understand what data science can and cannot do, ask reasonable questions of data scientists about their models, and understand the environment a data scientist needs to thrive.

The book covers several key algorithms: decision trees, support vector machines, logistic regression, k-Nearest Neighbours and term frequency-inverse document frequency (TF-IDF) but not in any great depth of implementation. To my mind it is surprisingly mathematical in places, given the intended audience of managers rather than scientists.
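None of these algorithms needs implementing from scratch nowadays; as a minimal sketch of my own (using scikit-learn and a made-up dataset – the book itself is library-agnostic), the first two might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A made-up stand-in for a churn dataset: rows are customers, the target is "churned or not".
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (DecisionTreeClassifier(max_depth=4), LogisticRegression(max_iter=1000)):
    model.fit(X_train, y_train)
    print(type(model).__name__, "accuracy:", model.score(X_test, y_test))
```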

The strengths of the book are in the explanations of the algorithms in visual terms, and in its focus on the expected value framework for evaluating data mining models. Diversity of explanation is always a good thing; read enough different explanations and one will speak directly to you. It also spends more of its time discussing practical applications than other books on data mining. An example on "churn" runs through the book: "churn" is the loss of customers at the end of a contract, with the telecoms industry used as the illustration.

A couple of nuggets I picked up:

  • You can think of different machine learning algorithms in terms of the decision boundary they produce and how that looks. Overfitting becomes a decision boundary which is disturbingly intricate. Support vector machines put the decision boundary as far away from the classes they separate as possible;
  • You need to make sure that the attributes that you use to build your model will be available at the point of use. That’s to say there is no point in building a model for churn which needs an attribute from a customer which is only available just after they’ve left you. Sounds a bit obvious but I can easily see myself making this mistake;
  • The expected value framework for evaluating models. This combines the probability of an event (for example, the success of a promotion campaign) with the value of the outcome. Again, churn makes a useful demonstration: if you have the choice between a promotion which is successful with 10 users with an average spend of £10 per year or 1 user with an average spend of £200, then you should obviously go with the latter rather than the former (see the sketch after this list). This reminds me of expectation values in quantum mechanics and in statistical physics.
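As a tiny worked version of that framework, using the toy numbers from the last bullet point (counts stand in here for probabilities scaled by audience size):

```python
# Expected value: probability (or count) of each outcome times its value, summed over outcomes.
def expected_value(outcomes):
    """outcomes: iterable of (probability_or_count, value) pairs."""
    return sum(p * v for p, v in outcomes)

# Promotion A: 10 customers retained, average spend £10/year each.
# Promotion B: 1 customer retained, average spend £200/year.
promotion_a = expected_value([(10, 10.0)])
promotion_b = expected_value([(1, 200.0)])
print(promotion_a, promotion_b)  # 100.0 vs 200.0 -- promotion B wins
```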

The title of the book and the related reading demonstrate that data science, machine learning and data mining are used synonymously. I had a quick look at the popularity of these terms over the last few years; you can see the results in the Google Ngram viewer here. Somewhat to my surprise, data science still lags far behind the other terms despite the recent buzz, perhaps because Google only exposes data up to 2008.

Which book should you read?

All of them!

If you must buy only one then make it Data Mining: it is encyclopaedic, covering the high-level business overview, toy implementations and detailed implementation in some depth. If you want to see the code, then get Machine Learning in Action – but be aware that ultimately you are most likely going to be using someone else's implementation of the core machine learning algorithms. Mining the Social Web is excellent if you want to see the code and are particularly interested in social media. And read Data Science for Business if you are the intended managerial audience or someone who will be doing data mining in a commercial environment.