May 29 2014
Book review: Learning SPARQL by Bob DuCharme
This review was first published at ScraperWiki.
The NewsReader project on which we are working at ScraperWiki uses semantic web technology and natural language processing to derive meaning from the news. We are building a simple API to give access to the NewsReader datastore, whose native interface is SPARQL. SPARQL is a SQL-like query language used to access data stored in the Resource Description Framework format (RDF).
I reached Bob DuCharme’s book, Learning SPARQL, through an idle tweet mentioning SPARQL, to which his book’s account replied. The book covers the fundamentals of the semantic web and linked data, the RDF standard, the SPARQL query language, performance, and building applications on SPARQL. It also talks about ontologies and inferencing, which are built on top of RDF.
As someone with a slight background in SQL and table-based databases, my previous forays into the semantic web have been fraught because I typically start by asking what the schema for an RDF store is. The answer to this question is “That’s the wrong question”. The triplestore is the basis of all RDF applications; as the name implies, each row contains a triple (i.e. three columns), traditionally labelled subject, predicate and object. I found it easier to think in terms of resource, property name and property value. To give a concrete example, “David Beckham” is a resource, his height is the name of a property of David Beckham and, according to dbpedia, the value of this property is 1.8288 (metres, we must assume). The resource and property names must be provided in the form of URIs (Uniform Resource Identifiers); the property value can be a URI or a conventionally typed entity such as a string or an integer.
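To make that example concrete, here is a minimal sketch of pulling the Beckham height triple from dbpedia’s public SPARQL endpoint with Python. The endpoint URL and the dbo:height property name are assumptions based on how dbpedia is usually organised, not something taken from the book; any other store will have its own vocabulary.

```python
# A minimal sketch, not the book's own example: ask dbpedia for the
# David Beckham height triple. The dbo:height property name is an assumption.
import requests

query = """
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?height
WHERE {
  dbr:David_Beckham dbo:height ?height .   # subject, predicate (property name), object (property value)
}
"""

response = requests.get(
    "http://dbpedia.org/sparql",
    params={"query": query, "format": "application/sparql-results+json"},
)
response.raise_for_status()

for row in response.json()["results"]["bindings"]:
    print(row["height"]["value"])   # expect something like "1.8288"
```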
The triples describe a network of nodes (the resources and property values), with the property names forming the links between them; with this infrastructure any network can be described by a set of triples. SPARQL is a query language that superficially looks much like SQL. It can extract arbitrary sets of properties from the network using the SELECT command, get a valid sub-network described by a set of triples using the CONSTRUCT command, and answer a question with a yes/no answer using the ASK command. It can also tell you “everything” it knows about a particular URI using the DESCRIBE command, where “everything” is subject to the whim of the implementor. It supports a bunch of other commands which will feel familiar to SQListas, such as LIMIT, OFFSET, FROM, WHERE, UNION, ORDER BY, GROUP BY, and AS. In addition there are BIND, which allows the transformation of variables by functions, and VALUES, which allows you to make little data structures for use within queries. PREFIX provides shortcuts for domains of URIs: for example, http://dbpedia.org/resource/David_Beckham can be written dbpedia:David_Beckham, where dbpedia: is the prefix. SERVICE allows you to make queries across the internet to other SPARQL providers, and OPTIONAL allows the addition of a variable which is not always present.
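As a sketch of how several of these keywords fit together, the query below combines PREFIX, SELECT, OPTIONAL, ORDER BY and LIMIT. The dbo: property names are illustrative assumptions rather than anything from the book; I show it as a Python string because that is how queries tend to end up embedded in scripts.

```python
# An illustrative query combining several SPARQL keywords; the dbo: property
# names are assumptions made for the sake of example.
query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?player ?height ?birthPlace
WHERE {
  ?player dbo:team   dbr:England_national_football_team .
  ?player dbo:height ?height .
  OPTIONAL { ?player dbo:birthPlace ?birthPlace }   # keep rows even if birthplace is missing
}
ORDER BY DESC(?height)
LIMIT 10
"""
```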
The core of a SPARQL query is a list of triple patterns which act as selectors for the triples required, plus FILTERs which further filter the results by carrying out calculations on the individual members of the triple. Each selector triple is terminated with “ .” or “ ;”; the semicolon indicates that the next pattern shares the same subject as the current one, so only the predicate and object need be written. I mention this because Googling for the meaning of punctuation is rarely successful.
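Here is a small sketch of that punctuation in practice; the property names are again dbpedia-flavoured assumptions.

```python
# "." ends a group of triple patterns; ";" says the next pattern reuses the
# same subject, so only the predicate and object are written. FILTER trims
# the results with a calculation on the bound values.
query = """
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?person ?height ?birthDate
WHERE {
  ?person dbo:height    ?height ;      # ";" : next pattern shares ?person as subject
          dbo:birthDate ?birthDate .   # "." : this block of patterns ends here
  FILTER (?height > 1.9)               # keep only people taller than 1.9 (metres, we assume)
}
LIMIT 20
"""
```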
Whilst reading this book I’ve moved from writing SPARQL queries by searching for examples, to writing queries by slight modification of existing ones, to celebrating writing my own queries, to writing successful queries no longer being a cause for celebration!
There are some features in SPARQL that I haven’t yet used in anger. “Paths” expand queries so that they select not just a single link between two nodes but longer chains of links. Inferencing allows the creation of virtual triples: for example, if we know that Brian is the patient of a doctor called Jane, and our inferencing engine also holds the information that “patient” is the inverse of “doctor”, then we don’t need to state explicitly that Jane has a patient called Brian.
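For what it’s worth, a property path looks like this in SPARQL 1.1; the foaf: vocabulary and the starting resource are assumptions for illustration, not the book’s example.

```python
# A property path: foaf:knows+ follows a chain of one or more "knows" links,
# rather than the single link an ordinary triple pattern would match.
query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex:   <http://example.org/people/>

SELECT ?contact
WHERE {
  ex:Brian foaf:knows+ ?contact .   # friends, friends-of-friends, and so on
}
"""
```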
The book ends with a cookbook of queries for exploring a new data source, which is useful but needs to be used with a little caution when querying large databases. Most of the book is oriented around running a SPARQL client against files stored locally. I skipped this step, mainly using YASGUI to query the NewsReader data and the SNORQL interface to dbpedia.
In summary: a readable introduction to the semantic web and the SPARQL query language.
If you want to see the fruits of my reading then there are still places available on the NewsReader Hack Day in London on 10th June.
Sign up here!
May 15 2014
Book review: The Undercover Economist Strikes Back by Tim Harford
What have you been reading?
Tim Harford’s latest book, "The Undercover Economist Strikes Back". It’s about macroeconomics, a sort of blagger’s guide.
Who’s Tim Harford?
Tim Harford is a writer and broadcaster. I’ve also read his books The Undercover Economist, about microeconomics and Adapt, about trial and error in business, government and aid. When I get the time I listen to his radio programme More or Less, about statistics and numbers, and also read his newspaper column.
Hey, what’s going on here? You keep writing down the questions I’m asking!
Yes, this is how Strikes Back is written. At the beginning I found it a bit irritating but, as you can see, I’ve taken to it. It recalls the Socratic dialogue method and Galileo’s book, Dialogue Concerning the Two Chief World Systems. The advantage is that it structures the text very nicely and is likely rather SEO friendly.
OK, I’ll play along – tell me more about the book
The book starts by introducing Bill Phillips and his MONIAC machine, which simulated the economy, in macroeconomic terms, using water, pipes, tanks and valves.
That’s a bizarre idea, why didn’t he use a computer?
Phillips was working in the period immediately after the Second World War, when computers weren’t that common. Also, it turns out that solving certain types of equations is more easily done using an analogue computer – such as MONIAC.
Back up a bit, what’s macroeconomics?
Macroeconomics is the study of the large scale features of the economy such as the growth in Gross Domestic Product (GDP), unemployment, inflation and so forth. Contrast this to microeconomics which is about how much you pay for your cup of tea (and other things).
What’s the point of this, didn’t someone describe economics as the “dismal science”?
Yes, they did, but this treats economics a little unfairly. One of Harford’s pleas in the book is to accept the humanity of economists. They aren’t just interested in numbers, they are interested in making numbers work for people. In particular, unemployment is recognised as a great ill which should be minimised, and the argument is over how this should be achieved rather than whether it should be achieved.
Tell me something about macroeconomics
There is a great divide in economics between the Keynesians and the classical economists. The crux of their divide is how they treat a recession. The former believe that the economy needs stimulus in times of recession, in terms of increased “printing of money”. The latter believe that the economy is a well-oiled machine that gets derailed by external shocks; in happy times there are other external shocks that pass off relatively benignly. The classicists are less keen on stimulus, believing that the economy will sort itself out naturally as it responds to the external shocks. These approaches can be captured in toy economies.
Tim Harford cites two examples: a babysitting collective in Washington DC and the economy of a prisoner of war camp. The former is a case of a malfunctioning economy fixed by Keynesian means: the collective worked by parents agreeing to babysit in exchange for vouchers which represented a period of babysitting. But the number of vouchers in circulation was limited, so parents were reluctant to spend their scarce vouchers on a night out. This was resolved by printing more babysitting vouchers: a Keynesian stimulus.
The prisoner of war camp suffered a different problem: towards the end of the war the price of goods went up as the supply of Red Cross parcels dried up. Here there was nothing to be done; the de facto unit of currency was the cigarette, and nothing could be done to increase its limited supply.
It’s all about money, isn’t it?
Yes, Harford highlights that money fulfils three different functions. It’s a medium of exchange, to save us from bartering. It’s a store of value: we can keep money under the bed for the future, something we couldn’t easily do if our wealth were held as perishable goods. And it is a “unit of account”, a way of summing up your net worth over a range of assets.
Is The Undercover Economist Strikes Back worth reading?
I’d say a definite “yes”. We’ve all been watching macroeconomics playing out in lively form over the last few years as the recession hit and is now receding. Harford gives a clear, intelligent guide to the issues at hand and some of the background that is left unstated by politicians and in the news. Harford points out that our political habits don’t really match our economic needs. Ideally we would have abstemious, right-wing governments in the boom years and somewhat more spendthrift left-wing ones during recessions. He ends with a call for more experimentation in macroeconomics, harking back to his book Adapt, and also highlights some shortcomings of macroeconomics as studied today: it does not consider behavioural economics, complexity theory or even banks.
There’s much more in the book than I’ve summarised here.
May 04 2014
The London Underground: Should I walk it?
This post was first published at ScraperWiki.
With a second tube strike scheduled for Tuesday I thought I should provide a useful little tool to help travellers cope! It is not obvious from the tube map, but London Underground stations can be surprisingly close together, well within walking distance.
Using this tool, you can select a tube station and the map will show you those stations which are within a mile and a half of it. 1.5 miles is my definition of a reasonable walking distance. If you don’t like it you can change it!
The tool is built using Tableau. The tricky part was allowing the selection of one station and measuring distances to all the others. Fortunately it’s a problem which has been solved, and documented, by Jonathan Drummey over on the Drawing with Numbers blog.
I used Euston as an origin station to demonstrate in the image below. I’ve been working at the Government Digital Service (GDS), sited opposite Holborn underground station, for the last couple of months. Euston is my mainline arrival station and I walk down the road to Holborn. Euston is coloured red in the map, and stations within a mile and a half are coloured orange. The label for Holborn does not appear by default but it’s the one between Chancery Lane and Tottenham Court Road. In the bottom right is a table which lists the walking distance to each station, Holborn appears just off the bottom and indicates a 17 minute walk – which is about right.
The map can be controlled by moving to the top left and using the controls that should appear there. Shift+left mouse button allows panning of the map view. A little glitch which I haven’t sorted out is that when you change the origin station the table of stations does not re-sort automatically, the user must click around the distance label to re-sort. Any advice on how to make this happen automatically would be most welcome.
Distances and timings are approximate. I have the latitude and longitude for all the stations following my earlier London Underground project, which you can see here. I calculate the distances by taking the Euclidean distance between stations in angular units and multiplying by a factor which gives distances approximately the same as those in Google Maps, so it isn’t a true “as the crow flies” distance but is proportional to it. The walking times are calculated by assuming a walking speed of 3 miles an hour. If you put your cursor over a station you’ll see the name of the station with the walking time and distance from your origin station.
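For the curious, here is a small sketch of the sort of calculation involved, with approximate coordinates and a made-up scale factor standing in for the one I tuned against Google Maps.

```python
# A sketch of the distance and walking-time estimate; the coordinates are
# approximate and MILES_PER_DEGREE is a hypothetical tuning factor.
import math

WALKING_SPEED_MPH = 3.0
MILES_PER_DEGREE = 60.0   # assumed scale factor from angular units to miles

stations = {
    "Euston":  (51.528, -0.133),   # approximate latitude, longitude
    "Holborn": (51.517, -0.120),
}

def walking_estimate(origin, destination):
    lat1, lon1 = stations[origin]
    lat2, lon2 = stations[destination]
    angular = math.hypot(lat2 - lat1, lon2 - lon1)   # Euclidean distance in degrees
    miles = angular * MILES_PER_DEGREE
    minutes = miles / WALKING_SPEED_MPH * 60
    return miles, minutes

miles, minutes = walking_estimate("Euston", "Holborn")
print(f"Euston to Holborn: {miles:.2f} miles, roughly {minutes:.0f} minutes on foot")
```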
A more sophisticated approach would be to extract more walking routes from Google Maps and use that to calculate distances and times. This would be rather more complicated to do and most likely not worth the effort, except if you are going South of the river.
Mine is not the only effort in this area; you can see a static map of walking distances here.
May 02 2014
Book review: Data Science for Business by Provost and Fawcett
This review was first published at ScraperWiki.
Marginalia are an insight into the mind of another reader. This struck me as I read Data Science for Business by Foster Provost and Tom Fawcett. The copy of the book had previously been read by two of my colleagues, one of whom had clearly read the introductory and concluding chapters but not the bit in between. Also, they would probably not be described as a capitalist, “red in tooth and claw”! My marginalia have generally been hidden since I have an almost religious aversion to defacing a book in any way. I do use Evernote to take notes as I go, though, so for this review I’ll reveal them here.
Data Science for Business is the book I wasn’t going to read, since I’ve already read Machine Learning in Action, Data Mining: Practical Machine Learning Tools and Techniques, and Mining the Social Web. However, I gave in to peer pressure. The pitch for the book is that it is for people who will manage data scientists rather than necessarily be data scientists themselves. The implication here is that you’re paying these data scientists to increase your profits, so you had better make sure that’s what they’ll do. You need to be able to understand what data science can and cannot do, ask reasonable questions of data scientists about their models, and understand the environment the data scientist needs to thrive.
The book covers several key algorithms: decision trees, support vector machines, logistic regression, k-Nearest Neighbours and term frequency-inverse document frequency (TF-IDF) but not in any great depth of implementation. To my mind it is surprisingly mathematical in places, given the intended audience of managers rather than scientists.
The strengths of the book are in the explanations of the algorithms in visual terms, and in its focus on the expected value framework for evaluating data mining models. Diversity of explanation is always a good thing; read enough different explanations and one will speak directly to you. It also spends more of its time discussing practical applications than other books on data mining. An example on “churn” runs through the book. “Churn” is the loss of customers at the end of a contract, in this case the telecom industry is used as an illustration.
A couple of nuggets I picked up:
- You can think of different machine learning algorithms in terms of the decision boundary they produce and how that looks. Overfitting becomes a decision boundary which is disturbingly intricate. Support vector machines put the decision boundary as far away from the classes they separate as possible;
- You need to make sure that the attributes that you use to build your model will be available at the point of use. That’s to say there is no point in building a model for churn which needs an attribute from a customer which is only available just after they’ve left you. Sounds a bit obvious but I can easily see myself making this mistake;
- The expected value framework for evaluating models. This combines the probability of an event (for example, the result of a promotion campaign) with the value of the outcome. Again churn makes a useful demonstration: if you have the choice between a promotion which succeeds with 10 users with an average spend of £10 per year or 1 user with an average spend of £200, then you should obviously go with the latter rather than the former. This reminds me of expectation values in quantum mechanics and in statistical physics; a small sketch of the calculation follows this list.
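As promised, a tiny sketch of the expected value idea; the response rates and customer values below are made-up assumptions, not numbers from the book.

```python
# Expected value of targeting one customer: probability of a response times
# the value of that response, minus the cost of making the contact.
def expected_value(p_response, value_of_response, cost_of_contact=0.0):
    return p_response * value_of_response - cost_of_contact

# A promotion that rarely lands but retains a high-value customer can still
# beat one that often lands on low-value customers.
low_value  = expected_value(p_response=0.10, value_of_response=10.0)    # £1.00 per contact
high_value = expected_value(p_response=0.05, value_of_response=200.0)   # £10.00 per contact
print(f"low-value segment: £{low_value:.2f}, high-value segment: £{high_value:.2f}")
```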
The title of the book, and the related reading, demonstrate that data science, machine learning and data mining are used synonymously. I had a quick look at the popularity of these terms over the last few years; you can see the results in the Google Ngram viewer here. Somewhat to my surprise data science still lags far behind the other terms despite the recent buzz, perhaps because Google only exposes data up to 2008.
Which book should you read?
All of them!
If you must buy only one then make it Data Mining: it is encyclopaedic, covering the high-level business overview, toy implementations and detailed implementation in some depth. If you want to see the code, then get Machine Learning in Action – but be aware that ultimately you are most likely going to be using someone else’s implementation of the core machine learning algorithms. Mining the Social Web is excellent if you want to see the code and are particularly interested in social media. And read Data Science for Business if you are the intended managerial audience or someone who will be doing data mining in a commercial environment.
Apr 28 2014
Visualising the London Underground with Tableau
This post was first published at ScraperWiki.
I’ve always thought of the London Underground as a sort of teleportation system. You enter a portal in one place and, with relatively little effort, appear at a portal in another place. Although in Star Trek our heroes entered a special room and stood well-separated on platforms, rather than packing themselves into metal tubes.
I read Christian Wolmar’s book, The Subterranean Railway, about the history of the London Underground a while ago. At the time I wished for a visualisation of the growth of the network, since the text description was a bit confusing. Fast forward a few months, and I find myself repeatedly in London wondering at the horror of the rush hour underground. How do I avoid being forced into some sort of human compression experiment?
Both of these questions can be answered with a little judicious visualisation!
First up, the history question. It turns out that other obsessives have already made a table containing a list of the opening dates for London Underground stations. You can find it here, on wikipedia. These sortable tables are a little tricky to scrape: they can be copy-pasted into Excel, but random blank rows appear, and the data used to control the sorting of the columns did confuse our Table Xtract tool, until I fixed it – just to solve my little problem! You can see the number of stations opened in each year in the chart below. It all started in 1863; electric trains were introduced in the very final years of the 19th century, leading to a burst of activity. Then things went quiet after the Second World War, when the car came to dominate transport.
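If you’d rather skip the copy-paste-into-Excel route, here is a hedged sketch of doing a similar extraction with pandas rather than Table Xtract; the page URL and column names are assumptions, and the table layout may well have changed since I did this.

```python
# A sketch of scraping the wikipedia station table with pandas; the URL and
# the "Opened" column name are assumptions about the current page layout.
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_London_Underground_stations"
tables = pd.read_html(url)    # one DataFrame per HTML table on the page
stations = tables[0]          # assume the first table is the station list

# Pull a four-digit year out of the "Opened" column and count openings per year
stations["year"] = (
    stations["Opened"].astype(str).str.extract(r"(\d{4})", expand=False).astype(float)
)
openings_per_year = stations.groupby("year").size()
print(openings_per_year.head(10))
```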
Originally I had this chart coloured by underground line, but this is rather misleading since the wikipedia table gives the line a station is currently on rather than the one it was originally built for. For example, Stanmore station opened in 1932 as part of the Metropolitan line; it was transferred to the Bakerloo line in 1939 and then to the Jubilee line in 1979. You can see the years in which lines opened here on wikipedia, where it becomes apparent that the name of an underground line is fluid.
So I have my station opening date data. How about station locations? Well, they too are available thanks to the work of folk at Openstreetmap; you can find that data here. Latitude-longitude coordinates are all very well, but really we also need the connectivity, and what about Harry Beck’s iconic “circuit diagram” tube map? It turns out both of these issues can be addressed by digitizing station locations from the modern version of Beck’s map. I have to admit this was a slightly laborious process: I used ImageJ to manually extract coordinates.
I’ve shown the underground map coloured by the age of stations below.
Deep reds for the oldest stations, on the Metropolitan and District lines built in the second half of the 19th century. Pale blue for middle aged stations, the Central line heading out to Epping and West Ruislip. And finally the most recent stations on the Jubilee line towards Canary Wharf and North Greenwich are a darker blue.
Next up is traffic, or how many people use the underground. The wikipedia page contains information on usage, in terms of millions of passengers per year (for 2012), covering both entries and exits. I’ve shown this data below, with the traffic at individual stations indicated by the thickness of the line.
I rather like a “fat lines” presentation of the number of people using a station: the fatter the line at the station, the more people going in and out. Of course some stations have multiple lines so get an unfair advantage. Correcting for this, it turns out Canary Wharf is the busiest station on the underground; thankfully it’s built for it. Small above ground, beneath it is a massive, cathedral-like space.
More data is available as a result of a Freedom of Information request (here) which gives figures broken down by passenger action (boarding or alighting), underground line, direction of travel and time of day, in fairly coarse chunks of the day. I use this data in the chart below to measure the “commuteriness” of each station. To do this I take the ratio of people boarding trains in the 7am-10am time slot to those boarding in the 4pm-7pm slot. For locations with lots of commuters this will be a big number, because lots of people get on the train to go to work in the morning but not many get on the train in the evening; that’s when everyone is getting off the train to go home. A small sketch of the calculation follows the list below.
By this measure the top five locations for “commuteriness” are:
- Pinner
- Ruislip Manor
- Elm Park
- Upminster Bridge
- Burnt Oak
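As mentioned above, the calculation behind this list is nothing more than a ratio; here it is sketched with made-up boarding counts (the real figures come from the TfL Freedom of Information data).

```python
# "Commuteriness": morning boardings divided by evening boardings. High values
# suggest a dormitory suburb; the counts below are made up for illustration.
morning_boarders = {"Pinner": 3200, "Canary Wharf": 11000}   # 7am-10am
evening_boarders = {"Pinner": 400,  "Canary Wharf": 21000}   # 4pm-7pm

def commuteriness(station):
    return morning_boarders[station] / evening_boarders[station]

for station in sorted(morning_boarders, key=commuteriness, reverse=True):
    print(f"{station}: {commuteriness(station):.1f}")
```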
It was difficult not to get sidetracked during this project. Someone used the Freedom of Information Act to get the depths of all of the underground stations, so obviously I had to include that data too! The deepest underground station is Hampstead, in part because the station itself is at the top of a steep hill.
I’ve made all of this data into a Tableau visualisation which you can play with here. The interactive version shows you details of the stations as your cursor floats over them, allows you to select individual lines, and lets you change the data overlaid on the map, including the depth and altitude data.