Book review: Python for Data Analysis by Wes McKinney


This review was first published at ScraperWiki.

As well as developing scrapers and a data platform, at ScraperWiki we also do data analysis. Some of this is just because we’re interested; other times it’s because clients don’t have the tools or the time to do the analysis they want themselves. Often the problem is the size of the data. Excel is the universal solvent for data analysis problems – go look at any survey of data scientists. But Excel has its limitations. There are technical limits, such as a maximum of roughly a million rows, but well before that size Excel becomes a pain to use.

There is another path – the programming route. As a physical scientist of moderate age I’ve followed these two data analysis paths in parallel. Excel for the quick look-see and some presentation. Programming for bigger tasks, tasks I want to do repeatedly and types of data Excel simply can’t handle – like image data. For me the programming path started with FORTRAN and the NAG libraries, from which I moved into Matlab. FORTRAN is pure, traditional programming born in the days when you had to light your own computing fire. Matlab and competitors like Mathematica, R and IDL follow a slightly different path. At their core they are specialist programming languages but they come embedded in graphical environments which can be used interactively. You type code at a prompt and stuff happens: plots pop up and so forth. You can capture this interaction and put it into scripts/programs, or simply write programs from scratch.

Outside the physical sciences, data analysis often means databases. Physical scientists are largely interested in numbers; other sciences and business analysts are often interested in a mixture of numbers and categorical things. For example, in analysing the performance of a drug you may be interested in the dose (i.e. a number) but also in categorical features of the patient such as gender and their symptoms. Databases, and analysis packages such as R and SAS, are better suited to this type of data. Business analysts appear to move from Excel to Tableau as their data get bigger and more complex. Tableau gives easy visualisation of database-shaped data and provides connectors to many different databases. My workflow at ScraperWiki is often Python to SQL database to Tableau.

Python for Data Analysis by Wes McKinney draws these threads together. The book is partly about the range of tools which make Python an alternative to systems like R, Matlab and their ilk, and partly a guide to McKinney’s own contribution to this area: the pandas library. Pandas brings R-like dataframes and database-like operations to Python. It helps keep all your data analysis needs in one big Python-y tent. Dataframes are 2-dimensional tables of data whose rows and columns have indexes, which can be numeric but are typically text, and which can be hierarchical. The pandas library provides a great deal of functionality for processing dataframes, in particular filtering and grouping calculations which are reminiscent of the SQL database workflow. As well as the 2-dimensional dataframe, pandas also provides a 1-dimensional Series and a 3-dimensional Panel data structure.
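To give a flavour of what that looks like, here’s a minimal sketch of filtering and grouping a dataframe; the data and column names are invented for illustration, not taken from the book.

```python
import pandas as pd

# A toy dataframe of drug trial results; the columns and values are made up.
df = pd.DataFrame({
    "patient": ["a", "b", "c", "d", "e"],
    "gender":  ["f", "m", "f", "f", "m"],
    "dose":    [10, 20, 10, 20, 10],
    "outcome": [0.3, 0.7, 0.4, 0.9, 0.2],
})

# SQL-like filtering and grouping: mean outcome by gender and dose.
high_dose = df[df["dose"] >= 20]
summary = df.groupby(["gender", "dose"])["outcome"].mean()
print(high_dose)
print(summary)   # the group keys form a hierarchical index
```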

I’ve already been using pandas in the Python part of my workflow. It’s excellent for importing data, and simplifies the process of reshaping data for upload to a SQL database and onwards to visualisation in Tableau. I’m also finding it can be used to help replace some of the more exploratory analysis I do in Tableau and SQL.
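A sketch of that import-reshape-upload step, assuming a wide CSV file and a local SQLite database; the file and column names here are hypothetical:

```python
import sqlite3
import pandas as pd

# Hypothetical wide file: one row per organisation, one column per year.
wide = pd.read_csv("grants_by_year.csv")

# Reshape into the long, database-friendly form: one row per organisation-year.
long_form = wide.melt(id_vars=["organisation"], var_name="year", value_name="amount")

# Push the result into SQLite, ready for Tableau or further SQL work.
with sqlite3.connect("grants.db") as conn:
    long_form.to_sql("grants", conn, if_exists="replace", index=False)
```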

Outside of pandas, the key technologies McKinney introduces are the IPython interactive console and the NumPy library. I mentioned the IPython notebook in my previous book review. IPython gives Python the interactive analysis capabilities of systems like Matlab. NumPy is a high-performance library providing simple multi-dimensional arrays, comforting those who grew up with a FORTRAN background.
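A small sketch of the whole-array arithmetic NumPy provides, the kind of thing a FORTRAN programmer would recognise; the numbers are arbitrary:

```python
import numpy as np

# Whole-array arithmetic without explicit loops.
x = np.linspace(0.0, 2.0 * np.pi, 100)   # 100 evenly spaced points
y = np.sin(x) ** 2

# Multi-dimensional arrays with vectorised reductions.
grid = np.arange(12).reshape(3, 4)
print(grid.mean(axis=0))   # column means
print(y.max(), y.sum())
```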

Why switch from commercial offerings like Matlab to the Python ecosystem? Partly it’s cost: the pricing model for Matlab has a moderately expensive core (of the order of $1000) with further functionality in moderately expensive toolboxes (more $1000s). Furthermore, the most painful and complex thing I did at my previous (very large) employer was represent users in the contractual interactions between my company and Mathworks to license Matlab and its associated toolboxes for hundreds of employees spread across the globe. These days Python offers me a wider range of high quality toolboxes, and at its core it’s a respectable programming language with all the features and tooling that brings. If my code doesn’t run it’s because I wrote it wrong, not because my colleague in Shanghai has grabbed the last remaining network licence for a key toolbox. R still offers statistical analysis with greater gravitas and some really nice, publication quality plotting, but it does not have the air of a general purpose programming language.

The parts of Python for Data Analysis which I found most interesting, and engaging, were the examples of pandas code in "live" usage. Early in the book this includes analysis of first names for babies in the US over time, with later examples from the financial sector – in which the author worked. Much of the rest is very heavy on code snippets, which distracts from a straightforward reading of the book. In some senses Mining the Social Web has really spoiled me – I now expect a book like this to come with an IPython Notebook!

Trainspotting

I am something of a trainspotter.

That’s not to say I have ever stood at the end of a platform writing down the numbers of the trains that go by, rather that I have an interest in things of a railway nature. So obviously I was very excited to get the opportunity to go to Paris on the train.

I’ve been to Paris on the train before. Fifteen years or so ago HappyMouffetard and I travelled from Cambridge to Paris for the odd weekend. In fact that’s where HappyMouffetard picked up her twitter handle. In those days the terminus for the Eurostar was at Waterloo, so the trip meant crossing London from Kings Cross where the Cambridge train came in. Once on the Eurostar you pottered through Kent to the Dover end of the tunnel at what seemed like barely more than walking pace. After passing through the tunnel to France the train accelerated for a while before the guard told us we were travelling at some unimaginable velocity. He sounded a bit smug. The Eurostar would then whine rapidly through northern France to arrive at Gare du Nord.

Things have changed. Now the Eurostar terminus is at St Pancras, which is next door to Kings Cross and a short step down the road from Euston, the station I arrive at from Chester. St Pancras International is a rather fine station, particularly when compared to the competition: airports. Not only does it offer a long bank of charging points but also free WiFi! The trip to Dover is transformed: the train plunges underground for the first few miles but then whizzes along at positively unBritish speeds to the Channel. A little over two hours after leaving London, you are in Paris. Pick the right trains and there are just two scheduled stops between Chester and Paris (at Crewe and in London)!

This makes the whole journey rather more of a practical proposition, even if you are travelling from northern England. Chester to London is currently a little over two hours’ travel time; it would take me an hour and a bit to reach Manchester airport. Check-in for Eurostar is an hour or so, then a couple of hours to Paris, and you end up at Gare du Nord in the centre of Paris rather than at Charles de Gaulle Airport – some distance away. Once at Gare du Nord you walk straight off the train onto the street. Similarly on my return trip, I walk straight off the train and I’m on the platform at Euston in 15 minutes.

I’ve rarely found airports relaxing; they seem hellholes of "duty-free" shopping, stressed travellers, over-crowding, bad food and building works to insert more shopping opportunities, suffused with the baseline low-level dread that the implausibility of powered flight invokes. The only exceptions I’ve found are when I’ve been able to travel business class and take refuge in the business lounge. In fact, I’m not bothered about the business class flying experience – it’s the lounge I’d pay for myself! And once you’re on the plane it’s cheek by jowl with your fellow man, air hostesses trying to force plastic food upon you, and hand luggage woes, since there is insufficient space for the hand luggage everyone now carries now that you get gouged for hold luggage.

Cost-wise things aren’t so happy: a train to London is expensive unless you travel "off peak", a small window in the middle of the day and later in the evening.

In summary, from Chester to Paris:

Flying: 1 hour 30 minutes + 2 hours check-in + 1 hour 30 minutes flight + 1 hour to Paris + airport hell = nasty 6 hours

Train: 2 hours + 1 hour check-in + 2 hours + shiny nice things = nice 5 hours

Book review: Mining the Social Web by Matthew A. Russell


This review was first published at ScraperWiki.

The Twitter search and follower tools are amongst the most popular on the ScraperWiki platform, so we are looking to provide more value in this area. To this end I’ve been reading "Mining the Social Web" by Matthew A. Russell.

In the first instance the book looks like a run through the APIs for various social media services (Twitter, Facebook, LinkedIn, Google+, GitHub etc) but after the first couple of chapters, on Twitter and Facebook, it becomes obvious that it is more subtle than that. Each chapter also includes material on a data mining technique; for Twitter it is simply counting things. The Facebook chapter introduces graph analysis, a theme extended in the chapter on GitHub. Google+ is used as a framework to introduce term frequency-inverse document frequency (TF-IDF), an information retrieval technique and a basic, but effective, way to process natural language. Web page scraping is used as a means to introduce some more ideas about natural language processing and summarisation. Mining mailboxes uses a subset of the Enron mail corpus to introduce MongoDB as a document storage system. The final chapter is a Twitter cookbook which includes lots of short recipes for simple Twitter-related activities but no further analysis. The coverage of each topic isn’t deep but it is practical, introducing the key libraries for the tasks. And it’s alive with suggestions for further work, and references to help with that.
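To make TF-IDF concrete, here is a minimal hand-rolled sketch on toy documents; the book works through it with real Google+ data and library support, so treat this as illustration only:

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs make good pets",
]
tokenised = [d.split() for d in docs]

def tf_idf(term, doc_tokens, corpus):
    # Term frequency: how common the word is within this document.
    tf = doc_tokens.count(term) / len(doc_tokens)
    # Inverse document frequency: how rare the word is across the corpus.
    containing = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / containing)
    return tf * idf

print(tf_idf("cat", tokenised[0], tokenised))  # distinctive word, higher score
print(tf_idf("sat", tokenised[0], tokenised))  # shared between documents, lower score
```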

The examples in the book are provided as IPython Notebooks which are supplied, along with a Notebook server on a virtual machine, from a GitHub repository. IPython Notebooks are interactive Python sessions run through a browser interface. Content is divided into cells which can be either code or simple descriptive text. A code cell can be executed and the output from the code appears in an output cell. These notebooks are a really nice way to present example code since the code has some context. The virtual machine approach is also a great innovation since configuring Python libraries and the IPython server itself, in a platform agnostic manner, is really difficult and this solution bypasses most of those problems. The system makes it incredibly easy to run the example code for yourself – almost too easy, in fact: I found myself clicking blindly through some of the example code. Potentially the book could have been presented simply as an IPython Notebook; this is likely not economically practical, but it would be nice to collect the links to further reading there, where they would be more usable. The GitHub repository also provides a great place for interaction with the author: I filed a couple of issues regarding setting the system up and he responded unerringly quickly – as he did for many other readers. Also I discovered incidentally, through being subscribed to the repository, that one of the people I follow on Twitter (and a guest blogger here) was also reading the book. An interesting example of the social web in action!

Mining the Social Web covers some material I had not come across in my earlier machine learning/data mining reading. There are a couple of chapters containing material on graph theory, using data from Facebook and GitHub. In the way of benefitting from reading about the same material in different places, Russell highlights that clustering and de-duplication are of course facets of the same subject.

I read with interest the section on using a MongoDB database as a store for tweets and other data in the form of JSON objects. Currently I am bemused by MongoDB. The ScraperWiki platform uses it to store user profile information, and I have occasional recourse to try to look things up there. I’ve struggled to see the benefit of MongoDB over a SQL database, particularly having watched two of my colleagues spend a morning working out how to do what would be a simple SQL join in MongoDB. Mining the Social Web has made me wonder about giving MongoDB another chance.
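As a minimal sketch of what that tweet storage looks like in practice, assuming a local MongoDB server and the pymongo library (the database and collection names are invented):

```python
from pymongo import MongoClient

# Connect to a local MongoDB server; "social" and "tweets" are made-up names.
client = MongoClient("localhost", 27017)
tweets = client["social"]["tweets"]

# Tweets arrive from the API as JSON-like dicts and can be stored as-is,
# with no schema to define up front.
tweets.insert_one({
    "id_str": "123456",
    "text": "Reading Mining the Social Web",
    "user": {"screen_name": "example_user", "followers_count": 42},
})

# Query by a nested field, something a SQL table would need flattening for.
for doc in tweets.find({"user.followers_count": {"$gt": 10}}):
    print(doc["text"])
```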

The penultimate chapter is a discussion of the semantic web, introducing both microformats and RDF technology, although the discussion is much less concrete than in earlier chapters. Microformats are HTML elements which hold semantic information about a page using an agreed schema; to give an example, the geo microformat encodes geographic information. In the absence of such a microformat, geographic information such as latitude and longitude could be encoded in pretty much any way, making it necessary to either use custom scrapers on a page-by-page basis or complex heuristics to infer the presence of such information. RDF is one of the underpinning technologies for the semantic web: shorthand for a worldwide web marked up such that machines can understand the meaning of webpages. This touches on the EU Newsreader project, on which we are collaborators, and which seeks to generate this type of semantic markup for news articles using natural language processing.
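A minimal sketch of pulling the geo microformat out of a page with BeautifulSoup; the HTML fragment is invented and this is not the book’s own code:

```python
from bs4 import BeautifulSoup

# An invented page fragment using the geo microformat's agreed class names.
html = """
<div class="geo">
  <span class="latitude">53.19</span>
  <span class="longitude">-2.89</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
geo = soup.find(class_="geo")
latitude = float(geo.find(class_="latitude").get_text())
longitude = float(geo.find(class_="longitude").get_text())
print(latitude, longitude)   # no per-site heuristics needed; the schema is agreed
```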

Overall, definitely worth reading. We’re interested in extending our tools for social media and with this book in hand I’m confident we can do it and be aware of more possibilities.

Book review: Data Mining – Practical Machine Learning Tools and Techniques by Witten, Frank and Hall


This review was first published at ScraperWiki.

I’ve been doing more reading on machine learning, this time in the form of Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten, Eibe Frank and Mark A. Hall. This comes by recommendation of my academic colleagues on the Newsreader project, who rely heavily on machine learning techniques to do natural language processing.

Data mining is about finding structure in data, and the algorithms for doing this are found in the field of machine learning. The classic example is the Iris flower dataset. This dataset contains measurements of parts of a flower for three different species of Iris; the challenge is to build a system which classifies a flower to its species from its measurements. More practical examples are in the diagnosis of machine faults, credit assessment, detection of oil slicks, customer support analysis, marketing and sales.
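The book works examples like this in Weka, but as a rough sketch of the same task in Python (using scikit-learn, which is not covered in the book), it looks something like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# The classic Iris measurements and species labels ship with scikit-learn.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Fit a decision tree and check how well it classifies unseen flowers.
model = DecisionTreeClassifier().fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
```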

Previously I’ve reviewed Machine Learning in Action by Peter Harrington. Data Mining is a somewhat different book. The core contents are quite similar: background to machine learning, evaluating your results and a run through the core algorithms. Machine Learning in Action is a pretty quick run through the field touching on many subjects, with toy demonstrations built from scratch in Python. Data Mining, running to almost 600 pages, is a much more thorough reference. There is a place for both types of book, even on the same bookshelf.

Data Mining is written by three members of the University of Waikato’s Computer Science Department and is based around the Weka machine learning system developed there. Weka is a complete framework, written in Java, which implements the algorithms described in the book as well as some others. Weka can be accessed via the command line or using a GUI. As well as the machine learning algorithms there are systems for preparing data, and for evaluating and visualising results. A collection of well-known demonstration datasets is included. I’ve no reason to doubt the quality of the implementations in Weka; the GUI is functional, occasionally puzzling and not particularly slick. The book stands alone from the Weka framework, but the framework provides a good playground to try out the techniques discussed in the book, and Weka seems entirely suitable for conducting serious analysis. This is in contrast to the approach of Harrington in Machine Learning in Action, who provides toy implementations of algorithms in Python.

The first two parts of the book provide an overview of machine learning, followed by a more detailed look at how the key algorithms are implemented. The third section is dedicated to Weka, whilst the first two sections refer to it but do not rely on it. The third section is divided into a discussion of Weka, covering all its key features and then a tutorial. I found this a bit confusing since the first part has the air of a tutorial, but isn’t, and the tutorial part keeps referring back to the overview section for its screenshots.

Coming to the book with some knowledge of machine learning already, the things I learned from it were:

  • better methods for, and subtleties in, measuring the performance of machine learning algorithms;
  • the success of the one-rule algorithm, essentially a decision tree which gets the maximum benefit from a single rule. It turns out such an approach is surprisingly effective and only bettered a little, if at all, by more sophisticated algorithms (see the sketch after this list);
  • getting enough, clean data to do machine learning is often a problem;
  • where to learn more!
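The sketch promised above: a minimal paraphrase of the one-rule idea on a made-up fragment of the weather-style demonstration data that ships with Weka; this is my own illustration, not code from the book.

```python
from collections import Counter, defaultdict

def one_rule(rows, features, label):
    """Pick the single feature whose per-value majority class makes the fewest errors."""
    best = None
    for feature in features:
        # For each value of the feature, predict the most common label.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[feature]][row[label]] += 1
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        if best is None or errors < best[1]:
            best = (feature, errors, rule)
    return best  # (feature, error count, value -> predicted class)

# A made-up fragment in the style of Weka's "weather" demonstration data.
rows = [
    {"outlook": "sunny",    "windy": "no",  "play": "no"},
    {"outlook": "sunny",    "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no",  "play": "yes"},
    {"outlook": "rainy",    "windy": "no",  "play": "yes"},
    {"outlook": "rainy",    "windy": "yes", "play": "no"},
]
print(one_rule(rows, ["outlook", "windy"], "play"))
```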

The first edition of this book was published in 1999; my review is of the third edition. The book does show some signs of age. Machine Learning in Action was written as a response to a poll published at the International Conference on Data Mining 2006 on the 10 most important machine learning algorithms (see the paper here). Whilst Data Mining mentions this survey, it does so as something of an afterthought, and the authors seem bemused by the inclusion of the PageRank algorithm used by Google to rank web pages in search results. They mention the MOA framework for data stream mining, although they do not discuss it in any detail; MOA focuses on techniques for large datasets.

In summary: a well-written, well-structured and readable book on machine learning algorithms with demonstrations based on an extensive machine learning framework. Definitely one to read and come back to for reference.

The BIG Lottery Data


This post was originally published at ScraperWiki.

The UK’s BIG Lottery Fund recently released its grant data since 2004 as a set of lovely CSV files: you can get it yourself here or here. I found it a great opportunity to try out some new tricks with Tableau, and to have a bit of a poke around another largish dataset from government. The data runs to a little under 120,000 lines.

The first question to ask is: where is all the money going?

The total awarded is £5,277,058,180 over nearly 10 years, going to 81,386 different organisations. The sizes of grants vary enormously; the biggest, £214,340,846, went to the Big Local Trust, which is an umbrella organisation. Other big recipients include the Royal Society of Wildlife Trusts, which received £59,842,400 for the Local Food programme. The top 10 grants are listed below:

01/03/2012 – Big Local Trust – £214,340,846
15/08/2007 – Royal Society of Wildlife Trusts – £59,842,400
04/10/2007 – The Federation of Groundwork Trusts – £58,306,400
13/05/2008 – Sustrans Limited – £49,980,908
11/10/2012 – Life Changes (Trustee) Limited – £49,338,186
13/12/2011 – Forces In Mind Trustee Limited – £34,808,423
19/10/2007 – Natural England – £30,113,200
01/05/2007 – Legacy Trust UK Limited – £28,850,000
31/07/2007 – Sustrans Limited – £25,023,084
09/04/2008 – Falkirk Council – £25,000,000

Awards like this make determining the true geographic distribution of grants a bit tricky, since they are registered as being awarded to a particular local area – apparently the head office of the applicant – but they are used nationally. There is a regional breakdown of where the money is spent, but this classification is to large areas, i.e. "England" or "North West". The Big Local Trust, Life Changes and Forces in Mind are all very recently established – less than a couple of years old. The Legacy Trust was established in 2007 to fund programmes to promote an Olympic legacy.

These are really big grants, but what does the overall distribution of awards look like?

This is shown in the chart below:

Award distribution

It’s a bit complicated because the spread of award sizes runs from about £1000 to over £100,000,000, so what I’ve done is take the logarithm of the award size to create the bins. This means that the column marked "3" contains the sum of all awards from £1000 to £9999 and that marked "4" contains the sum of all awards from £10,000 to £99,999. The chart shows that most money is distributed in the column marked "5", i.e. £100,000 to £999,999. The columns are coloured by the year in which the money was awarded, so we can see that there were large grants awarded in 2007 as well as in 2011 and 2013.
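A minimal sketch of that binning with pandas; the column headers ("Award Amount", "Award Date") and file name are my assumptions, not necessarily what the BIG Lottery CSV files actually use:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("big_lottery_grants.csv")   # hypothetical file name

# Order-of-magnitude bins: 3 means £1000-£9999, 4 means £10,000-£99,999, etc.
df["magnitude"] = np.floor(np.log10(df["Award Amount"])).astype(int)
df["year"] = pd.to_datetime(df["Award Date"], dayfirst=True).dt.year

# Total money awarded in each bin, split by the year of award.
binned = df.groupby(["magnitude", "year"])["Award Amount"].sum().unstack("year")
print(binned)
```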

Everybody loves a word cloud, even though we know it’s not good in terms of data visualisation: a simple bar chart shows the relative frequency of words more clearly. The word cloud below shows the frequency of words appearing in the applicant name field of the data; lots of money is going to Communities, Schools, Clubs and Councils.

Word cloud of applicant names
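Counting the words behind the cloud is straightforward; a sketch assuming the applicant name column is called "Organisation Name" (the real header may differ):

```python
from collections import Counter
import pandas as pd

df = pd.read_csv("big_lottery_grants.csv")   # hypothetical file name

# Split applicant names into words and count them across the whole dataset.
words = Counter(
    word.lower()
    for name in df["Organisation Name"].dropna()   # assumed column name
    for word in name.split()
)
print(words.most_common(20))   # the raw numbers behind the word cloud
```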

The data also include the founding date for the organisations to which money is awarded; most of them were founded since the beginning of the 20th century. There are quite a few schools and local councils in the list and, particularly for councils, we can see the effect of legislation on the foundation of these organisations: there are big peaks in founding dates for councils in 1894 and in 1972-1974, coinciding with a couple of local government acts. There’s a dip in the foundation of bodies funded by the BIG lottery for both the First and Second World Wars; I guess people’s energies were directed elsewhere. The National Lottery started in the UK in late 1994.

Founding year

As a final piece of analysis I thought I’d look at sport; I’m not particularly interested in sport, so I let natural language processing find sports for me in applicant names – they are often of the form "Somewhere Cricket/Rugby/Tennis/etc Club". One way of picking out all the sports awards would be to come up with a list of sports names and compare against that list, but I applied a little more cunning: the nltk library will tell you how closely related two words are using the WordNet lexicon which it contains. So I identified sports by measuring how closely related a target word was to the word "sport". This got off to a shaky start since I decided to use "cricket" as a test word; "cricket" is as closely related to "sport" as "hamster" – a puzzling result until I realised that the first definition of "cricket" in WordNet relates to the insect! With this confusion dispensed with, finding all the sports mentioned in the applicant names was an easy task. The list of sports I ended up with was unexceptional.
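A minimal sketch of that WordNet trick, which is my reconstruction of the approach rather than the actual analysis code; it needs the WordNet corpus installed (nltk.download("wordnet")):

```python
from nltk.corpus import wordnet as wn

sport = wn.synset("sport.n.01")

def sportiness(word):
    """Best Wu-Palmer similarity between any noun sense of `word` and 'sport'."""
    scores = [sport.wup_similarity(s) for s in wn.synsets(word, pos=wn.NOUN)]
    scores = [s for s in scores if s is not None]
    return max(scores) if scores else 0.0

# The first noun sense of "cricket" in WordNet is the insect, which is why a
# naive first-sense comparison scores it no higher than "hamster"; taking the
# best score over all senses picks up the sport.
for word in ["cricket", "rugby", "tennis", "hamster"]:
    print(word, round(sportiness(word), 2))
```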

You can find participation levels in various sports here; I plotted them together with numbers of awards. Sports near the top left have relatively few awards given the number of participants, whilst those at the bottom right have more awards than would be expected from the number of participants.

 

Number of clubs vs number of participants

You can see interactive versions of these plots, plus a few more, here on Tableau Public.

That’s what I found in the data – what would interest you?

Footnotes

I uploaded the CSV files to a MySQL database before loading them into Tableau; I also did a bit of work in Python using the pandas library. In addition to the BIG Lottery data I pulled in census data from the ONS and geographic boundary data from Tableau Mapping. You can see all this unfolding in the bitbucket repo I set up to store the analysis. Since Tableau workbook files are XML format they can usefully be stored in source control.