
The London Underground – Can I walk it?

There are tube strikes planned for 25th August 2015 and 28th August 2015 with disruption through the week. The nature of the London Underground means that it is not at all obvious that walks between stations can be quite short. This blog post introduces a handy tool to help you work out “Can I walk it?”

You can find the tool here:


To use it, start by selecting the station you want to walk from, either by using the “Where am I?” dropdown or by clicking one of the coloured station symbols (or close to it). The map will then refresh: the station you selected is marked by a red disk, the stations within 1.5 miles of the starting station are marked by orange disks and those more than 1.5 miles away are marked by blue disks. 1.5 miles is my “walkable” threshold; it takes me about 25 minutes to walk that far. You can enter your own “walkable” threshold in the “I will walk” box and press refresh, or select a new starting station, to refresh the map.

The station markers will show the station names on mouseover, and the distances to the starting station once it has been selected.

This tool comes with no guarantees: the walking distances are estimated and these estimates may be faulty, particularly for river crossings. Weather conditions may make walking an unpleasant or unwise decision. The tool relies on the user to supply their own reasonable walking threshold. Your mileage may vary.

To give a little background to this project: I originally made this tool using Tableau. It was OK but tied to the Tableau Public platform. I felt it was a little slow and unresponsive. It followed some work I’d done visualising data relating to the London Underground which you can read about here.

As an exercise I thought I’d try to make a “Can I walk it?” web application, re-writing the original visualisation in JavaScript and Python. I’ve been involved with projects like this at ScraperWiki but never done the whole thing for myself. I used the leaflet.js library to provide the mapping, the Flask library in Python to serve the data, Bootstrap to make it look okay and Docker containers on Digital Ocean to deploy the application.

The underlying data for this tool comes from OpenStreetMap, where the locations of all the London Underground stations are encoded as latitude and longitude. With this information in hand it is possible to calculate the distances between stations. Really I want the “walking distance” between stations rather than the “as the crow flies” distance, which is what this data gives me. Ideally, to get the walking distance I’d use the Google Directions API, but unfortunately this has a rate limit of 2500 calls per day and I need to make about 36000 calls to get all the data I need!
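To make the “crow flies” calculation concrete, here is a minimal sketch in Python. It assumes the haversine formula for great-circle distance; the post does not say which formula the tool itself uses. The function names, the station names A, B and C, and their coordinates are all made up for illustration and are not taken from the tool's code or data.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def classify_stations(start, stations, threshold_miles=1.5):
    """Label every station relative to the starting station.

    `stations` maps a station name to a (latitude, longitude) pair.
    """
    start_lat, start_lon = stations[start]
    labels = {}
    for name, (lat, lon) in stations.items():
        if name == start:
            labels[name] = "start"
        elif haversine_miles(start_lat, start_lon, lat, lon) <= threshold_miles:
            labels[name] = "walkable"
        else:
            labels[name] = "too far"
    return labels


# Hypothetical stations: B is roughly 0.7 miles north of A, C roughly 3.5 miles.
stations = {
    "A": (51.50, -0.10),
    "B": (51.51, -0.10),
    "C": (51.55, -0.10),
}
labels = classify_stations("A", stations, threshold_miles=1.5)
```

The three labels correspond to the red, orange and blue disks on the map. On the rate limit: 36000 calls at 2500 calls per day works out at a little over 14 days of requests, which is why a straight-line estimate is attractive.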

The code is open source and available in this BitBucket repository:


Comments and feedback are welcome!

Book review: Docker Up & Running by Karl Matthias and Sean P. Kane

This review was first published at ScraperWiki.

This last week I have been reading Docker Up & Running by Karl Matthias and Sean P. Kane, a newly published book on Docker – a container technology which is designed to simplify the process of application testing and deployment.

Docker is a very new product, first announced in March 2013, although it is based on older technologies. It has seen rapid uptake by a number of major web-based companies who have open-sourced their tooling for using Docker. We have been using Docker at ScraperWiki for some time, and our most recent projects use it in production. It addresses a common problem for which we have tried a number of technologies in search of a solution.

For a long time I have thought of Docker as providing some sort of cut-down virtual machine; from this book I realise this is the wrong mindset – it is better to think of it as a “process wrapper”. The “Advanced Topics” chapter of this book explains how this is achieved technically. This makes Docker a much lighter weight, faster proposition than a virtual machine.

Docker is delivered as a single binary containing both client and server components. The client gives you the power to build Docker images and query the server which hosts the running Docker images. The client part of this system will run on Windows, Mac and Linux systems. The server will only run on Linux due to the specific Linux features that Docker utilises in doing its stuff. Mac and Windows users can use boot2docker to run a Docker server. boot2docker uses a minimal Linux virtual machine to run the server, which removes some of the performance advantages of Docker but allows you to develop anywhere.

The problem Docker and containerisation are attempting to address is that of capturing the dependencies of an application and delivering them in a convenient package. It allows developers to produce an artefact, the Docker Image, which can be handed over to an operations team for deployment without to and froing to get all the dependencies and system requirements fixed.

Docker can also address the problem of a development team onboarding a new member who needs to get the application up and running on their own system in order to develop it. Previously such problems were addressed with a flotilla of technologies with varying strengths and weaknesses, things like Chef, Puppet, Salt, Juju, virtual machines. Working at ScraperWiki I saw each of these technologies causing some sort of pain. Docker may or may not take all this pain away but it certainly looks promising.

The Docker image is compiled from instructions in a Dockerfile which has directives to pull down a base operating system image from a registry, add files, run commands and set configuration. The “image” language is probably where my false impression of Docker as virtualisation comes from. Once we have made the Docker image there are commands to deploy and run it on a server, inspect any logging and do debugging of a running container.

Docker is not a “total” solution: it has nothing to say about triggering builds, bringing up hardware or managing clusters of servers. At ScraperWiki we’ve been developing our own systems to do this, which is clearly the approach that many others are taking.

Docker Up & Running is pretty good at laying out what it is you should do with Docker, rather than what you can do with Docker. For example the book makes clear that Docker is best suited to hosting applications which have no state. You can copy files into a Docker container to store data but then you’d need to work out how to preserve those files between instances. Docker containers are expected to be volatile – here today gone tomorrow or even here now, gone in a minute. The expectation is that you should preserve state outside of a container using environment variables, Amazon’s S3 service or an externally hosted database etc – depending on the size of the data. The material in the “Advanced Topics” chapter highlights the possible Docker runtime options (and then advises you not to use them unless you have very specific use cases). There are a couple of whole chapters on Docker in production systems.
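As a rough illustration of keeping small pieces of state outside the container, an application can read its settings from environment variables at start-up rather than from files baked into the image. This is a minimal sketch, not an example from the book: the function name, the variable names DATABASE_URL, S3_BUCKET and DEBUG, and the default values are all hypothetical.

```python
import os


def load_settings(environ=os.environ):
    """Read settings from environment variables, with local-development defaults."""
    return {
        # Where the externally hosted database lives (hypothetical variable name).
        "database_url": environ.get("DATABASE_URL", "sqlite:///local.db"),
        # Name of an S3 bucket for larger data; empty means none configured.
        "s3_bucket": environ.get("S3_BUCKET", ""),
        # Treat the string "1" as true, anything else as false.
        "debug": environ.get("DEBUG", "0") == "1",
    }


# A plain dict stands in for the container's environment in this example.
settings = load_settings({
    "DATABASE_URL": "postgres://db.example.com/app",
    "DEBUG": "1",
})
```

The same image can then be started with different values (for instance via the `-e` option of `docker run`) and the container itself holds nothing that needs preserving.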

If my intention was to use Docker “live and in anger” then I probably wouldn’t learn how to do so from this book since the landscape is changing so fast. I might use it to identify what it is that I should do with Docker, rather than what I can do with Docker. For the application side of ScraperWiki’s business the use of Docker is obvious; for the data science side it is not so clear. For our data science work we make heavy use of Python’s virtualenv system which captures most of our dependencies without being opinionated about data (state).

The book has information in it up until at least the beginning of 2015. It is well worth reading as an introduction and overview of Docker.

Book review: Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia

This post was first published at ScraperWiki.

Apache Spark is a system for doing data analysis which can be run on a single machine or across a cluster. It is pretty new technology – initial work was in 2009 and Apache adopted it in 2013. There’s a lot of buzz around it, and I have a problem for which it might be appropriate. The goal of Spark is to be faster and more amenable to iterative and interactive development than Hadoop MapReduce, a sort of IPython of Big Data. I used my traditional approach to learning more: buying a dead-tree publication, Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia, and then reading it on my commute.

The core of Spark is the resilient distributed dataset (RDD), a data structure which can be distributed over multiple computational nodes. Creating an RDD is as simple as passing a file URL to a constructor (the file may be located on some Hadoop-style system) or parallelizing an in-memory data structure. To this data structure are added transformations and actions. Transformations produce another RDD from an input RDD, for example filter() returns an RDD which is the result of applying a filter to each row in the input RDD. Actions produce a non-RDD output, for example count() returns the number of elements in an RDD.
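The difference between transformations and actions can be shown without Spark itself. What follows is a toy sketch in plain Python: ToyRDD is a hypothetical, single-machine stand-in written for this illustration and is not Spark's API. It only borrows the method names filter() and count() mentioned above.

```python
class ToyRDD:
    """A toy, in-memory stand-in for an RDD (not part of Spark)."""

    def __init__(self, rows):
        self._rows = list(rows)

    def filter(self, predicate):
        # Transformation: builds and returns a new ToyRDD,
        # leaving the input ToyRDD unchanged.
        return ToyRDD(row for row in self._rows if predicate(row))

    def count(self):
        # Action: returns a plain Python integer, not a ToyRDD.
        return len(self._rows)


# "Parallelizing" an in-memory list of made-up log lines.
log_lines = ToyRDD([
    "INFO starting up",
    "ERROR disk full",
    "INFO request served",
    "ERROR timeout",
])

errors = log_lines.filter(lambda line: line.startswith("ERROR"))  # another ToyRDD
error_count = errors.count()                                      # an int
```

In PySpark the shape of the code is much the same, with the data created through a SparkContext (for example with its `parallelize` method) and the rows spread across the cluster rather than held in one list.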

Spark provides functionality to control how parts of an RDD are distributed over the available nodes, for example by key. In addition there is functionality to share data across multiple nodes using “Broadcast Variables”, and to aggregate results in “Accumulators”. The behaviour of Accumulators in distributed systems can be complicated since Spark might preemptively execute the same piece of processing twice because of problems on a node.

In addition to Spark Core there are Spark Streaming, Spark SQL, MLlib machine learning, GraphX and SparkR modules. Learning Spark covers the first three of these. The Streaming module handles data such as log files which are continually growing over time, using a DStream structure which is composed of a sequence of RDDs with some additional time-related functions. Spark SQL introduces the DataFrame data structure (previously called SchemaRDD) which enables SQL-like queries using HiveQL. The MLlib library introduces a whole bunch of machine learning algorithms such as decision trees, random forests, support vector machines, naive Bayes and logistic regression. It also has support routines to normalise and analyse data, as well as clustering and dimension reduction algorithms.

All of this functionality looks pretty straightforward to access; example code is provided for Scala, Java and Python. Scala is a functional language which runs on the Java virtual machine so appears to get equivalent functionality to Java. Python, on the other hand, appears to be a second class citizen. Some functionality, particularly in I/O, is missing Python support. This does raise the question as to whether one should start analysis in Python and make the switch as and when required, or whether to start in Scala or Java where you may well be forced anyway. Perhaps the intended usage is Python for prototyping and Java/Scala for production.

The book is pitched at two audiences, data scientists and software engineers, as is Spark. This would explain support for Python and (more recently) R, to keep the data scientists happy, and Java/Scala for the software engineers. I must admit, looking at examples in Python and Java together, I remember why I love Python! Java requires quite a lot of class declaration boilerplate to get it into the air, and brackets.

Spark will run on a standalone machine; I got it running on Windows 8.1 in short order. Analysis programs appear to be deployable to a cluster unaltered, with the changes handled in configuration files and command line options. The feeling I get from Spark is that it would be entirely appropriate to undertake analysis with Spark which you might do using pandas or scikit-learn locally, and if necessary you could scale up onto a cluster with relatively little additional effort rather than having to learn some fraction of the Hadoop ecosystem.

The book suffers a little from covering a subject area which is rapidly developing. Spark is currently at version 1.4 as of early June 2015, the book covers version 1.1 and things are happening fast. For example, GraphX and SparkR, more recent additions to Spark, are not covered. That said, this is a great little introduction to Spark, and I’m now minded to go off and apply my new-found knowledge to the Kaggle – Avito Context Ad Clicks challenge!

Book review: Mastering Gephi Network Visualisation by Ken Cherven


This review was first posted at ScraperWiki.

A little while ago I reviewed Ken Cherven’s book Network Graph Analysis and Visualisation with Gephi; it’s fair to say I was not very complimentary about it. It was rather short, and had quite a lot of screenshots. Its strength was in introducing every single element of the Gephi interface. This book, Mastering Gephi Network Visualisation by Ken Cherven, is a different, and better, book.

Networks in this context are collections of nodes connected by edges, and they are ubiquitous. The nodes may be people in a social network, and the edges their friendships. Or the nodes might be proteins and metabolic products and the edges the reaction pathways between them. Or any other of a multitude of systems. I’ve reviewed a couple of other books in this area including Barabási’s popular account of the pervasiveness of networks, Linked, and van Steen’s undergraduate textbook, Graph Theory and Complex Networks, which cover the maths of network (or graph) theory in some detail.

Mastering Gephi is a practical guide to using the Gephi network visualisation software; it covers the more theoretical material regarding networks in a peripheral fashion. Gephi is the most popular open source network visualisation system of which I’m aware; it is well-featured and under active development. Many of the network visualisations you see of, for example, Twitter social networks, will have been generated using Gephi. It is a pretty complex piece of software, and if you don’t want to rely on information on the web or taught courses then Cherven’s books are pretty much your only alternative.

The core chapters are on layouts, filters, statistics, segmenting and partitioning, and dynamic networks. Outside this there are some more general chapters, including one on exporting visualisations and an odd one on “network patterns” which introduced diffusion and contagion in networks but then didn’t go much further.

I found the layouts chapter particularly useful; it’s a review of the various layout algorithms available. In most cases there is no “correct” way of drawing a network on a 2D canvas: layout algorithms are designed to distribute nodes and edges on a canvas to enable the viewer to gain understanding of the network they represent. From this chapter I discovered the directed acyclic graph (DAG) layout which can be downloaded as a Gephi plugin. Tip: I had to go search this plugin out manually in the Gephi Marketplace; it didn’t get installed when I indiscriminately tried to install all plugins. The DAG layout is good for showing tree structures such as organisational diagrams.

I learnt of the “Chinese Whispers” and “Markov clustering” algorithms for identifying clusters within a network in the chapter on segmenting and partitioning. These algorithms are not covered in detail but sufficient information is provided that you can try them out on a network of your choice, and go look up more information on their implementation if desired. The filtering chapter is very much about the mechanics of how to do a thing in Gephi (filter a network to show a subset of nodes), whilst the statistics chapter is more about the range of network statistical measures known in the literature.

I was aware of the ability of Gephi to show dynamic networks, ones that evolve over time, but had never experimented with this functionality. Cherven’s book provides an overview of this functionality using data from baseball as an example. The example datasets are quite appealing: they include social networks in schools, baseball, and jazz musicians. I suspect they are standard examples in the network literature, but this is no bad thing.

The book follows the advice that my old PhD supervisor gave me on giving presentations: tell the audience what you are going to tell them, tell them and then tell them what you told them. This works well for the limited time available in a spoken presentation, where repetition helps the audience remember, but it feels a bit like overkill in a book. In a book we can flick back to remind us what was written earlier.

It’s a bit frustrating that the book is printed in black and white, particularly at the point where we are asked to admire the blue and yellow parts of a network visualisation! The referencing is a little erratic with a list of books appearing in the bibliography but references to some of the detail of algorithms only found in the text.

I’m happy to recommend this book as a solid overview of Gephi for those that prefer to learn from dead tree, such as myself. It has good coverage of Gephi features, and some interesting examples. In places it is a little shallow and repetitive.

The publisher sent me this book, free of charge, for review.

Book review: Cryptocurrency by Paul Vigna and Michael J. Casey


This review was first posted at ScraperWiki.

Amongst hipster start ups in the tech industry Bitcoin has been a thing for a while. As one of the more elderly members of this community I wanted to understand a bit more about it. Cryptocurrency: How Bitcoin and Digital Money are Challenging the Global Economic Order by Paul Vigna and Michael Casey fits this bill.

Bitcoin is a digital currency: the Bitcoin has a value which can be exchanged against other currencies but it has no physical manifestation. The really interesting thing is how Bitcoins move around without any central authority: there is no Bitcoin equivalent of the Visa or BACS payment systems with their attendant organisations, or of a central bank as in the case of a normal currency. This division between Bitcoin as currency and Bitcoin as decentralised exchange mechanism is really important.

Conventional payment systems like Visa have a central organisation which charges retailers a percentage on every payment made using their system. This is exceedingly lucrative. Bitcoin replaces this with the blockchain – a distributed ledger in which transactions are cryptographically secured. The validation is carried out by so-called ‘miners’ who are paid in Bitcoin for carrying out a computationally intensive cryptographic task which ensures the scarcity of Bitcoin and helps maintain its value. In principle anybody can be a Bitcoin miner; all they need is the required free software and the ability to run it. The generation of new Bitcoin is strictly controlled by the fundamental underpinnings of the blockchain software. Bitcoin miners are engaged in a hardware arms race with each other as they compete to complete blocks on the blockchain: more processing power equals more chances to complete blocks ahead of the competition and hence win more Bitcoin. In practice, mining meaningful quantities these days requires significant, highly specialised hardware.
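The computationally intensive task that miners perform can be illustrated with a toy proof-of-work loop. This is a simplified sketch and not Bitcoin's actual scheme: real mining applies SHA-256 twice to a binary block header and compares the result against a numeric target, whereas here a single SHA-256 hash of a text string must start with a given number of zero hex digits. The function names and the example transaction text are made up for illustration.

```python
import hashlib


def block_hash(block_data, nonce):
    """SHA-256 hex digest of the block text followed by the nonce."""
    return hashlib.sha256(f"{block_data}{nonce}".encode("utf-8")).hexdigest()


def mine(block_data, difficulty=4):
    """Try nonces in turn until the hash starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = block_hash(block_data, nonce)
        if digest.startswith(prefix):
            return nonce, digest
        nonce += 1


def verify(block_data, nonce, difficulty=4):
    """Checking a claimed solution needs only one hash, however long mining took."""
    return block_hash(block_data, nonce).startswith("0" * difficulty)


nonce, digest = mine("alice pays bob 1 coin", difficulty=4)
is_valid = verify("alice pays bob 1 coin", nonce, difficulty=4)
```

Each extra zero digit required multiplies the expected number of attempts by 16, while verification stays at a single hash. That asymmetry is what turns mining into a race decided by raw processing power.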

Vigna and Casey provide a history of Bitcoin starting with a bit of background as to how economists see currency; this amounts to the familiar division between the Austrian school and the Keynesians. The Austrians are interested in currency as gold, whilst the Keynesians are interested in currency as a medium of exchange. As a currency Bitcoin doesn’t appeal to Keynesians since there is no “quantitative easing” in Bitcoin: the government can’t print money.

Bitcoin did not appear from nowhere: during the late 90s and early years of the 21st century there were corporate attempts at building digital currencies. These died away; they had the air of lone wolf operations hidden within corporate structures which met their end perhaps when they filtered up to a certain level and their threat to the current business model was revealed. Or perhaps in the chaos of the financial collapse.

More pertinently there were the cypherpunks, a group interested in cryptography operating on the non-governmental, non-corporate side of the community. This group was also experimenting with ideas around digital currencies. This culminated in 2008 with the launch of Bitcoin, by the elusive Satoshi Nakamoto, to a cryptography mailing list. Nakamoto has since disappeared. No one has ever met him, and no one knows whether the name is a pseudonym for one of the cypherpunks and, if so, which one.

Following its release Bitcoin experienced a period of organic growth with cryptography enthusiasts and the technically curious. With the Bitcoin currency growing, an ecosystem started to grow around it, beginning with more user-friendly routes to accessing the blockchain – wallets to hold your Bitcoins, digital currency exchanges and tools to inspect the transactions on the blockchain.

Bitcoin has suffered reverses, most notoriously the collapse of the Mt Gox currency exchange and its use in the Silk Road market, which specialised in illegal merchandise. The Mt Gox collapse demonstrated both flaws in the underlying protocol and its vulnerability to poorly managed components in the ecosystem. Alongside this has been the wildly fluctuating value of the Bitcoin against other conventional currencies.

One of the early case studies in Cryptocurrency is of women in Afghanistan, forbidden by social pressure if not actual law from owning private bank accounts. Bitcoin provides them with a means for gaining independence and control over at least some financial resources. There is the prospect of it becoming the basis of a currency exchange system for the developing world where transferring money within a country or sending money home from the developed world are as yet unsolved problems, beset both with uncertainty and high costs.

To my mind Bitcoin is an interesting idea: as a traditional currency it feels like a non-starter but as a decentralised transaction mechanism it looks very promising. The problem with decentralisation is: who do you hold accountable? This applies in two senses. Firstly, the technical sense – what if the software is flawed? Secondly, conventional currencies are backed by countries not software; a country has a stake in the success of a currency and the means to execute strategies to protect it. Bitcoin has the original vision of a vanished creator, and a very small team of core developers. As an aside, Vigna and Casey point out there is a limit within Bitcoin of 7 transactions per second, which compares with 10,000 transactions per second handled by the Visa network.

It’s difficult to see what the future holds for Bitcoin, though Vigna and Casey run through some plausible scenarios. Cryptocurrency is well-written, comprehensive and pitched at the right technical level.