Book review: Docker Up & Running by Karl Matthias and Sean P. Kane

This review was first published at ScraperWiki.

This last week I have been reading Docker Up & Running by Karl Matthias and Sean P. Kane, a newly published book on Docker – a container technology which is designed to simplify the process of application testing and deployment.

Docker is a very new product, first announced in March 2013, although it is based on older technologies. It has seen rapid uptake by a number of major web-based companies who have open-sourced their tooling for using Docker. We have been using Docker at ScraperWiki for some time, and our most recent projects use it in production. It addresses a common problem for which we have tried a number of technologies in search of a solution.

For a long time I have thought of Docker as providing some sort of cut-down virtual machine; from this book I realise this is the wrong mindset – it is better to think of it as a “process wrapper”. The “Advanced Topics” chapter of this book explains how this is achieved technically. This makes Docker a much lighter weight, faster proposition than a virtual machine.

Docker is delivered as a single binary containing both client and server components. The client gives you the power to build Docker images and query the server which hosts the running Docker images. The client part of this system will run on Windows, Mac and Linux systems. The server will only run on Linux due to the specific Linux features that Docker utilises in doing its stuff. Mac and Windows users can use boot2docker to run a Docker server; it uses a minimal Linux virtual machine to run the server, which removes some of the performance advantages of Docker but allows you to develop anywhere.

The problem Docker and containerisation are attempting to address is that of capturing the dependencies of an application and delivering them in a convenient package. It allows developers to produce an artefact, the Docker image, which can be handed over to an operations team for deployment without to-ing and fro-ing to get all the dependencies and system requirements fixed.

Docker can also address the problem of a development team onboarding a new member who needs to get the application up and running on their own system in order to develop it. Previously such problems were addressed with a flotilla of technologies with varying strengths and weaknesses, things like Chef, Puppet, Salt, Juju, virtual machines. Working at ScraperWiki I saw each of these technologies causing some sort of pain. Docker may or may not take all this pain away but it certainly looks promising.

The Docker image is compiled from instructions in a Dockerfile which has directives to pull down a base operating system image from a registry, add files, run commands and set configuration. The “image” language is probably where my false impression of Docker as virtualisation comes from. Once we have made the Docker image there are commands to deploy and run it on a server, inspect any logging and do debugging of a running container.

Docker is not a “total” solution: it has nothing to say about triggering builds, bringing up hardware or managing clusters of servers. At ScraperWiki we’ve been developing our own systems to do this, which is clearly the approach that many others are taking.

Docker Up & Running is pretty good at laying out what it is you should do with Docker, rather than what you can do with Docker. For example, the book makes clear that Docker is best suited to hosting applications which have no state. You can copy files into a Docker container to store data but then you’d need to work out how to preserve those files between instances. Docker containers are expected to be volatile – here today, gone tomorrow, or even here now, gone in a minute. The expectation is that you should preserve state outside of a container using environment variables, Amazon’s S3 service or an externally hosted database etc. – depending on the size of the data. The material in the “Advanced Topics” chapter highlights the possible Docker runtime options (and then advises you not to use them unless you have very specific use cases). There are a couple of whole chapters on Docker in production systems.
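
As a rough illustration of keeping state outside the container, here is a minimal Python sketch of an application reading its configuration from environment variables rather than from files baked into the image. The variable names (DATABASE_URL, S3_BUCKET, DEBUG) and the default values are my own assumptions for the example, not something taken from the book.

```python
import os


def load_settings(environ=os.environ):
    """Build application settings from environment variables.

    The variable names below are hypothetical; a real application
    would choose its own.
    """
    return {
        # Where long-lived data lives: an externally hosted database
        "database_url": environ.get("DATABASE_URL", "sqlite:///:memory:"),
        # Name of an external bucket for larger files (empty if unset)
        "bucket": environ.get("S3_BUCKET", ""),
        # Simple on/off flag passed in as the string "1"
        "debug": environ.get("DEBUG", "0") == "1",
    }


# Simulate the environment a container might be started with
settings = load_settings(
    {"DATABASE_URL": "postgres://db.example.com/app", "DEBUG": "1"}
)
print(settings)
```

The container itself then holds nothing worth keeping: throw it away, start a new one with the same environment, and the application picks up where it left off.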

If my intention was to use Docker “live and in anger” then I probably wouldn’t learn how to do so from this book since the landscape is changing so fast. I might use it to identify what it is that I should do with Docker, rather than what I can do with Docker. For the application side of ScraperWiki’s business the use of Docker is obvious; for the data science side it is not so clear. For our data science work we make heavy use of Python’s virtualenv system which captures most of our dependencies without being opinionated about data (state).

The book has information in it up until at least the beginning of 2015. It is well worth reading as an introduction and overview of Docker.

Book Review: Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia

This post was first published at ScraperWiki.

Apache Spark is a system for doing data analysis which can be run on a single machine or across a cluster. It is pretty new technology – initial work was in 2009 and Apache adopted it in 2013. There’s a lot of buzz around it, and I have a problem for which it might be appropriate. The goal of Spark is to be faster and more amenable to iterative and interactive development than Hadoop MapReduce, a sort of IPython of Big Data. My traditional approach to learning more is to buy a dead-tree publication, in this case Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia, and then read it on my commute.

The core of Spark is the resilient distributed dataset (RDD), a data structure which can be distributed over multiple computational nodes. Creating an RDD is as simple as passing a file URL to a constructor (the file may be located on some Hadoop-style system) or parallelizing an in-memory data structure. To this data structure are added transformations and actions. Transformations produce another RDD from an input RDD, for example filter() returns an RDD which is the result of applying a filter to each row in the input RDD. Actions produce a non-RDD output, for example count() returns the number of elements in an RDD.
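
To make the transformation / action distinction concrete, here is a toy sketch in plain Python. It is emphatically not Spark's implementation or API: ToyRDD is a made-up class which borrows the method names filter() and count() mentioned above (plus parallelize(), map() and collect(), which I have added for the illustration). Transformations return another ToyRDD and do no work; actions run the chain and return an ordinary Python value.

```python
class ToyRDD:
    """A single-machine toy stand-in for an RDD (hypothetical, not Spark)."""

    def __init__(self, source):
        # source is a zero-argument callable returning a fresh iterator
        self._source = source

    @classmethod
    def parallelize(cls, items):
        # Build a ToyRDD from an in-memory collection
        data = list(items)
        return cls(lambda: iter(data))

    def filter(self, predicate):
        # Transformation: returns a new ToyRDD, nothing is evaluated yet
        parent = self._source
        return ToyRDD(lambda: (x for x in parent() if predicate(x)))

    def map(self, func):
        # Transformation: returns a new ToyRDD, nothing is evaluated yet
        parent = self._source
        return ToyRDD(lambda: (func(x) for x in parent()))

    def count(self):
        # Action: runs the chain and returns a plain int
        return sum(1 for _ in self._source())

    def collect(self):
        # Action: runs the chain and returns a plain list
        return list(self._source())


lines = ToyRDD.parallelize(
    ["INFO start", "ERROR disk full", "INFO done", "ERROR timeout"]
)
errors = lines.filter(lambda line: line.startswith("ERROR"))  # still a ToyRDD
error_count = errors.count()  # an int
messages = errors.map(lambda line: line.split(" ", 1)[1]).collect()  # a list
print(error_count, messages)
```

In Spark proper the same chain would be spread over the nodes of a cluster, but the shape of the program is much the same: build up transformations, then ask for a result with an action.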

Spark provides functionality to control how parts of an RDD are distributed over the available nodes, for example by key. In addition there is functionality to share data across multiple nodes using “Broadcast Variables”, and to aggregate results in “Accumulators”. The behaviour of Accumulators in distributed systems can be complicated since Spark might preemptively execute the same piece of processing twice because of problems on a node.

In addition to Spark Core there are Spark Streaming, Spark SQL, MLlib machine learning, GraphX and SparkR modules. Learning Spark covers the first three of these. The Streaming module handles data such as log files which are continually growing over time using a DStream structure which is composed of a sequence of RDDs with some additional time-related functions. Spark SQL introduces the DataFrame data structure (previously called SchemaRDD) which enables SQL-like queries using HiveQL. The MLlib library introduces a whole bunch of machine learning algorithms such as decision trees, random forests, support vector machines, naive Bayes and logistic regression. It also has support routines to normalise and analyse data, as well as clustering and dimension reduction algorithms.

All of this functionality looks pretty straightforward to access; example code is provided for Scala, Java and Python. Scala is a functional language which runs on the Java virtual machine so appears to get equivalent functionality to Java. Python, on the other hand, appears to be a second class citizen. Functionality, particularly in I/O, is missing Python support. This does raise the question as to whether one should start analysis in Python and make the switch as and when required, or whether to start in Scala or Java where you may well be forced anyway. Perhaps the intended usage is Python for prototyping and Java/Scala for production.

The book is pitched at two audiences, data scientists and software engineers, as is Spark. This would explain support for Python and (more recently) R, to keep the data scientists happy, and Java/Scala for the software engineers. I must admit, looking at examples in Python and Java together, I remember why I love Python! Java requires quite a lot of class declaration boilerplate to get it into the air, and brackets.

Spark will run on a standalone machine; I got it running on Windows 8.1 in short order. Analysis programs appear to be deployable to a cluster unaltered, with the changes handled in configuration files and command line options. The feeling I get from Spark is that it would be entirely appropriate to undertake analysis with Spark which you might do using pandas or scikit-learn locally, and if necessary you could scale up onto a cluster with relatively little additional effort rather than having to learn some fraction of the Hadoop ecosystem.

The book suffers a little from covering a subject area which is rapidly developing: Spark is at version 1.4 as of early June 2015, the book covers version 1.1 and things are happening fast. For example, GraphX and SparkR, more recent additions to Spark, are not covered. That said, this is a great little introduction to Spark. I’m now minded to go off and apply my new-found knowledge to the Kaggle – Avito Context Ad Clicks challenge!

Portinscale 2015

We had an abortive trip to Portinscale in the Lake District for our summer holiday last year, ended prematurely by illness. This year we’re back and have improved greatly on last year’s performance! Portinscale is just outside Keswick, a small town at the head of Derwentwater. In the past we would have stayed a little further from civilisation so we could go for longish walks from the door, but with 3-year-old Thomas a bunch of attractions in easy distance is preferable.

Day 1 – Sunday

Rather than fit packing and driving the relatively short distance to Portinscale from Chester into a day, whilst simultaneously meeting the arrival time requirements, we travelled up on Sunday morning. In the afternoon we went to Whinlatter Forest Park, a few miles up the road. The entrance is guarded by a fine sculpture of an osprey.

It has an extensive collection of trails for pedestrians and cyclists, a Go Ape franchise for people who like swinging from trees, some Gruffalo / Superworm themed trails for children, and a wild play area featuring Thomas’ favourite thing – a pair of Archimedes screws.

There’s also a very nice cafe. We visited Whinlatter several times of an afternoon.

Day 2 – Monday

We went to Mirehouse in the morning, a lakeside estate with a smallish garden and a rather pleasant walk down to Bassenthwaite Lake.

There’s a fine view from the lake down towards Keswick.

In the afternoon we went to the Pencil Museum in Keswick, not a large attraction but Thomas liked Drew the giant and we got 5 pencils for an outlay of £3.

Day 3 – Tuesday

In the morning we went to Threlkeld Mining Museum. It’s full of cranes and various bits of mining machinery from the past 100 years or so. There is a narrow gauge railway line which runs half a mile or so to the head of the quarry from the visitor centre. Threlkeld is not a slick affair but it is great fun for a small child fond of cranes, and the volunteers are obviously enthused by what they are doing. To be honest, I’m rather fond of industrial archaeology too!

Basically, they collect cranes.

All of which are in some degree of elegant decay.

For our visit they were running a little diesel train.

In the afternoon we walked down to Nichols End, a marina on Derwentwater close by our house in Portinscale.

Day 4 – Wednesday

My records show that we last visited Maryport 15 years ago. It has the benefit of being close to Keswick – only half an hour or so away. We enjoyed a brief paddle in the sea, on a beach of our own before heading to the small aquarium in town.

Whinlatter Forest Park once again in the afternoon.

Day 5 – Thursday

On leaving the house we thought we would be mooching around Keswick whilst our car was being seen to for “mysterious dripping”; as it was, Crosthwaite Garage instantly diagnosed an innocuous air conditioning overflow. So we headed off to Lodore Falls, alongside Derwentwater, before returning to Hope Park in Keswick.

Thomas declared the gently dripping woods on the way to Lodore Falls to be “amazing”.

The falls themselves are impressive enough, although the view is a little distant when you are with a small child, who coincidentally loves waterfalls and demands their presence on every walk.

Hope Park was busy, but it is a pretty lakeside area with formal gardens and golf a little back from the shore.

In the afternoon we visited Dodd Wood, which is just over the road from Mirehouse, where we did a rather steep walk.

Day 6 – Friday

On our final day we visited Allan Bank in Grasmere. This is a stealth National Trust property, formerly home to William Wordsworth and one of the founders of the National Trust, Canon Rawnsley. “Stealth” because it is barely advertised or signposted, and is run in a manner far more relaxed than any other National Trust place I’ve visited. It’s a smallish house:

Allan Bank, Grasmere

With glorious views.

The house was damaged by fire a few years ago, and has only really been refurbished in so far as making it weatherproof. Teas and coffees are available on unmatching crockery for a donation (you pay for cake though), and you’re invited to take them where you please to drink. There is a playroom ideally suited to Thomas’ age group, along with rooms Wordsworth and Rawnsley occupied upstairs.

It has the air of a hippy commune, and it’s sort of glorious.

Outside, the grounds are thickly wooded on a steep slope; there is a path running roughly around the perimeter which takes in the wild woods, several dens and some lovely views.

We glimpsed a red squirrel in the woods.

As Thomas wrote, it was “Fun”!

In the afternoon a final trip to Whinlatter Forest Park.

We left on Saturday amidst heavy early morning rain, the only serious daytime rain of the holiday – probably the best week of weather I’ve had in the Lake District!

Book review: Your Inner Fish by Neil Shubin

I’m on holiday so I’ve managed some more reading! This time Your Inner Fish by Neil Shubin, as recommended by my colleague, David Jones, at ScraperWiki.

This is ostensibly a story of a particular distant ancestor of humans, the first to walk on land 375 million years ago, but in practice it is broader than that. It is more generally about what it is to be a modern palaeontologist, and about taxonomy – the classification of living organisms.

Your Inner Fish is a personal account based around the work Shubin and his colleagues did in discovering the Tiktaalik species, the first walker, in the high Canadian Arctic. It turns out the distinguishing features of such animals are the formation of shoulders and a neck. Underwater a fish can easily reorient its whole body to get its head facing the right way; on land a neck to move the head independently and shoulders to mount the front legs become beneficial. Shubin hypothesises that animals such as Tiktaalik evolved to walk on dry land to evade ever larger and more aggressive aquatic predators.

Shubin recounts the process that led him to the Arctic, starting with his earlier fossil hunting in road cuts in Pennsylvania. The trick to fossil hunting is finding bedrock of the right age exposed in moderate amounts. Road cuts are second best in this instance, being rather small in scale. Palaeontologists find their best hunting grounds in deserts and the barren landscape of the north. Finding the right site is a combination of identifying where rocks of the right age are likely to be exposed and knowing whether someone has looked there already.

Once you are in the field, the tricky part comes: finding the fossils. This is a skill akin to being able to resolve a magic eye puzzle, and it is learnt practically in the field rather than theoretically in the classroom. I’m struck by how small some of the most important fossil sites are: Shubin shows a photo of the Tiktaalik site where 6 people basically fill it. The Walcott Quarry in the Burgess Shale is similarly compact.

The central theme of the book is the one-ness of life, in the sense that humans share a huge amount of machinery with all living things to do with the business of building a body. These days the focus of such interest is on DNA, and the similarity of genes and the proteins they encode across huge spans of the tree of life. In earlier times these similarities were identified in developmental processes and anatomy. It is significant that researchers such as Shubin span the fossil, development and genetic domains.

Anatomically fish, lizards, mammals and birds represent the reshuffling of the same components. The multiple jaw bones found in sharks and skates turn into the bones of the inner ear in mammals. The arches which form gills in fish morph and adapt in mammals to leave a weird layout of nerves in the face and skull. These similarities in gross anatomical features are reflected in the molecular machinery which drives development, the formation of complex bodies from a single fertilised cell. Organiser molecules are common across vertebrates.

It’s worth noting the contribution of Hilde Mangold to the development story. Her supervisor Hans Spemann won the 1935 Nobel Prize for medicine based in part on the work on differentiation in amphibian embryos she had presented in her 1923 thesis. She died at the age of 26 in 1924 as the result of an explosion in her apartment building. Nobel Prizes are only awarded to the living.

Why study this taxonomy? The reasons are twofold. First, there is the purely intellectual argument of “because it is there”: the shared features of life are one of the pieces of evidence underpinning the theory of evolution. The second reason is utilitarian: linking all of life into a coherent structure gives us a better understanding of our own bodies, and how to fix them if they go wrong.

As examples of our faulty body Shubin highlights hiccups and hernias. Hiccups because the reflexes leading to hiccups are the descendants of the reflexes of tadpoles which allowed them to breathe through gills as well as lungs. Hernias because the placement of the testes outside the abdomen is an evolution from our fish ancestors who kept gonads internally – external placement is a botched job which leads to a weakness in the abdomen wall, particularly in men.

This book is shorter and more personal than Richard Dawkins’ and Stephen Jay Gould’s work in a similar vein.

I liked it.

Book review: Gut by Giulia Enders

It seems a while since I last reviewed a book here. Today I bring you Gut: The Inside Story of our Body’s Most Underrated Organ by Giulia Enders.

The book does exactly what it says on the tin: tells us about the gut. It is divided into three broad sections. Firstly, the mechanics of it all, including going to the toilet and how to do it better. Secondly, the nervous system and the gut, and finally the bacterial flora that help the gut do its stuff.

The writing style seems to be directed at the early to mid-teenager, which gets a bit grating in places. Sometimes things end up outright surreal: salmonella wear hats and I still don’t quite understand why. The text is illustrated with jaunty little drawings.

From the mechanical point of view several things were novel to me: the presence of an involuntary internal sphincter shortly before the well-known external one. The internal sphincter allows “sampling” of what is heading for the outside world giving the owner the opportunity to decide what to do with their external sphincter.

The immune tissue in the tonsillar ring was also new to me; its job is to sample anything heading towards the gut. This is most important in young children before their immune systems are fully trained. Related to the tonsils, the appendix also contains much immune tissue and has a role in repopulating the large intestine with more friendly sorts of bacteria following a bout of diarrhoea.

The second section, on the nervous system of the gut, covers things such as vomiting, constipation and the links between the gut and depression.

The section on the bacterial flora of the gut gathers together some of the stories you may have already heard. For example, the work by Marshall on Helicobacter pylori and its role in the formation of stomach ulcers. What I hadn’t realised is that H. pylori is not thought to be all bad. Its benefits are in providing some defence against asthma and autoimmune diseases. Also in this section is toxoplasmosis, the cat-borne parasite which can affect rats and humans, making them more prone to risk-taking behaviour.

I was delighted to discover the use to which sellotape is put in the detection of threadworms – potential sufferers are asked to collect threadworm eggs from around the anus using sellotape. I can imagine this is an unusual experience which I don’t intend to try without good reason.

There is a small amount of evangelism for breast-feeding and organic food which I found a little bit grating.

As usual with electronic books I hit the references section somewhat sooner than I expected, and here there is a clash with the casual style of the body of the book. Essentially, it is referenced as a scientific paper would be – to papers in the primary literature.

I don’t feel this book has left me with any great and abiding thoughts but on the other hand learning more about the crude mechanics of my body is at least a bit useful.