Jul 06 2015
Book Review: Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia
This post was first published at ScraperWiki.
Apache Spark is a system for doing data analysis which can be run on a single machine or across a cluster. It is pretty new technology: initial work was in 2009 and Apache adopted it in 2013. There’s a lot of buzz around it, and I have a problem for which it might be appropriate. The goal of Spark is to be faster and more amenable to iterative and interactive development than Hadoop MapReduce, a sort of IPython of Big Data. I used my traditional approach to learning more: buying a dead-tree publication, Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia, and then reading it on my commute.
The core of Spark is the resilient distributed dataset (RDD), a data structure which can be distributed over multiple computational nodes. Creating an RDD is as simple as passing a file URL to a constructor (the file may be located on some Hadoop-style system) or parallelizing an in-memory data structure. To this data structure are added transformations and actions. Transformations produce another RDD from an input RDD; for example, filter() returns an RDD containing only those elements of the input RDD which pass a test. Actions produce a non-RDD output; for example, count() returns the number of elements in an RDD.
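The split between transformations and actions can be sketched in plain Python. This is a toy model rather than Spark’s API: the `ToyRDD` class is hypothetical, it only borrows the names filter(), map(), count() and collect(), and it runs on one machine with no distribution at all. In Spark proper, transformations are lazy and nothing is computed until an action is called; the sketch imitates this by storing a chain of pending steps.

```python
class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD (not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source  # any re-iterable collection, e.g. a list
        self._steps = steps    # pending transformations, not yet applied

    # Transformations: return a new ToyRDD and compute nothing.
    def filter(self, predicate):
        return ToyRDD(self._source, self._steps + (("filter", predicate),))

    def map(self, func):
        return ToyRDD(self._source, self._steps + (("map", func),))

    # Actions: run the pending steps and return a plain Python value.
    def _evaluate(self):
        for item in self._source:
            keep = True
            for kind, func in self._steps:
                if kind == "map":
                    item = func(item)
                elif not func(item):  # kind == "filter"
                    keep = False
                    break
            if keep:
                yield item

    def count(self):
        return sum(1 for _ in self._evaluate())

    def collect(self):
        return list(self._evaluate())


lines = ToyRDD(["INFO start", "ERROR disk full", "INFO done", "ERROR timeout"])
errors = lines.filter(lambda line: line.startswith("ERROR"))       # transformation
error_count = errors.count()                                       # action
error_words = errors.map(lambda line: line.split()[1]).collect()   # action
print(error_count, error_words)
```

Here `errors` is still a `ToyRDD`, whereas `error_count` and `error_words` are an ordinary integer and list.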
Spark provides functionality to control how parts of an RDD are distributed over the available nodes, for example by key. In addition there is functionality to share data across multiple nodes using “Broadcast Variables”, and to aggregate results in “Accumulators”. The behaviour of Accumulators in distributed systems can be complicated, since Spark might execute the same piece of processing more than once, for example re-running a task because of problems on a node.
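The accumulator problem can be shown with a toy simulation in plain Python. Nothing here is Spark code: the partition data, `process_partition` and the list of executions are all made up for illustration. The underlying point, as far as I understand the Spark documentation, is that updates made inside transformations may be applied more than once if a task is re-executed, whereas updates made inside actions are applied once per task.

```python
def process_partition(partition):
    """Return (cleaned rows, number of blank rows seen) for one partition."""
    cleaned = [row for row in partition if row.strip()]
    return cleaned, len(partition) - len(cleaned)


# Hypothetical data: two partitions of a dataset, keyed by partition id.
partitions = {0: ["a", "", "b"], 1: ["", "", "c"]}

# Pretend partition 1 is executed twice, e.g. because its node was slow
# and the scheduler started a second copy of the task.
executions = [0, 1, 1]

# Naive accumulator: every execution adds to the running total.
naive_blank_count = 0
for pid in executions:
    _, blanks = process_partition(partitions[pid])
    naive_blank_count += blanks

# Safer: keep one update per partition id, so a repeated execution
# overwrites the earlier update instead of adding to it.
updates_by_partition = {}
for pid in executions:
    _, blanks = process_partition(partitions[pid])
    updates_by_partition[pid] = blanks
deduplicated_blank_count = sum(updates_by_partition.values())

print(naive_blank_count, deduplicated_blank_count)
```

The naive count includes the blanks in partition 1 twice; the deduplicated count matches the number of blank rows actually present in the data.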
In addition to Spark Core there are Spark Streaming, Spark SQL, MLlib machine learning, GraphX and SparkR modules. Learning Spark covers the first three of these. The Streaming module handles data such as log files which are continually growing over time, using a DStream structure which is composed of a sequence of RDDs with some additional time-related functions. Spark SQL introduces the DataFrame data structure (previously called SchemaRDD), which enables SQL-like queries using HiveQL. The MLlib library introduces a whole bunch of machine learning algorithms such as decision trees, random forests, support vector machines, naive Bayes and logistic regression. It also has support routines to normalise and analyse data, as well as clustering and dimension reduction algorithms.
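The idea of a DStream as a sequence of RDDs, one per time interval, can be sketched in plain Python. This is only an illustration of micro-batching under assumed inputs: `micro_batches`, the event tuples and the two-second interval are hypothetical, and none of it is the Spark Streaming API.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches.

    Each returned list plays the role of one RDD in the sequence.
    Intervals in which no events arrived are skipped in this sketch.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_seconds)].append(value)
    return [batches[index] for index in sorted(batches)]


# Hypothetical log events: (seconds since start, request line).
log_events = [
    (0.5, "GET /"),
    (1.2, "GET /about"),
    (2.1, "POST /login"),
    (2.9, "GET /"),
    (5.0, "GET /"),
]

batches = micro_batches(log_events, batch_seconds=2)
counts_per_batch = [len(batch) for batch in batches]
print(counts_per_batch)
```

Any per-batch computation, such as the count here, is then just an ordinary operation applied to each batch in turn.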
All of this functionality looks pretty straightforward to access; example code is provided for Scala, Java and Python. Scala is a functional language which runs on the Java virtual machine, so it appears to get equivalent functionality to Java. Python, on the other hand, appears to be a second-class citizen: some functionality, particularly in I/O, is missing Python support. This does raise the question as to whether one should start analysis in Python and make the switch as and when required, or start in Scala or Java, where you may well be forced to go anyway. Perhaps the intended usage is Python for prototyping and Java/Scala for production.
The book, like Spark itself, is pitched at two audiences: data scientists and software engineers. This would explain the support for Python and (more recently) R, to keep the data scientists happy, and Java/Scala for the software engineers. I must admit that, looking at examples in Python and Java together, I remember why I love Python! Java requires quite a lot of class declaration boilerplate to get it into the air, and brackets.
Spark will run on a standalone machine; I got it running on Windows 8.1 in short order. Analysis programs appear to be deployable to a cluster unaltered, with the changes handled in configuration files and command line options. The feeling I get from Spark is that it would be entirely appropriate to undertake analysis with Spark which you might otherwise do using pandas or scikit-learn locally, and if necessary you could scale up onto a cluster with relatively little additional effort, rather than having to learn some fraction of the Hadoop ecosystem.
The book suffers a little from covering a subject area which is rapidly developing: as of early June 2015 Spark is at version 1.4, whereas the book covers version 1.1, and things are happening fast. For example GraphX and SparkR, more recent additions to Spark, are not covered. That said, this is a great little introduction to Spark. I’m now minded to go off and apply my new-found knowledge to the Kaggle – Avito Context Ad Clicks challenge!
Jul 05 2015
Portinscale 2015
We had an abortive trip to Portinscale in the Lake District for our summer holiday last year, ended prematurely by illness. This year we’re back and have improved greatly on last year’s performance! Portinscale is just outside Keswick, a small town at the head of Derwentwater. In the past we would have stayed a little further from civilisation so we could go for longish walks from the door, but with 3-year-old Thomas a bunch of attractions within easy distance is preferable.
Day 1 – Sunday
Rather than fit packing and driving the relatively short distance to Portinscale from Chester into a day, whilst simultaneously meeting the arrival time requirements, we travelled up on Sunday morning. In the afternoon we went to Whinlatter Forest Park, a few miles up the road. The entrance is guarded by a fine sculpture of an osprey.
It has an extensive collection of trails for pedestrians and cyclists, a Go Ape franchise for people who like swinging from trees, and some Gruffalo / Superworm themed trails for children. There is also a wild play area featuring Thomas’ favourite thing – a pair of Archimedes screws:
There’s also a very nice cafe. We visited Whinlatter several times of an afternoon.
Day 2 – Monday
We went to Mirehouse in the morning, a lakeside estate with a smallish garden and a rather pleasant walk down to Bassenthwaite Lake.
There’s a fine view from the lake down towards Keswick.
In the afternoon we went to the Pencil Museum in Keswick, not a large attraction but Thomas liked Drew the giant and we got 5 pencils for an outlay of £3.
Day 3 – Tuesday
In the morning we went to Threlkeld Mining Museum. It’s full of cranes and various bits of mining machinery from the past 100 years or so. There is a narrow gauge railway line which runs half a mile or so to the head of the quarry from the visitor centre. Threlkeld is not a slick affair but it is great fun for a small child fond of cranes, and the volunteers are obviously enthused by what they are doing. To be honest, I’m rather fond of industrial archaeology too!
Basically, they collect cranes.
All of which are in some degree of elegant decay.
For our visit they were running a little diesel train:
In the afternoon we walked down to Nichols End, a marina on Derwentwater close by our house in Portinscale.
Day 4 – Wednesday
My records show that we last visited Maryport 15 years ago. It has the benefit of being close to Keswick – only half an hour or so away. We enjoyed a brief paddle in the sea, on a beach of our own, before heading to the small aquarium in town.
Whinlatter Forest Park once again in the afternoon.
Day 5 – Thursday
On leaving the house we thought we would be mooching around Keswick whilst our car was being seen to for “mysterious dripping”; as it was, Crosthwaite Garage instantly diagnosed an innocuous air conditioning overflow. So we headed off to Lodore Falls, alongside Derwentwater, before returning to Hope Park in Keswick.
Thomas declared the gently dripping woods on the way to Lodore Falls to be “amazing”:
The falls themselves are impressive enough, although the view is a little distant when you are with a small child, who coincidentally loves waterfalls and demands their presence on every walk:
Hope Park was busy, but it is a pretty lakeside area with formal gardens and golf a little back from the shore.
In the afternoon we visited Dodd Wood, which is just over the road from Mirehouse, where we did a rather steep walk.
Day 6 – Friday
On our final day we visited Allan Bank in Grasmere. This is a stealth National Trust property, formerly home to William Wordsworth and to one of the founders of the National Trust, Canon Rawnsley. “Stealth” because it is barely advertised or signposted, and is run in a manner far more relaxed than any other National Trust place I’ve visited. It’s a smallish house:
With glorious views:
The house was damaged by fire a few years ago, and has only really been refurbished in as far as making it weatherproof. Teas and coffees are available on mismatched crockery for a donation (you pay for cake though), and you’re invited to take them where you please to drink. There is a playroom ideally suited to Thomas’ age group, along with rooms Wordsworth and Rawnsley occupied upstairs.
It has the air of a hippy commune, and it’s sort of glorious.
Outside, the grounds are thickly wooded on a steep slope; there is a path running approximately around the perimeter which takes in the wild woods, several dens and some lovely views.
We glimpsed a red squirrel in the woods.
As Thomas wrote, it was “Fun”!
In the afternoon a final trip to Whinlatter Forest Park.
We left on Saturday amidst heavy early morning rain, the only serious daytime rain of the holiday – probably the best week of weather I’ve had in the Lake District!
Jul 05 2015
Book review: Your Inner Fish by Neil Shubin
I’m on holiday so I’ve managed some more reading! This time it is Your Inner Fish by Neil Shubin, as recommended by my colleague at ScraperWiki, David Jones.
This is ostensibly the story of a particular distant ancestor of humans, the first to walk on land 375 million years ago, but in practice it is broader than that. It is more generally about what it is to be a modern palaeontologist, and about taxonomy – the classification of living organisms.
Your Inner Fish is a personal account based around the work Shubin and his colleagues did in discovering the Tiktaalik species, the first walker, in the high Canadian Arctic. It turns out the distinguishing features of such animals are the formation of shoulders and a neck: underwater a fish can easily reorient its whole body to get its head facing the right way, whereas on land a neck to move the head independently and shoulders to mount the front legs become beneficial. Shubin hypothesises that animals such as Tiktaalik evolved to walk on dry land to evade ever larger and more aggressive aquatic predators.
Shubin recounts the process that led him to the Arctic, starting with his earlier fossil hunting in road cuts in Pennsylvania. The trick to fossil hunting is finding bedrock of the right age exposed in moderate amounts. Road cuts are a second best in this instance, being rather small in scale. Palaeontologists find their best hunting grounds in deserts and the barren landscape of the north. Finding the right site is a combination of identifying where rocks of the right age are likely to be exposed and knowing whether someone has looked there already.
Once you are in the field, the tricky part comes: finding the fossils. This is a skill akin to being able to resolve a magic eye puzzle, and one which is learnt practically in the field rather than theoretically in the classroom. I’m struck by how small some of the most important fossil sites are: Shubin shows a photo of the Tiktaalik site where 6 people basically fill it. The Walcott Quarry in the Burgess Shale is similarly compact.
The central theme of the book is the one-ness of life, in the sense that humans share a huge amount of machinery with all living things to do with the business of building a body. These days the focus of such interest is on DNA, and the similarity of genes and the proteins they encode across huge spans of the tree of life. In earlier times these similarities were identified in developmental processes and anatomy. It is significant that researchers such as Shubin span the fossil, development and genetic domains.
Anatomically, fish, lizards, mammals and birds represent the reshuffling of the same components. The multiple jaw bones found in sharks and skates turn into the bones of the middle ear in mammals. The arches which form gills in fish morph and adapt in mammals to leave a weird layout of nerves in the face and skull. These similarities in gross anatomical features are reflected in the molecular machinery which drives development: the formation of complex bodies from a single fertilised cell. Organiser molecules are common across vertebrates.
It’s worth noting the contribution of Hilde Mangold to the development story. Her supervisor Hans Spemann won the 1935 Nobel Prize for medicine based in part on the work on differentiation in amphibian embryos she had presented in her 1923 thesis. She died at the age of 26 in 1924 as the result of an explosion in her apartment building. Nobel Prizes are only awarded to the living.
Why study this taxonomy? The reasons are two-fold. First, there is the purely intellectual argument of “because it is there”: the shared features of life are one of the pieces of evidence underpinning the theory of evolution. The second reason is utilitarian: linking all of life into a coherent structure gives us a better understanding of our own bodies, and how to fix them if they go wrong.
As examples of our faulty bodies Shubin highlights hiccups and hernias. Hiccups because the reflexes leading to hiccups are the descendants of the reflexes of tadpoles which allowed them to breathe through gills as well as lungs. Hernias because the placement of the testes outside the abdomen is an evolution from our fish ancestors who kept gonads internally – external placement is a botched job which leads to a weakness in the abdomen wall, particularly in men.
This book is shorter and more personal than Richard Dawkins’ and Stephen Jay Gould’s work in a similar vein.
I liked it.
Jun 29 2015
Book review: Gut by Giulia Enders
It seems a while since I last reviewed a book here. Today I bring you Gut: The Inside Story of our Body’s Most Underrated Organ by Giulia Enders.
The book does exactly what it says on the tin: it tells us about the gut. It is divided into three broad sections: firstly, the mechanics of it all, including going to the toilet and how to do it better; secondly, the nervous system and the gut; and finally the bacterial flora that help the gut do its stuff.
The writing style seems to be directed at the early to mid-teenager, which gets a bit grating in places. Sometimes things end up outright surreal: salmonella wear hats and I still don’t quite understand why. The text is illustrated with jaunty little drawings.
From the mechanical point of view several things were novel to me: the presence of an involuntary internal sphincter shortly before the well-known external one. The internal sphincter allows “sampling” of what is heading for the outside world giving the owner the opportunity to decide what to do with their external sphincter.
The immune tissue in the tonsillar ring was also new to me; its job is to sample anything heading towards the gut. This is most important in young children before their immune systems are fully trained. Related to the tonsils, the appendix also contains much immune tissue and has a role in repopulating the large intestine with more friendly sorts of bacteria following a bout of diarrhoea.
The second section, on the nervous system of the gut, covers things such as vomiting, constipation and the links between the gut and depression.
The section on the bacterial flora of the gut gathers together some of the stories you may have already heard. For example, the work by Marshall on Helicobacter pylori and its role in the formation of stomach ulcers. What I hadn’t realised is that H. pylori is not thought to be all bad. Its benefits are in providing some defence against asthma and autoimmune diseases. Also in this section is toxoplasmosis, caused by the cat-borne parasite which can affect rats and humans, making them more prone to risk-taking behaviour.
I was delighted to discover the use to which sellotape is put in the detection of threadworms – potential sufferers are asked to collect threadworm eggs from around the anus using sellotape. I can imagine this is an unusual experience which I don’t intend to try without good reason.
There is a small amount of evangelism for breast-feeding and organic food which I found a little bit grating.
As usual with electronic books I hit the references section somewhat sooner than I expected, and here there is a clash with the casual style of the body of the book. Essentially, it is referenced as a scientific paper would be – to papers in the primary literature.
I don’t feel this book has left me with any great and abiding thoughts, but on the other hand learning more about the crude mechanics of my body is at least a bit useful.
Jun 15 2015
Book review: Mastering Gephi Network Visualisation by Ken Cherven
This review was first posted at ScraperWiki.
A little while ago I reviewed Ken Cherven’s book Network Graph Analysis and Visualisation with Gephi; it’s fair to say I was not very complimentary about it. It was rather short, and had quite a lot of screenshots. Its strength was in introducing every single element of the Gephi interface. This book, Mastering Gephi Network Visualisation by Ken Cherven, is a different, and better, book.
Networks in this context are collections of nodes connected by edges, and they are ubiquitous. The nodes may be people in a social network, and the edges their friendships. Or the nodes might be proteins and metabolic products and the edges the reaction pathways between them. Or any other of a multitude of systems. I’ve reviewed a couple of other books in this area, including Barabási’s popular account of the pervasiveness of networks, Linked, and van Steen’s undergraduate textbook, Graph Theory and Complex Networks, which cover the maths of network (or graph) theory in some detail.
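A minimal sketch of the nodes-and-edges idea in plain Python, using a made-up friendship network. The names and edges are hypothetical, and the degree (number of connections per node) is just one of the simplest statistics one can compute from such a structure.

```python
from collections import defaultdict

# Hypothetical social network: each edge is a friendship between two people.
friendships = [("Ann", "Bob"), ("Ann", "Cat"), ("Bob", "Cat"), ("Cat", "Dan")]

# Build an undirected adjacency list: node -> set of neighbouring nodes.
adjacency = defaultdict(set)
for person_a, person_b in friendships:
    adjacency[person_a].add(person_b)
    adjacency[person_b].add(person_a)

# Degree of a node = number of edges attached to it.
degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
best_connected = max(degree, key=degree.get)
print(degree, best_connected)
```

An edge list like `friendships` is also roughly the shape of data that network tools typically accept as input, typically as a two-column table of source and target nodes.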
Mastering Gephi is a practical guide to using the Gephi network visualisation software; it covers the more theoretical material regarding networks in a peripheral fashion. Gephi is the most popular open source network visualisation system of which I’m aware; it is well-featured and under active development. Many of the network visualisations you see of, for example, Twitter social networks will have been generated using Gephi. It is a pretty complex piece of software, and if you don’t want to rely on information on the web, or taught courses, then Cherven’s books are pretty much your only alternative.
The core chapters are on layouts, filters, statistics, segmenting and partitioning, and dynamic networks. Outside this there are some more general chapters, including one on exporting visualisations and an odd one on “network patterns” which introduced diffusion and contagion in networks but then didn’t go much further.
I found the layouts chapter particularly useful; it’s a review of the various layout algorithms available. In most cases there is no “correct” way of drawing a network on a 2D canvas: layout algorithms are designed to distribute nodes and edges on a canvas to enable the viewer to gain understanding of the network they represent. From this chapter I discovered the directed acyclic graph (DAG) layout, which can be downloaded as a Gephi plugin. Tip: I had to go and search this plugin out manually in the Gephi Marketplace; it didn’t get installed when I indiscriminately tried to install all plugins. The DAG layout is good for showing tree structures such as organisational diagrams.
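To give a flavour of what a layout algorithm for a DAG has to do, here is a minimal sketch of one common approach: put each node on a layer one below its deepest parent, then spread the nodes in each layer out horizontally. This is not necessarily what the Gephi plugin does; `assign_layers`, `layout` and the organisation chart are all hypothetical, and the input is assumed to contain no cycles.

```python
def assign_layers(edges):
    """Map each node of a DAG to a layer: roots get 0, others 1 + deepest parent."""
    parents = {}
    nodes = set()
    for parent, child in edges:
        nodes.update((parent, child))
        parents.setdefault(child, []).append(parent)

    layers = {}

    def layer_of(node):
        # Memoised recursion; terminates because the graph is assumed acyclic.
        if node not in layers:
            parent_layers = (layer_of(p) for p in parents.get(node, []))
            layers[node] = 1 + max(parent_layers, default=-1)
        return layers[node]

    for node in nodes:
        layer_of(node)
    return layers


def layout(edges):
    """Return node -> (x, y), with y the layer and x the position within it."""
    layers = assign_layers(edges)
    rows = {}
    for node in sorted(layers):  # alphabetical order within each layer
        rows.setdefault(layers[node], []).append(node)
    return {node: (x, depth) for depth, row in rows.items() for x, node in enumerate(row)}


# Hypothetical organisational diagram as (manager, report) edges.
org_chart = [("CEO", "CTO"), ("CEO", "CFO"), ("CTO", "Engineer"), ("CFO", "Accountant")]
positions = layout(org_chart)
print(positions)
```

A real layout would also try to order nodes within a layer so as to reduce edge crossings; the alphabetical ordering here is only a placeholder.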
I learnt of the “Chinese Whispers” and “Markov clustering” algorithms for identifying clusters within a network in the chapter on segmenting and partitioning. These algorithms are not covered in detail but sufficient information is provided that you can try them out on a network of your choice, and go look up more information on their implementation if desired. The filtering chapter is very much about the mechanics of how to do a thing in Gephi (filter a network to show a subset of nodes), whilst the statistics chapter is more about the range of network statistical measures known in the literature.
I was aware of the ability of Gephi to show dynamic networks, ones that evolve over time, but had never experimented with this functionality. Cherven’s book provides an overview of this functionality using data from baseball as an example. The example datasets are quite appealing; they include social networks in schools, baseball, and jazz musicians. I suspect they are standard examples in the network literature, but this is no bad thing.
The book follows the advice that my old PhD supervisor gave me on giving presentations: tell the audience what you are going to tell them, tell them, and then tell them what you told them. This works well for the limited time available in a spoken presentation, where repetition helps the audience remember, but it feels a bit like overkill in a book. In a book we can flick back to remind us what was written earlier.
It’s a bit frustrating that the book is printed in black and white, particularly at the point where we are asked to admire the blue and yellow parts of a network visualisation! The referencing is a little erratic with a list of books appearing in the bibliography but references to some of the detail of algorithms only found in the text.
I’m happy to recommend this book as a solid overview of Gephi for those that prefer to learn from dead tree, such as myself. It has good coverage of Gephi features, and some interesting examples. In places it is a little shallow and repetitive.
The publisher sent me this book, free of charge, for review.