Category: Book Reviews

Reviews of books featuring a summary of the book and links to related material

Book review: Machine Learning in Action by Peter Harrington

This post was first published at ScraperWiki.

Machine learning is about prediction, and prediction is a valuable commodity. This sounds pretty cool and definitely the sort of thing a data scientist should be into, so I picked up Machine Learning in Action by Peter Harrington to get an overview of the area.

Amongst the examples covered in this book are:

  • Given that a customer bought these items, what other items are they likely to want?
  • Is my horse likely to die from colic given these symptoms?
  • Is this email spam?
  • Given that these representatives have voted this way in the past, how will they vote in future?

In order to make a prediction, machine learning algorithms take a set of features and a target for a training set of examples. Once the algorithm has been trained, it can take new feature sets and make predictions based on them. Let’s take a concrete example: if we were classifying birds, the birds’ features would include the weight, size, colour and so forth and the target would be the species. We would train the algorithm on an initial set of birds where we knew the species, then we would measure the features of unknown birds and submit these to the algorithm for classification.
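The train-then-predict cycle described above can be sketched with k-Nearest Neighbours, one of the classifiers the book covers. This is a minimal illustration rather than Harrington's code: the species, the two features (weight and wingspan) and every number in `TRAINING` are invented for the example.

```python
from collections import Counter
import math

# Hypothetical training set: (weight in g, wingspan in cm) -> known species.
# All values are made up for illustration.
TRAINING = [
    ((20.0, 22.0), "robin"),
    ((18.0, 21.0), "robin"),
    ((22.0, 23.0), "robin"),
    ((100.0, 38.0), "blackbird"),
    ((95.0, 36.0), "blackbird"),
    ((110.0, 39.0), "blackbird"),
]


def classify(features, training=TRAINING, k=3):
    """Predict a species by majority vote among the k nearest training birds."""
    # Sort the labelled examples by Euclidean distance to the unknown bird.
    by_distance = sorted(training, key=lambda example: math.dist(example[0], features))
    # Count the species of the k closest examples and return the most common.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


print(classify((21.0, 22.5)))
print(classify((98.0, 37.0)))
```

With k-Nearest Neighbours, "training" amounts to storing the labelled examples. In practice the features would usually be rescaled first so that one measurement (here weight) does not dominate the distance.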

In this case, because we know the target – a species of bird – the algorithms we use would be referred to as “supervised learning.” This contrasts with “unsupervised learning,” where the target is unknown and the algorithm is seeking to make its own classification. This would be equivalent to the algorithm creating species of birds by clustering those with similar features. Classification is the prediction of categories (e.g. eye colour, like/dislike); regression, by contrast, is used to predict the value of continuous variables (e.g. height, weight).
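To make the unsupervised case concrete, here is a rough sketch of k-means clustering, which groups unlabelled points by similarity. It is a simplified illustration, not the book's implementation: the points are invented, and starting the centroids at the first `k` points is an assumption made to keep the example deterministic (real implementations usually start from random or more carefully chosen points).

```python
import math


def k_means(points, k, iterations=20):
    """Cluster points into k groups; returns (labels, centroids)."""
    centroids = list(points[:k])  # naive start: the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return labels, centroids


# Hypothetical unlabelled measurements forming two obvious groups.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = k_means(POINTS, k=2)
print(labels)
```

No species names are supplied anywhere: the algorithm only reports which points belong together, and it is up to the analyst to decide what the groups mean.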

Machine Learning in Action is divided into four sections, which cover key elements and “additional tools”; the latter include algorithms for dimension reduction and MapReduce – a framework for parallelisation. Dimension reduction is the process of identifying which features (or combination of features) are essential to a problem.

Each section includes Python code that implements the algorithms under discussion, and these are applied to some toy problems. This gives the book the air of Numerical Recipes in FORTRAN, which is where I cut my teeth on numerical analysis. The mixture of code and prose is excellent for understanding exactly how an algorithm works, but it’s better to use a library implementation in real life.

The algorithms covered are:

  • Classification – k-Nearest Neighbours, decision trees, naive Bayes, logistic regression, support vector machines, and AdaBoost;
  • Regression – linear regression, locally weighted linear regression, ridge regression, tree-based regression;
  • Unsupervised learning – k-means clustering, apriori algorithm, FP-growth;
  • Additional tools – principal component analysis and singular value decomposition.

Prerequisites for this book are relatively high: it assumes fair Python knowledge, some calculus, probability theory and matrix algebra.

I’ve seen a lot of mention of MapReduce without being clear what it was. Now I am clearer: it is a simple framework for carrying out parallel computation. Parallel computing has been around for quite some time; the problem has always been designing algorithms that accommodate parallelisation (i.e. allow problems to be broken up into pieces which can be solved separately and then recombined). MapReduce doesn’t solve this problem but gives a recipe for what is required to run on a commodity compute cluster.
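The shape of that recipe can be shown in a few lines. The sketch below runs in a single process and uses word counting as a stand-in problem, so it illustrates only the map, shuffle and reduce steps, not the distribution across machines that a real framework provides. The function names are my own and not those of any particular MapReduce library.

```python
from collections import defaultdict


def map_phase(document):
    """Map: turn one input record into a list of (key, value) pairs."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different machine, and each key
    # reduced on a different machine; here everything runs sequentially.
    pairs = [pair for document in documents for pair in map_phase(document)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["the cat sat", "the dog sat down"]))
```

The point of the structure is that `map_phase` needs only one record and `reduce_phase` needs only one key, so neither step requires a view of the whole dataset.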

As Harrington says: do you need to run MapReduce on a cluster to solve your data problem? Unless you are an operation on the scale of Google or Facebook then probably not. Current, commodity desktop hardware is surprisingly powerful particularly when coupled with subtle algorithms.

This book works better as an eBook than on paper, partly because the paper version is black and white and some figures require colour. However, the program listings are often images, so their text remains small.

Book review: Data Visualization: a successful design process by Andy Kirk

This post was first published at ScraperWiki.

My next review is of Andy Kirk’s book Data Visualization: a successful design process. Those of you on Twitter might know him as @visualisingdata, where you can follow his progress around the world as he delivers training. He also blogs at Visualising Data.

Previously in this area, I’ve read Tufte’s book The Visual Display of Quantitative Information and Nathan Yau’s Visualize This. Tufte’s book is based around a theory of effective visualisation whilst Visualize This is a more practical guide featuring detailed code examples. Kirk’s book fits between the two: it contains some material on the more theoretical aspects of effective visualisation as well as an annotated list of software tools, but the majority of the book covers the end-to-end design process.

Data Visualization introduced me to Anscombe’s Quartet. The Quartet is four small datasets, with eleven (x,y) coordinate pairs in each. The Quartet is chosen so the common statistical properties (e.g. mean values of x and y, standard deviations for same, linear regression coefficients) for each set are identical, but when plotted they look very different. The numbers are shown in the table below.

[Image: table of the Anscombe’s Quartet data]

Plotted they look like this:

[Image: plots of the four Anscombe’s Quartet datasets]

Aside from set 4, the numbers look unexceptional. However, the plots look strikingly different. We can easily classify their differences visually, despite the sets having the same gross statistical properties. This highlights the power of visualisation. As a scientist, I am constantly plotting the data I’m working on to see what is going on and as a sense check: eyeballing columns of numbers simply doesn’t work. Kirk notes that the design criteria for such exploratory visualisations are quite different from those highlighting particular aspects of a dataset, more abstract “data art” presentations, or interactive visualisations prepared for others to use.
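The claim that the four sets share their summary statistics is easy to check. The sketch below uses the standard published values of Anscombe’s Quartet (typed in from the well-known dataset, not copied from Kirk’s book) and computes the means, standard deviations and least-squares line for each set using only the standard library. The helper name `summarise` is my own.

```python
from statistics import mean, stdev

# Standard published Anscombe values; sets 1 to 3 share the same x column.
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    1: (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(xs, ys):
    """Means, sample standard deviations and least-squares line for one set."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": x_bar,
        "mean_y": y_bar,
        "sd_x": stdev(xs),
        "sd_y": stdev(ys),
        "slope": slope,
        "intercept": y_bar - slope * x_bar,
    }


SUMMARIES = {name: summarise(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, summary in SUMMARIES.items():
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

Rounded to two decimal places the four rows agree, which is exactly why the numbers alone give no hint of how different the plots are.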

In contrast to the books by Tufte and Yau, this book is much more about how to do data visualisation as a job. It talks pragmatically about getting briefs from the client and their demands. I suspect much of this would apply to any design work.

I liked Kirk’s “Eight Hats of data visualisation design” metaphor, which names the skills a visualiser requires: Initiator, Data Scientist, Journalist, Computer Scientist, Designer, Cognitive Scientist, Communicator and Project Manager. In part, this covers what you will require to do data visualisation, but it also gives you an idea of whom you might turn to for help – someone with the right hat.

The book is scattered with examples of interesting visualisations, alongside a comprehensive taxonomy of chart types. Unsurprisingly, the chart types are classified in much the same way as statistical methods: in terms of the variable categories to be displayed (i.e. continuous, categorical and subdivisions thereof). There is a temptation here though: I now want to make a Sankey diagram… even if my data doesn’t require it!

In terms of visualisation creation tools, there are no real surprises. Kirk cites Excel first, but this is reasonable: it’s powerful, ubiquitous, easy to use and produces decent results as long as you don’t blindly accept defaults or get tempted into using 3D pie charts. He also mentions the use of Adobe Illustrator or Inkscape to tidy up charts generated in more analysis-oriented packages such as R. With a programming background, the temptation is to fix problems with layout and design programmatically, which can be immensely difficult. Listed under programming environments is the D3 JavaScript library; this is a system I’m interested in using, having had some fun with Protovis, a D3 predecessor.

Data Visualization works very well as an ebook. The figures are in colour (unlike the printed book) and references are hyperlinked from the text. It’s quite a slim volume, which I suspect complements Andy Kirk’s “in-person” courses well.

Book review: The Dinosaur Hunters by Deborah Cadbury

A rapid change of gear for my book reviewing: having spent several months reading “The Eighth Day of Creation” I have completed “The Dinosaur Hunters” by Deborah Cadbury in only a couple of weeks. Is this a bad thing? Yes, and no – it’s been nice to read a book that rattles along at a good pace, is gripping and doesn’t have me leaping to make notes at every page – the downside is that I feel I have consumed a literary snack rather than a meal.

The Dinosaur Hunters covers the initial elucidation of the nature of large animal fossils, principally of dinosaurs, from around the beginning of the 19th century to just after the publication of Darwin’s “Origin of Species” in 1859. The book is centred around Gideon Mantell (1790-1852), who first described the Iguanodon and was an expert in the geology of the Weald, at the same time running a thriving medical practice in his home town of Lewes. Playing the part of Mantell’s nemesis is Richard Owen (1804-1892), who formally described the group of species, the Dinosauria, and was to be the driving force in the founding of the Natural History Museum in the later years of the 19th century. Smaller parts are played by Mary Anning (1799-1847), a fossil collector based in Lyme Regis; William Buckland (1784-1856), who described Megalosaurus – the first of the dinosaurs – and spent much of his life trying to reconcile his Christian faith with new geological findings; Georges Cuvier (1769-1832), the noted French anatomist who related fossil anatomy to modern animal anatomy and identified the existence of extinctions (although he was a catastrophist who saw this as evidence of different epochs of extinction rather than a side effect of evolution); Charles Lyell (1797-1875), a champion of uniformitarianism (the idea that modern geology is the result of processes visible today continuing over great amounts of time); Charles Darwin (1809-1882), who really needs no introduction; and Thomas Huxley (1825-1895), a muscular proponent of Darwin’s evolutionary theory.

For me a recurring theme was that of privilege and power in science. Often this is portrayed as something which disadvantaged women, but in this case Mantell is something of a victim too, as was William Smith as described in “The Map that Changed the World”. Mantell was desperate for recognition but held back by his full-time profession as a doctor in a minor town and his faith that his ability would lead automatically to recognition. Owen, on the other hand, with a similar background (and prodigious ability), went first to St Bartholomew’s hospital and then the Royal College of Surgeons, where he appears to have received better patronage but was also brutal and calculating in his ambition. Ultimately Owen over-reached himself in his scheming, and although he satisfied his desire to create a Natural History Museum, in death he left little personal legacy – his ability trumped by his dishonesty in trying to obliterate his opponents.

From a scientific point of view the thread of the book runs from the growing understanding of stratigraphy (i.e. the consistent sequence of rock deposits through Great Britain and into Europe); through the discovery of large fossil animals which had no modern equivalent, and of an increasing range of these prehistoric remnants, each with its place in the stratigraphy; to the synthesis of these discoveries in Darwin’s theory of evolution. Progress in the intermediate discovery of fossils was slow because, in contrast to the early fossils of marine species such as ichthyosaurus and plesiosaurus, which were discovered substantially intact, later fossils of large land animals were found fragmented in Southern England. This made it difficult to identify the overall size of such species, and even the number of species present in a pile of fossils.

These scientific discoveries collided with a social thread which saw the clergy, deeply involved in scientific discovery at the beginning, becoming increasingly discomforted as the account of the genesis of life in Scripture proved incompatible with the findings in the stone. This ties in with a scientific community trying to make their discoveries compatible with Scripture and what they perceived to be the will of God, with the schism between the two eventually coming to a head with the publication of Darwin’s Origin of Species.

Occasionally the author drops into a bit of first person narration, which I must admit to finding a bit grating, perhaps because for people long dead it is largely inference. I’d have been very happy to have chosen this book for a long journey or a holiday; I liked the wider focus on a story rather than an individual.

References

My Evernotes

Book review: The Eighth Day of Creation by Horace Freeland Judson

My reading moves seamlessly from the origins of cosmology (in Koestler’s Sleepwalkers) to the origins of molecular biology in “The Eighth Day of Creation” by Horace Freeland Judson. The book covers the revolution in biology starting with the elucidation of the structure of DNA through to how this leads to the synthesis, by organisms, of proteins. This covers a period from just before the Second World War to the early 1960s, although in the Epilogue and Afterwords Judson comments on the period up to the mid-nineties. Although the book does provide basic information on the core concepts (What is DNA? What is a protein?), I suspect it requires a degree of familiarity with these ideas to make much sense on a casual reading – the same applies to this blog post.

The first third or so of the book covers the elucidation of the structure of DNA. Three groups were working on this problem – that of Linus Pauling in the US, Franklin and Wilkins at King’s College in London, and Crick and Watson in Cambridge. Key to the success of Crick and Watson was their collaboration: a willingness to talk to people who knew stuff they needed to know, and piecing the bits together. The structural features of their model were the helix form (this wasn’t news), specific and strong hydrogen bonding between bases, and the presence of two DNA chains (running in opposite directions). On the whole this wasn’t a new story to me, although I wasn’t familiar with the surrounding work which established DNA as the genetic material. Judson returns to the part of Rosalind Franklin in the discovery in one of the Afterwords. It has been said that Franklin was greatly wronged over the discovery of the structure of DNA, but Judson does not hold this view and I tend to agree with him. The core of the problem is that the Nobel Prize is not awarded posthumously, and with her death at 37 from cancer, Franklin therefore missed out. Watson’s book The Double Helix was a rather personalised view of the characters involved, most of whom were alive to carry out damage limitation, whilst Franklin was not – so here she was poorly treated, but by Watson rather than a whole community of scientists. Perhaps the thing that said the most to me about the situation is that after she was diagnosed with cancer she stayed with the Cricks at their home.

In parallel with the elucidation of the structure of DNA, work had been ongoing on understanding protein synthesis and genetics in viruses and bacteria. This included how information was coded into DNA, with much effort expended in trying to establish overlapping codes. There are 20 amino acids and four bases in DNA, so three base pairs are required to specify an amino acid if the amino acid sequence is to be unconstrained. It was conceivable that two consecutive amino acids are coded by fewer than six base pairs, but in that case there is a restriction on the possible amino acid sequences. This area was initiated by the physicist George Gamow. I struggle a bit to see how it gained so much traction: this type of model was quickly ruled out by consideration of the amino acid sequences that were being established for proteins at the time. It turns out that amino acids are coded by three consecutive base pairs with redundancy (so several different base pair triplets code for the same amino acid). Also covered was the mechanism by which data passed from DNA to the ribosomes where protein synthesis takes place; important here are adaptor molecules which carry the appropriate amino acid to the site of synthesis.
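The counting argument in the paragraph above is short enough to check by machine. The sketch below finds the smallest code length that can label 20 amino acids with four bases, and then shows why an overlapping code constrains sequences. The overlapping scheme used here, where each triplet is shifted by one base from the previous one, is a simplified illustration of the idea and is not claimed to be Gamow’s exact proposal.

```python
from itertools import product

BASES = "ACGT"
AMINO_ACIDS = 20


def min_code_length(n_bases=len(BASES), n_targets=AMINO_ACIDS):
    """Smallest number of bases per codon giving at least n_targets codons."""
    length = 1
    while n_bases ** length < n_targets:
        length += 1
    return length


def overlapping_successors(codon):
    """Triplets that can follow `codon` if successive triplets share two bases."""
    return [codon[1:] + base for base in BASES]


TRIPLETS = ["".join(t) for t in product(BASES, repeat=3)]

print(min_code_length())                    # 4**2 = 16 is too few, 4**3 = 64 is enough
print(len(TRIPLETS))                        # all possible triplets
print(overlapping_successors("ACG"))        # only these can come next when overlapping
```

With non-overlapping triplets any of the 64 codons can follow any other, so every amino acid sequence is possible, and the surplus of 64 codons over 20 amino acids leaves room for the redundancy mentioned above. In the overlapping scheme each triplet permits only four successors, which is the kind of restriction that real protein sequences turned out not to obey.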

Compared to the structure of DNA this work was a long difficult slog, involving intricate experiments with bacteria, bacteriophage viruses, bacterial sex, ultracentrifugation, chromatography and radiolabelling.

The final part of the book is on the elucidation of the structure of proteins. This was done using x-ray crystallography, with the very first clear scattering patterns measured in the 1930s and the first full elucidation made in the late fifties. X-ray crystallography of proteins, which contain many thousands of atoms, is challenging. Fundamentally there is an issue, the “phase problem”, which means you don’t have quite enough information to determine the structure from the scattering pattern. This issue was resolved by heavy atom labelling: you chemically attach a heavy atom such as mercury to your protein, then compare the scattering pattern of this modified protein with that of the unmodified protein, which resolves the phase problem. Nowadays measuring the thousands of spots in an x-ray scattering pattern and carrying out the thousands and thousands of calculations required to resolve the structure is relatively straightforward, but in the early days it was a massive manual labour.

As well as resolving structure, a key discovery was made regarding the mode of action of proteins: essentially they work as adaptors between chemically distinct systems. When a molecule binds to one site on a protein it affects the ability of another type of molecule to bind to another site on the protein, through changes in the protein structure induced by the first molecule’s binding. This feature opens up huge possibilities for cell biology: in its absence, interactions between chemical systems can only occur if the participants in those systems interact with each other chemically.

It isn’t something I’d really appreciated properly, but molecular biologists are quite organised in the organisms that they generally agree to work on. The truth is that there are uncountably many viruses and other organisms, and so to aid the progress of science one needs to select which ones to study: E. coli, the T series bacteriophages, C. elegans, D. melanogaster and, more recently, the zebrafish. They almost play the part of an extra author.

Molecular biology was apparently dominated by physicists. I must admit I found this confusing in the past, but Judson highlights the field as defined by its practitioners: biochemistry is about energy and matter (and typically small molecules), molecular biology is about information (and typically macromolecules) – a more natural home for physicists.

I found the first and third parts an enjoyable read; my scientific background is in scattering, so the technical material was at least familiar. The central section on genetics I found fascinating but a bit of a slog. I’m somewhat in awe of the complexity of the experiments (and their apparent difficulty).

Looking back on my earlier book reviews, I read my comment on R.J. Evans’s book on historiography that history is a literary exercise as well as anything else. As a trained scientist this was something of an alien concept, but in common with Koestler’s book the style of this book shines through.

Footnotes

My Evernotes

Book review: The Sleepwalkers: A History of Man’s Changing Vision of the Universe by Arthur Koestler

Another result of my plea for reading suggestions on Twitter; this is a review and summary of Arthur Koestler’s book “The Sleepwalkers: A History of Man’s Changing Vision of the Universe”. The book is a history of cosmology running from Pythagoras, in the 6th century BC, to Galileo, who spanned the end of the 16th century, just touching lightly on Newton. It traces a revolution from a time when the cosmos, beyond the earth, was considered different, stable and perfect, to a time when it was shown to be subject to earthly physics, changeable and not perfect by any reasonable definition.

Kuhn’s language of paradigm shifts seems rather overused to me, but here is an example of a true paradigm shift. The “sleepwalkers” of the title refers to the idea that the protagonists didn’t really know where they were headed with their ideas and quite often were lucky with errors which cancelled each other out.

The book starts with a cursory look at Babylonian and early Greek astronomy; despite considerable observational acumen, their models of the universe were outright mythical. The Pythagorean Brotherhood, although in many senses still mystical, started to think about the physics of the universe. I have a tendency to think of the ancient Greeks as one blob, but as the book makes clear there is a huge span of time, and outlook, between Pythagoras, Plato, Aristotle and Ptolemy. Koestler is quite clearly disappointed with the Greeks: they make a promising start with Pythagoras, Aristarchus develops a heliocentric model for the solar system, and then with Plato, Aristotle and Ptolemy they regress to a geocentric model.

Following on from the Greeks, the Middle Ages are covered. James Hannam, in his book “God’s Philosophers”, has covered why this period wasn’t all that bad in terms of intellectual development. Koestler is less sympathetic: his key accusations are that the philosophers of the Middle Ages were in thrall to the later Greeks and that there were elements of Christian theology that abjured the pleasure of knowledge for knowledge’s sake.

After these preliminaries, Koestler turns to the core of his work: the cosmological developments of Copernicus, Tycho Brahe, Johannes Kepler and Galileo Galilei.

The model of the universe handed down from the ancient Greeks was one of circles (often referred to in this context as epicycles). They believed that motion in a circle was perfect, that the heavens were a separate, perfect realm and that therefore all motion in the heavens must be based on circular motion. Further, the model dominating at the end of their period held that the earth lay at the centre of these circular motions. The only problem with this model is that it doesn’t fit well the observed motions of the sun, moon, Mercury, Venus, Mars, Jupiter and Saturn – the observable solar system which lay against an unchanging starry background. Or rather, you can get a rough fit at the expense of stacking together a great number of epicycles – something like 50.

Copernicus’ contribution, published on his death in 1543, was to put the sun back at the centre of the universe. Copernicus led a rather uneventful life, was no sort of astronomical observer and only published his thesis at the end of his life at the strong urging of Georg Joachim Rheticus. He’d discussed his model fairly freely during his life, and his reasons for not publishing were more to do with fear of ridicule from his contemporaries than with theological pressure. After his death his work, with the exception of the astronomical tables, sank into obscurity, partly because it was a difficult read and partly because he managed to ostracise his former cheerleader, Rheticus. Copernicus’ model still holds to the epicycles of the Greeks, and only marginally reduces the complexity of the model.

Next up comes Johannes Kepler, interspersed with Tycho Brahe. Brahe was an astronomical observer and nobleman, funded very well by the Danish king; given his own island Hveen where he built his observatory. As a keen astrologer he began his observation programme when he found a conjunction of Jupiter and Saturn was poorly predicted by current astronomical tables – how can you cast an accurate fortune under these circumstances?

Kepler was a theoretician rather than an observer but also a keen astrologer. I emphasise this because these days astrology is not held in high regard, but it is the father of observational astronomy. He had started to develop a model of the solar system based on the Platonic solids – something of a mystical exercise – but realised he needed better data to support his model. Brahe was the man with the data. Kepler was only just in time though: he travelled to work with Brahe when Brahe moved to Prague, and less than 2 years later Brahe was dead. Nowadays we know Kepler for his three laws of planetary motion, although it’s worth noting that Kepler’s laws are labelled retrospectively.

Kepler left copious records of his progress, which Koestler traces in great detail. His struggle to recognise that planetary orbits were ellipses was heroic and has something of a pantomime air to it – “They’re right in front of you!”. His approach was unprecedented in the sense that he sought to accurately model the very best, most recent measurements. Kepler also made some attempts at a physical model to describe the motions, but ultimately he is remembered for the detailed description of their motion. Since it is not central to his theme, Koestler makes only passing reference to Kepler’s work on optics.

The penultimate figure in the story is Galileo; despite Kepler’s best efforts, Galileo pretty much ignored him. Galileo gets quite short shrift from Koestler, who feels that he brought his troubles with the Catholic Church upon himself. Reading this account, Koestler’s position is not unreasonable. Galileo’s two big contributions to the story are his promotion and use of the telescope, and his work on the motion of terrestrial bodies, the generalisation of which and application to the solar system was Newton’s great triumph. Cosmologically he was only later in his life a supporter of the somewhat retro Copernican model, which was a cul-de-sac in terms of theoretical developments. At the time the Catholic Church, particularly the Jesuits, were interested in astronomy and not particularly hardline about the interpretation of Scripture to fit observations. Galileo wound them up both by claiming all newly observed celestial phenomena as his own and by putting the words of the Pope in the mouth of an idiot in one of his Dialogues.

This highlights two of the wider themes that Koestler brings to his book. At one point he describes his cast of characters as “moral dwarves”; he states this is relative to their scientific achievements but returns to this theme in the epilogue, where he feels that our scientific developments have not been matched by our spiritual development. The second is the schism between science and the Church that began in this period. Koestler seems to put much of the blame for this on Galileo’s head, feeling that it was by no means inevitable. In the epilogue he also draws a comparison between biological evolution and scientific developments, highlighting specifically that there are long periods of not that much happening and many diversions from the “true” path.

The book finishes with a brief mention of Newton’s synthesis of Kepler’s laws and Galileo’s dynamics to produce a model of the solar system which is close to that which we hold today.

This really is a rollicking good read! This is a relatively old book, published in 1959, and one might anticipate that it has not fully caught up with modern historiography; however, a brief look around the internet suggests that he is not criticised in any great sense. Koestler does tend to focus on a limited number of “great” individuals and goes for “firsts”, but this perhaps is what makes it a good read.

Footnotes

My Evernotes for the book are here, last page of the book at the top!