Category: Book Reviews

Reviews of books featuring a summary of the book and links to related material

Book review: The Son also Rises by Gregory Clark

The Son also Rises by Gregory Clark is a book about social mobility, as traced through surnames. Clark prefaces his work by saying that what he is to say might be considered radical and controversial. Other studies of social mobility have found modest “inheritability” between generations. This study finds high levels of inheritability spanning hundreds of years.

The theme for the early chapters is to find some source of high status individuals – be it graduation from prestigious universities such as Oxford, Cambridge or the American Ivy League, membership of professional bodies such as those for doctors or attorneys or from financial records such as occasional tax releases or records of wills (probate). Next a cohort of names is tracked through these systems and their level of incidence is compared against the background level of incidence for that surname. For example, “Smythe” is a relatively rare surname in the general UK population but it is found at a much higher level in records of registered doctors.

The selected cohort of surnames may be from a distinctive ethnic population, e.g. Japanese in America, Native Americans or French settlers. Or it may be selected from a set of high status individuals at a point in time, e.g. the Normans who came to England with William the Conqueror, or Swedish nobles.

Clark’s discovery is that for all of these many cohorts across multiple measures of status the persistence over time is strong. The Smythes of 200 years ago had relatively high status then and they still do now. After nearly 1000 years those with surnames associated with the Norman conquest are still a little over-represented in the intake of Oxford and Cambridge Universities. Similar behaviour is found for low status groups: Baldrick’s character through the several series of Blackadder is not far from the truth. In both cases these groups are “regressing towards the mean” but it is a long, slow process.

Following these initial demonstrations of social mobility, Clark states his general law, which is that the correlation of status over generations is high compared to previously measured parent-child correlations and remarkably constant across multiple countries, periods in history and cohorts. The magic number for the correlation is 0.75. He argues that his estimate is higher than others because he models social status as an underlying persistent component plus a random fluctuation; the methods of calculation used for earlier figures make this random fluctuation much more apparent, which brings down the measured correlation. I don’t feel he demonstrates the origin of this discrepancy very clearly.
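To make the argument about random fluctuation more concrete, here is a minimal simulation sketch in Python. It is my own reading of the model as described above, not Clark's calculation: each family has an underlying status passed on with persistence 0.75, and any single measured indicator (income, say) is that underlying status plus independent noise. The noise level, the number of families and all the names are assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(persistence=0.75, noise_sd=1.0, families=20000, seed=0):
    """Return (underlying, measured) parent-child correlations.

    Assumed model: child = persistence * parent + innovation, with the
    innovation scaled so underlying status keeps unit variance, and each
    observed indicator = underlying status + independent noise.
    """
    rng = random.Random(seed)
    innovation_sd = (1 - persistence ** 2) ** 0.5
    parent_true, child_true, parent_obs, child_obs = [], [], [], []
    for _ in range(families):
        parent = rng.gauss(0, 1)
        child = persistence * parent + rng.gauss(0, innovation_sd)
        parent_true.append(parent)
        child_true.append(child)
        parent_obs.append(parent + rng.gauss(0, noise_sd))
        child_obs.append(child + rng.gauss(0, noise_sd))
    return pearson(parent_true, child_true), pearson(parent_obs, child_obs)


underlying_corr, measured_corr = simulate()
print(f"underlying correlation: {underlying_corr:.2f}")
print(f"measured correlation:   {measured_corr:.2f}")
```

With noise of the same variance as the underlying status, theory says the measured correlation should be about half the underlying one (0.75 × 1/2), which is in the range that conventional parent-child studies report. Whether real indicators are that noisy is exactly the point I would have liked the book to demonstrate more clearly.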

Subsequent chapters go on to look at some cases where one really expects deviations from this general rule: the Indian caste system, where low mobility is expected, and China, where the post-revolution period is expected to be a time of high social mobility. It turns out that in India, despite laws aimed at reducing caste-based discrimination, social mobility has not improved dramatically. In China social mobility seems to have been little bothered by the revolution. The odd groups that do break the rule of constant social mobility seem to do so by preferential recruitment, e.g. in the past in Muslim countries non-Muslims were tolerated but charged a poll tax, which meant that lower status/income people were more likely to convert to Islam, leaving a more persistently high status non-Muslim population. A second route is by strong preference for “in group” marriage, which is seen in the Indian Brahmin caste. It turns out that the surnames identified with British parliamentarians are particularly immobile.

As for the origin of this constant social mobility, Clark ascribes it to what he calls “social competence”. There is a confused discussion of the balance of nature and nurture, not helped by a table where nature and nurture headings are accidentally swapped (I think). I believe that technically it is all nurture, and Clark is trying to work out whether it is all about money. It strikes me that your wider family is where you learn about what the possibilities are for you and, while every family has its black sheep, the fact that your father and two out of three uncles went to Cambridge University means that your expectation is that you should aspire to that. Your family sets what is “normal”.

I suspect that this is particularly the case for British parliamentarians, where there seem to be a lot of sibling (Milibands, Johnsons, Eagles), husband-wife (Cooper/Balls) and parent-child (Kinnock, Benn) combinations. Being a politician is an odd sort of job: there is not really a class at school for it, so seeing your family working in the “family business” must be a big influence.

“The Son also Rises” is an interesting read but turning it into a 300 page book seems to belabour the point somewhat. I liked the incidental details of the origins of surnames, and the various sources of information on social status.

I got this as a Kindle edition but I wish I’d bought it as a paperback: there are numerous figures, tables and equations which didn’t render at a reasonable size in the first instance.

Book review: The Values of Precision edited by M. Norton Wise

The Values of Precision edited by M. Norton Wise is a collection of essays from the Princeton Workshop in the History of Science held in the early 1990s.

The essays cover the period from the mid-18th century to the early 20th century. The early action is in France and moves to Germany, England and the US as time progresses. The topics vary widely, starting with population censuses, then moving on to measurement standards both linear and electrical, calculating methods and error analysis.

I’ve written some notes on each essay, skip to the end of the bullet points if you want the overview:

  • The first article is about the measurement of population, mainly in pre-revolutionary France. This was spurred by two motivations: firstly, monarchs were increasingly seeing the number of their subjects as a measure of their power and secondly, there was a concern that France was experiencing depopulation. In the 17th century the systematic recording of births, deaths and marriages was mandated by royal direction. In the period after this populations were estimated either from a count of “hearths” or from the number of births, the idea being that you could take either of these indirect measures and multiply it by some factor to get a true measure of population.
  • The second article is by Ken Alder, he of “The Measure of All Things”, and is another trip to revolutionary France and its efforts to introduce a metric system of measurement. The revolutionary attempt failed, but the system of standards it created prevailed in the middle of the 19th century, though not without some effort. Alder highlights the resistance of France to metrification, and also how the revolution bred a will to introduce a rational system based on natural measurements rather than a physical object created by man. He also discusses some of the benefits of the pre-metric system: local control, the ability for workers to take a cut without varying price, connection to effort expended/quality. This last arose because land was measured in terms of the amount of grain used to seed it or the area one person could harvest in a day – this varies with the quality of the land.
  • Jan Golinski writes on Lavoisier (again from France at the turn of the Revolution) regarding “exactness” and its almost political nature. Lavoisier made much of his exact measurements in the determination of the masses of what are now called hydrogen and oxygen in producing a known mass of water. This caused some controversy since other experimenters of the time saw his claims of exactness in measurement as being misused in support of his theory for chemical reactions. There were reasons to be sceptical of some of his claims: he often cited weighed amounts to more significant figures than were justified by the precision of his measurements, and there are signs his recorded measurements are a little too good to be true. These could be seen as the birthing pains of a new way of doing science which didn’t just apply to chemical measurements of the time, but also to surveying and the measurement of population. These days the inappropriateness of quoting more significant figures than are justified by the measurement is drummed into students at an early age.
  • Next we move from France to Germany and a discussion of the method of least squares, and the authority of measurements by Kathryn M. Olesko. Characters such as Legendre and Laplace had started to put the formal analysis of error and uncertainty in measurement on the map. This work was carried forward by Gauss with the method of least squares; essentially this says that the “true” value of a measurement is that which minimises the sum of the squared differences between it and all the measurements made of that value. It is an idea related to probability, and it is still deeply embedded in how we make measurements today and also how we compare measurement to theory. In common with events in France, the drive for better measurement in Germany came with a drive to standardise weights and measures for the purposes of trade. The action here takes place in the first half of the 19th century.
  • The trek through the 19th century continues with Simon Schaffer’s essay on the work in England and Germany on electrical units with a particular view to establishing whether the speed of light and the speed of propagation of electromagnetic waves were the same. This involved the standardisation of units of electrical resistance. It was work that went on for some time. Interesting from a practising scientist’s point of view was the need for the bench scientist and instrument makers to work closely together.
  • The next chapter is a step away from the physical sciences with a look at life insurance and the actuarial profession in the first half of the 19th century. Theodore Porter describes the attitude of this industry to precision and calculation, noting that they fended off attempts to regulate the industry too tightly by arguing that their business could not be reduced to blind calculation. The skill, judgement and character of the actuary were important.
  • The Image of Precision is about Helmholtz’s work on muscle physiology in around 1850. He used an apparatus which showed the extension of a muscle graphically following stimulation, and measured the speed of nerve impulses using similar methods. The graphical method was in some senses less precise than an alternative method but it was a more compelling explanatory tool and provided for better understanding of the phenomena under study.
  • Next up is a discussion of the introduction of so-called “direct-reading” ammeters and voltmeters by Ayrton and Perry around 1870. This was an area of some dispute, with physicists insisting that determinations of volts and amps be made by reference to the basic units of length, time and mass. Ayrton and Perry were interested in training electrical engineers whose measurements would be made in environments not conducive to these physicist-preferred measurements: not conducive in both a technical sense (stray magnetic fields, vibration and so forth) and a practical sense (an answer within 1 percent in 10 minutes was far superior to one within 0.5 percent in 2 hours).
  • As we approach the end of the book we learn of Henry Rowland, and his diffraction gratings, made at Johns Hopkins University. Rowland had toured Europe, and on his return set to making high quality diffraction gratings to measure optical spectra. This is a challenging technical task: to be useful a diffraction grating needs many very closely spaced lines of the same profile. Rowland sent out his diffraction gratings for a nominal price, making no profit, but did not reveal the details of his methods. It took many years for his work to be bettered, and longer yet for better diffraction gratings to be available generally.
  • The collection finishes with the construction of mathematical tables, starting with a somewhat philosophical discussion of the limits of calculation but moving on to more pragmatic issues of the calculation and sharing of tables. The need for these tables came originally from the computationally intensive calculations for determining longitude by the method of lunar distances. The 19th century saw the growth of mathematical analysis in a range of areas, spreading the need to make mathematical tables. Towards the end of the century machine calculation was used to help build these tables, and do the analysis they supported. Students of my generation will likely just about remember using tables of trigonometric and other functions; these days in my practical work they are entirely replaced by computer calculations done on demand.
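As an aside on the least squares essay, the core idea is easy to sketch in a few lines of Python. This is a toy illustration of my own with made-up readings, not anything taken from the book: for repeated measurements of a single quantity, the value that minimises the sum of squared differences turns out to be the ordinary arithmetic mean. The measurement values and the brute-force search grid are assumptions for the example.

```python
def sum_squared_residuals(estimate, measurements):
    """Sum of squared differences between one candidate value and the data."""
    return sum((m - estimate) ** 2 for m in measurements)


# Hypothetical repeated readings of the same length, in arbitrary units.
measurements = [9.8, 10.1, 10.0, 9.9, 10.2]

# Closed-form least squares answer for a single quantity: the mean.
mean = sum(measurements) / len(measurements)

# Brute-force check: scan candidate values from 9.5 to 10.5 in steps of 0.001
# and keep the one with the smallest sum of squared residuals.
candidates = [9.5 + i * 0.001 for i in range(1001)]
best = min(candidates, key=lambda c: sum_squared_residuals(c, measurements))

print(f"mean of measurements:    {mean:.3f}")
print(f"least squares candidate: {best:.3f}")
```

The brute-force scan is only there to show that the minimum really does sit at the mean; in practice nobody would search for it this way.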

There is a lot in here which will speak to those with a training in science, physics in particular. We will recognise the techniques discussed and the concerns of the day from our own training. The essays hold a slight distance from practitioners in these arts but that brings the benefit of a different view, core to which is the way in which precision in measurement is a social as well as technical affair. To propagate standards of measurement requires the community to build trust in the work of others; this does not happen automatically.

I like this style of presentation, each essay has its own character and interest. The range covered is much larger than one might find in a book length biography, and there is a degree of urgency in the authors getting their key points across in the space allocated.

In this book the various chapters do not overlap in their topics and cover a substantial period in time and space with the editor providing some short linking chapters to tie things together. All in all very well done.

Book Review: Stargazers–Copernicus, Galileo, the Telescope and the Church by Allan Chapman

It’s been a while since my last book review here but I’ve just finished reading Stargazers: Copernicus, Galileo, the Telescope and the Church by Allan Chapman.

The book covers the period from the end of the 16th century, the time of Copernicus and Tycho Brahe, to the early 18th century and Bradley’s measurement of stellar aberration passing Galileo, Newton and others on the way. Conceptually this spans the full transition from a time when people believed in a Classical universe with earth at its centre, and stars and planets plastered onto crystal spheres, to the modern view of the solar system with the earth and other planets orbiting the sun.

This development parallels that in Arthur Koestler’s classic book “The Sleepwalkers”; however, Chapman’s style is much more readable, and his coverage is broader but not so deep. Chapman introduces a wealth of little personal anecdotes and experiments. For instance on visiting Tycho Brahe’s island observatory he recounts a meeting with a local farmer who had in his living room a marked stone from Brahe’s observatory (which had been dismantled by the locals on Brahe’s death). Brahe was hated by his tenants for his treatment of them, a hate that was handed down through the generations. Illustrations are provided in the author’s own hand, which is surprisingly effective. He discusses his own work in reconstructing historical apparatus and observations.

Astronomy was an active field from well before the start of this period for a couple of reasons. Firstly, astrology had been handed down from Classical times as a way of divining the future. It was believed that to improve the accuracy of astrological predictions better data on the locations of heavenly bodies over time was required. Secondly, the Christian Church required accurate astronomical measurement to determine when Easter fell, across increasingly large spans of the Earth.

The period covered by the book marks a time when new technology made increasingly accurate measurements of the heavens possible, and the telescope made features such as mountains on the moon, sunspots and the moons of Jupiter visible for the first time. Galileo was a principal protagonist in this revolution.

Amongst scientists there is something of the view that the Catholic Church suppressed scientific progress, with Galileo the poster boy for the scientists’ case. Historians of science don’t share this view, and haven’t for quite some time. Looking back on Sleepwalkers, written in 1959, I noted the same thing – the historians’ view is generally that Galileo brought it on himself in the way he dismissed those that did not share his views in rather offensive terms. Galileo lived in a time when the well-entrenched Classical view of the universe was coming under increased pressure from new observations using new instruments. In some senses it was the collision with the long-held Classical view of the universe which led to his problems, the Church being more committed to this Classical view of the physical universe than to anything proposed in Scripture.

The role of the Church in promoting and fostering science is something Chapman returns to frequently – emphasising the scientific work that members of the Church did, and also the often good relationships that lay “scientists” of different faiths had with Church authorities.

Chapman introduces some of the lesser known English (and Welsh) contributors to the story: Harriot, who made the earliest known sketches of the moon; the Lancashire astronomers, who made the first observations of the transit of Venus; and John Wilkins, whose meetings were to lead to the foundation of the Royal Society. He also notes the precedent of the Royal College of Physicians, formed in 1518. The novelty of the Royal Society when compared with earlier organisations of similar character was that the Fellows were responsible for new appointments, rather than them being imposed by a patron. This seems to have been an English innovation, repeated in the Oxbridge colleges, and Guilds.

Relating to these English astronomers was the development of precision instruments in England. This seems to have been spurred by the Dissolution of the monasteries. The glut of land, seized by Henry VIII, became available to purchase. The purchase of land meant a requirement for accurate surveying, and legal documents. Hence an industry was born of skilled men wielding high technology to produce maps.

I was distracted by the presence of Martin Durkin in the acknowledgements to this book. He was the architect of the “polemical” Channel 4 documentary “The Great Global Warming Swindle”, so it cast doubt in my mind as to whether I should take this book seriously. On reflection Chapman’s position as presented in this book seems respectable, but it is interesting how a short statement in the acknowledgements made me consider this more deeply.

Overall, Stargazers is rather more readable than Sleepwalkers, not quite so single-tracked in its defence of the Catholic Church as God’s Philosophers and a different proposition to Fred Watson’s book of the same name, which is all about telescopes.

Book review: Docker Up & Running by Karl Matthias and Sean P. Kane

This review was first published at ScraperWiki.

This last week I have been reading Docker Up & Running by Karl Matthias and Sean P. Kane, a newly published book on Docker – a container technology which is designed to simplify the process of application testing and deployment.

Docker is a very new product, first announced in March 2013, although it is based on older technologies. It has seen rapid uptake by a number of major web-based companies who have open-sourced their tooling for using Docker. We have been using Docker at ScraperWiki for some time, and our most recent projects use it in production. It addresses a common problem for which we have tried a number of technologies in search of a solution.

For a long time I have thought of Docker as providing some sort of cut down virtual machine; from this book I realise this is the wrong mindset – it is better to think of it as a “process wrapper”. The “Advanced Topics” chapter of this book explains how this is achieved technically. This makes Docker a much lighter weight, faster proposition than a virtual machine.

Docker is delivered as a single binary containing both client and server components. The client gives you the power to build Docker images and query the server which hosts the running Docker images. The client part of this system will run on Windows, Mac and Linux systems. The server will only run on Linux due to the specific Linux features that Docker utilises in doing its stuff. Mac and Windows users can use boot2docker to run a Docker server; boot2docker uses a minimal Linux virtual machine to run the server, which removes some of the performance advantages of Docker but allows you to develop anywhere.

The problem Docker and containerisation are attempting to address is that of capturing the dependencies of an application and delivering them in a convenient package. It allows developers to produce an artefact, the Docker image, which can be handed over to an operations team for deployment without to-ing and fro-ing to get all the dependencies and system requirements fixed.

Docker can also address the problem of a development team onboarding a new member who needs to get the application up and running on their own system in order to develop it. Previously such problems were addressed with a flotilla of technologies with varying strengths and weaknesses, things like Chef, Puppet, Salt, Juju, virtual machines. Working at ScraperWiki I saw each of these technologies causing some sort of pain. Docker may or may not take all this pain away but it certainly looks promising.

The Docker image is compiled from instructions in a Dockerfile which has directives to pull down a base operating system image from a registry, add files, run commands and set configuration. The “image” language is probably where my false impression of Docker as virtualisation comes from. Once we have made the Docker image there are commands to deploy and run it on a server, inspect any logging and do debugging of a running container.

Docker is not a “total” solution: it has nothing to say about triggering builds, bringing up hardware or managing clusters of servers. At ScraperWiki we’ve been developing our own systems to do this, which is clearly the approach that many others are taking.

Docker Up & Running is pretty good at laying out what it is you should do with Docker, rather than what you can do with Docker. For example the book makes clear that Docker is best suited to hosting applications which have no state. You can copy files into a Docker container to store data but then you’d need to work out how to preserve those files between instances. Docker containers are expected to be volatile – here today gone tomorrow or even here now, gone in a minute. The expectation is that you should preserve state outside of a container using environment variables, Amazon’s S3 service or an externally hosted database etc – depending on the size of the data. The material in the “Advanced Topics” chapter highlights the possible Docker runtime options (and then advises you not to use them unless you have very specific use cases). There are a couple of whole chapters on Docker in production systems.
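As a small illustration of keeping state out of the container, here is a sketch in Python of an application picking up its configuration from environment variables rather than from files baked into the image. This is my own example rather than one from the book, and the variable names (DATABASE_URL, S3_BUCKET) and default values are hypothetical.

```python
import os


def load_settings(environ=None):
    """Read settings from the environment, with fallbacks for local use.

    The variable names are made up for this example; a real application
    would define its own.
    """
    env = os.environ if environ is None else environ
    return {
        # Where persistent data lives: an externally hosted database.
        "database_url": env.get("DATABASE_URL", "sqlite:///:memory:"),
        # Where larger files would be kept, e.g. the name of an S3 bucket.
        "bucket": env.get("S3_BUCKET", ""),
    }


# Passing a plain dict stands in for what the container runtime would inject.
settings = load_settings({"DATABASE_URL": "postgres://db.example/app"})
print(settings["database_url"])
```

The container itself then holds nothing worth keeping, so it can be thrown away and replaced at will, with the same image pointed at a different database in testing and in production.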

If my intention was to use Docker “live and in anger” then I probably wouldn’t learn how to do so from this book since the landscape is changing so fast. I might use it to identify what it is that I should do with Docker, rather than what I can do with Docker. For the application side of ScraperWiki’s business the use of Docker is obvious; for the data science side it is not so clear. For our data science work we make heavy use of Python’s virtualenv system, which captures most of our dependencies without being opinionated about data (state).

The book has information in it up until at least the beginning of 2015. It is well worth reading as an introduction and overview of Docker.

Book Review: Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia

This post was first published at ScraperWiki.

Apache Spark is a system for doing data analysis which can be run on a single machine or across a cluster. It is pretty new technology – initial work was in 2009 and Apache adopted it in 2013. There’s a lot of buzz around it, and I have a problem for which it might be appropriate. The goal of Spark is to be faster and more amenable to iterative and interactive development than Hadoop MapReduce, a sort of IPython of Big Data. I used my traditional approach to learning more: buying a dead-tree publication, Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia, and then reading it on my commute.

The core of Spark is the resilient distributed dataset (RDD), a data structure which can be distributed over multiple computational nodes. Creating an RDD is as simple as passing a file URL to a constructor (the file may be located on some Hadoop-style system) or parallelizing an in-memory data structure. To this data structure are added transformations and actions. Transformations produce another RDD from an input RDD, for example filter() returns an RDD which is the result of applying a filter to each row in the input RDD. Actions produce a non-RDD output, for example count() returns the number of elements in an RDD.
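The split between transformations and actions is easier to see in code. The sketch below is a toy stand-in written in plain Python, not Spark itself and not its API: the ToyDataset class and the sample log lines are made up for illustration. Transformations only record what should be done and return a new dataset; nothing is computed until an action asks for a result.

```python
class ToyDataset:
    """Toy stand-in for an RDD: a source plus a chain of lazy steps."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new dataset, do no work yet.
    def filter(self, predicate):
        return ToyDataset(self._source, self._steps + (("filter", predicate),))

    def map(self, func):
        return ToyDataset(self._source, self._steps + (("map", func),))

    def _evaluate(self):
        """Run every recorded step over each element of the source."""
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the computation and return a non-dataset result.
    def count(self):
        return sum(1 for _ in self._evaluate())

    def collect(self):
        return list(self._evaluate())


lines = ToyDataset(["INFO start", "ERROR disk full", "INFO done", "ERROR timeout"])
errors = lines.filter(lambda line: line.startswith("ERROR"))  # transformation
print(errors.count())                   # action
print(errors.map(str.lower).collect())  # transformation, then action
```

In PySpark the equivalent starting points would be along the lines of sc.textFile() for a file or sc.parallelize() for an in-memory list, followed by the same filter() and count() calls. The real thing also spreads the work over the nodes of a cluster, which this toy makes no attempt to do.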

Spark provides functionality to control how parts of an RDD are distributed over the available nodes, for example by key. In addition there is functionality to share data across multiple nodes using “Broadcast Variables”, and to aggregate results in “Accumulators”. The behaviour of Accumulators in distributed systems can be complicated since Spark might preemptively execute the same piece of processing twice because of problems on a node.

In addition to Spark Core there are Spark Streaming, Spark SQL, MLlib machine learning, GraphX and SparkR modules. Learning Spark covers the first three of these. The Streaming module handles data such as log files which are continually growing over time using a DStream structure, which is composed of a sequence of RDDs with some additional time-related functions. Spark SQL introduces the DataFrame data structure (previously called SchemaRDD) which enables SQL-like queries using HiveQL. The MLlib library introduces a whole bunch of machine learning algorithms such as decision trees, random forests, support vector machines, naive Bayesian and logistic regression. It also has support routines to normalise and analyse data, as well as clustering and dimension reduction algorithms.

All of this functionality looks pretty straightforward to access; example code is provided for Scala, Java and Python. Scala is a functional language which runs on the Java virtual machine so appears to get equivalent functionality to Java. Python, on the other hand, appears to be a second class citizen. Some functionality, particularly in I/O, is missing Python support. This does raise the question as to whether one should start analysis in Python and make the switch as and when required, or whether to start in Scala or Java where you may well be forced anyway. Perhaps the intended usage is Python for prototyping and Java/Scala for production.

The book is pitched at two audiences, data scientists and software engineers, as is Spark. This would explain support for Python and (more recently) R, to keep the data scientists happy, and Java/Scala for the software engineers. I must admit looking at examples in Python and Java together, I remember why I love Python! Java requires quite a lot of class declaration boilerplate to get it into the air, and brackets.

Spark will run on a standalone machine; I got it running on Windows 8.1 in short order. Analysis programs appear to be deployable to a cluster unaltered, with the changes handled in configuration files and command line options. The feeling I get from Spark is that it would be entirely appropriate to undertake analysis with Spark which you might do using pandas or scikit-learn locally, and if necessary you could scale up onto a cluster with relatively little additional effort rather than having to learn some fraction of the Hadoop ecosystem.

The book suffers a little from covering a subject area which is rapidly developing: Spark is at version 1.4 as of early June 2015, the book covers version 1.1, and things are happening fast. For example GraphX and SparkR, more recent additions to Spark, are not covered. That said, this is a great little introduction to Spark, and I’m now minded to go off and apply my new-found knowledge to the Kaggle – Avito Context Ad Clicks challenge!