Book review: Big data by Viktor Mayer-Schönberger and Kenneth Cukier

This review was first published at ScraperWiki.

We hear a lot about “Big Data” at ScraperWiki. We’ve always been a bit bemused by the tag since it seems to be used indiscriminately. Just what is big data and is there something special I should do with it? Is it even a uniform thing?

I’m giving a workshop on data science next week and one of the topics of interest for the attendees is “Big Data”, so I thought I should investigate in a little more depth what people mean by “Big Data”. Hence I have read Big Data by Viktor Mayer-Schönberger and Kenneth Cukier, subtitled “A Revolution That Will Transform How We Live, Work and Think” – chosen for the large number of reviews it has attracted on Amazon. The subtitle is a guide to their style and exuberance.

Their thesis is that we can define big data, in contrast to earlier “little data”, by three features (plus an implicit fourth):

  • It’s big but not necessarily that big: their definition of big is that n = all. That is to say, in some domain you take all of the data you can get hold of. They use as one example a study of bout fixing in sumo wrestling, based on data on 64,000 bouts – which would fit comfortably into a spreadsheet. Other data sets discussed are larger, such as credit card transaction data, mobile telephony data, Google’s search query data…;
  • Big data is messy, it is perhaps incomplete or poorly encoded. We may not have all the data we want from every event, it may be encoded using free text rather than strict categories and so forth;
  • Working with big data we must discard an enthusiasm for causality and replace it with correlation: we shouldn’t mind too much if our results are just correlations rather than explanations (causation);
  • An implicit fourth element is that the analysis you are going to apply to your big data is some form of machine learning.

I have issues with each of these “novel” features:

Scientists have long collected datasets and done calculations at the limit of (or beyond) their ability to process the data produced. Think of protein X-ray crystallography, astronomical data for navigation, the CERN detectors and so on. You can think of the decadal censuses run by countries such as the US and UK as n = all, or the data fed to the early LEO computer to calculate the deliveries required for each of Lyons’ hundreds of teashops. The difference today is that people and companies can effortlessly collect a larger quantity of data than ever before; they can collect data without thinking about it first. The idea of n = all is not really a help. The straw man against which it is placed is the selection of a subset of data by sampling.

They say that big data is messy, implying that what went before was not. One of the failings of the book is its disregard for the researchers who have gone before. According to the authors, the new big data analysts are comfortable with messiness and uncertainty, unlike those fuddy-duddy statisticians! But small data is messy too; scientists and statisticians have long dealt with messy and incomplete data.

The third of their features is that we must be comfortable with correlation rather than demand causation. There are many circumstances where correlation is fine, such as when Amazon uses my previous browsing and purchase history to suggest new purchases, but the field of machine learning / data mining has long grappled with messiness and causality.

This is not to say nothing has happened in the last 20 or so years regarding data. The ubiquity of computing devices, cheap storage and processing power, and the introduction of frameworks like Hadoop are all significant innovations of that period. But they grow on what went before; they are not a paradigm shift. And labelling something with a term as ill-defined as “big data” provides no helpful insight into how to deal with it.

The book could be described as the “What Google Did Next…” playbook. It opens with Google’s work on flu trends, then passes through Google’s translation work and the Google Books project. It includes examples from many other players, but one gets the impression that it is Google the authors really like. They are patronising towards Amazon for not making full use of the data it gleans from its Kindle ebook ecosystem. They pay somewhat cursory attention to issues of data privacy and consent, and have the unusual idea of creating a cadre of “algorithmists” who would vet the probity of algorithms and applications, in the manner of accountants doing audit or data protection officers.

So what is this book good for? It provides a nice range of examples of data analysis and some interesting stories regarding the use to which it has been put. It gives a fair overview of the value of data analysis and some of the risks it presents. It highlights that the term “big data” is used so broadly that it conveys little meaning. This confusion over what is meant by “Big Data” is reflected on the datascience@Berkeley blog which lists definitions of big data from 30 people in the field (here). Finally, it provides me with sufficient cover to make a supportable claim that I am a “Big Data scientist”!

To my mind, the best definition of big data that I’ve seen is that it is like teenage sex…

  • Everyone talks about it,
  • nobody really knows how to do it,
  • everyone thinks everyone else is doing it,
  • so everyone claims they are doing it too!

Book review: Greenwich Time and the Longitude by Derek Howse

I am being used as a proxy reader! My colleague drj, impressed by my reviewing activities, asked me to read Greenwich Time and the Longitude by Derek Howse, so that he wouldn’t have to.

There was some risk here that Greenwich Time and the Longitude would overlap heavily with Finding Longitude which I have recently read. They clearly revolve around the same subjects and come from the same place: the National Maritime Museum at Greenwich. Happily the overlap is relatively minor. Following some brief preamble regarding the origins of latitude and longitude for specifying locations, Greenwich Time starts with the founding of the Royal Observatory at Greenwich.

The Observatory was set up under Charles II, who personally ordered its creation in 1675, mindful of the importance of astronomy to navigation. The first Astronomer Royal was John Flamsteed. Accurate measurement of the locations of the moon and stars was a prerequisite for determining longitude at sea by both lunar-distance and clock-based means. Flamsteed’s first series of measurements was aimed at determining whether the earth rotated at a constant rate – something we take for granted, but which wasn’t necessarily the case.

Flamsteed is notorious for jealously guarding the measurements he made, and fell out with Isaac Newton over their early, unauthorised publication, which Newton arranged. A detail I’d previously missed in this episode is that Flamsteed was not very well remunerated for his work: his £100 per annum salary had to cover the purchase of instruments as well as any skilled assistance he required, which goes some way to explaining his possessiveness over his measurements.

Greenwich Time covers the development of marine chronometers in the 18th century and the period of the Board of Longitude relatively quickly.

The next step is the distribution of time. Towards the middle of the 19th century three industries were feeling the need for precise timekeeping: telegraphy, the railways and the postal service. This is in addition to the requirements of marine navigators. The first time signal, in 1833, was distributed by the fall of a large painted zinc ball on the top of the Greenwich observatory. Thereafter, strikingly similar balls appeared on observatories around the world.

From 1852 the time signal was distributed by telegraphic means, and ultimately by radio. It was the radio time signal that ultimately brought an end to the publication of astronomical tables for navigation. Britain’s Nautical Almanac, started in 1767, stopped publishing them in 1907 – less than 10 years after the invention of radio.

With the fast distribution of time signals over large distances came the issue of the variation between local time (as defined by the sun and stars) and standard time. The problem was particularly pressing in the United States, which spans multiple time zones. The culmination of this problem is the International Date Line, which passes through the Pacific: the day of the week changes on crossing the line. This was discovered by the very first circumnavigators (the survivors of Magellan’s expedition, returning in 1522), who found on meeting travellers coming from the opposite direction that they disagreed on the day of the week. I must admit to being a bit impressed by this; I can imagine it’s easy to lose track of the days on such an expedition.

I found the descriptions of congresses to standardise the meridian and time systems across multiple nations in the 1880s rather dull.

One small thing of interest in these discussions: mariners used to mark the end of the day at noon, hence what we would call “Monday morning” a mariner would call “the end of Sunday” – unless he was in harbour, in which case he would use local time! It is from 18th-century mariners that Jean-Luc Picard appears to get his catchphrase “Make it so!”: this was the traditional response of a captain to the officer making the noon latitude measurement. The meridian congresses started the process of standardising the treatment of the day between “civilians”, mariners and astronomers.

The book finishes with a discussion of high precision timekeeping. This is where we discover that Flamsteed wasn’t entirely right when he measured the earth to rotate at a constant rate. The earth’s rotation is showing a long term decrease upon which are superimposed irregular variations and seasonal variations. And the length of the year is slowly changing too. Added to that, the poles drift by about 8 metres or so over time. It’s testament to our abilities that we can measure these imperfections but somehow sad that they exist.

The book has an appendix with some detail on various measurements.

Not as sumptuous a book as Finding Longitude, it is an interesting read with a different focus. It has some overlap too with The History of Clocks and Watches by Eric Bruton.

Book review: Degrees Kelvin by David Lindley

How to start? I’ve read another book… Degrees Kelvin: A tale of genius, invention and tragedy by David Lindley. This is a biography of William Thomson, later Lord Kelvin, who lived 1824–1907.

Thomson lived at a time when the core of classical physics came into being, adding thermodynamics and electromagnetism to Newtonian mechanics. He played a significant role in creating these areas of study. As well as this he acted as a scientific advisor in the creation of the transatlantic telegraph, electric power transmission, marine compasses and a system of units for electromagnetism. He earned a substantial income from patents relating to telegraphy and maritime applications, and bought a blingy yacht (the Lalla Rookh) with the money.

He died a few years after the discovery of radioactivity, x-rays, special relativity and the first inklings of quantum mechanics – topics that were to form “modern physics”.

The book starts with William Thomson heading off to Cambridge to study maths. Prior to going he had already published in a mathematical journal on Philip Kelland’s misinterpretation of Fourier’s work on heat.

His father, James Thomson, is a constant presence through his time in Cambridge in the form of a stream of letters; these days he’d probably be described as a “helicopter parent”. He is constantly concerned with his son falling in with the wrong sort at university, and with the money he is spending. James Thomson was a professor of mathematics at Glasgow University, and William had attended his classes there along with his brother – hence his rapid entry into academic publishing.

Fourier’s work Analytical Theory of Heat is representative of a style of physics active in France at the beginning of the 19th century. He built a mathematical model of the flow of heat in materials, with techniques for calculating the temperature throughout a body – one of which, the Fourier series, is still widely used by scientists and engineers today. For this purpose the fundamental question of what heat was could be ignored: measurements could be made of heat flow and temperature, and the model explained these outward signs. Fourier’s presentation was somewhat confused, which led Philip Kelland, in his book Theory of Heat, to claim that Fourier was wrong. Thomson junior’s contribution was to clarify Fourier’s presentation and point out, fairly diplomatically, that Kelland was mistaken.
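To give a flavour of what Fourier’s technique looks like in modern hands, here is a minimal sketch (my own illustration, not from the book) of his classic result: the temperature in a rod with its ends held at zero, starting from a uniform temperature, is a sine series whose terms die away exponentially in time.

```python
import math

# Fourier's solution for heat in a rod of length L, ends held at zero,
# initial uniform temperature T0. Only odd sine terms appear for a
# uniform start; alpha is the thermal diffusivity.
def rod_temperature(x, t, L=1.0, T0=1.0, alpha=1.0, n_terms=200):
    total = 0.0
    for n in range(1, 2 * n_terms, 2):  # n = 1, 3, 5, ...
        k = n * math.pi / L
        total += (4 * T0 / (n * math.pi)) * math.sin(k * x) * math.exp(-alpha * k * k * t)
    return total

# At t = 0 the series reconstructs the initial temperature in the interior.
print(rod_temperature(0.5, 0.0))  # close to 1.0
```

The higher harmonics decay fastest (as exp(−αk²t)), which is why heat flow smooths out sharp temperature differences first – the insight that made the series such a practical tool.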

Slightly later the flow of letters from Thomson senior switches to encourage his son into the position held by the ailing William Meikleham, Professor of Natural Philosophy at Glasgow University – this project is eventually successful when Meikleham dies and Thomson takes the post in 1846. He retired from his position at Glasgow University in 1899.

William Thomson appears to have been innovative in teaching, introducing the laboratory class into the undergraduate degree, and later writing a textbook of classical physics, Treatise on Natural Philosophy, with his friend P.G. Tait.

Following his undergraduate studies at Cambridge, William went to Paris, meeting many of the scientific community there and working on thermodynamics in the laboratory of Henri Regnault. In both thermodynamics and electromagnetism Thomson played a role in the middle age of the topic: not there at the start, but not responsible for the final form of the subject either. His role was the “formalisation” of physical models made by others. So he took the idea of lines of force from Faraday’s electrical studies and made it mathematical. The point of this exercise is that the model can then be used to make quantitative predictions in complex situations – for example, the transmission of signals down submarine telegraph wires.

Commercial telegraphy came into being around 1837; the first transatlantic cable was strung in 1857, although it worked only briefly, and poorly, for a few weeks. The first successful cable was laid in 1866. It’s interesting to compare this to the similarly rapid expansion of the railways in Britain. Thomson played a part from the earliest of the transatlantic cables, contributing both theoretically and practically – he invented and patented the mirror galvanometer, which makes reading weak signals easier.

It’s a cliché to say “X was no stranger to controversy”, but Thomson had his share – constantly needling geologists over the age of the earth, and getting into spats regarding James Joule’s priority in the work on the inter-convertibility of energy. It sounds like he bears some responsibility for the air of superiority that physicists can sometimes display towards the other sciences, although it should be said that he played second fiddle to the more pugnacious P.G. Tait.

Later in life Thomson struggled to accept Maxwell’s formulation of electromagnetic theory, finding it too abstract – he was only interested in a theory with a tangible physical model beneath it. Maxwell’s theory had this at the start, an ever more complex system of gear wheels, but ultimately he cut loose from it. As an aside, the Maxwell’s equations we know today are very much an invention of Oliver Heaviside who introduced the vector calculus notation which greatly simplifies their appearance, he too cut his teeth on telegraphy.

At one point Lindley laments that Lord Kelvin has not had the reputation he deserves since his death. Reputation is a slippery thing: recognition amongst the general public is fickle and no real guide to anything, and most practising scientists pay little heed to the history of their subject – fragments are used as decoration for otherwise dull lectures.

It’s difficult to think of modern equivalents of William Thomson in science; his theoretical role is similar to that of Freeman Dyson or Richard Feynman. It’s not widely recognised, but Albert Einstein, like Thomson, was active in making patent applications, though he does not seem to have benefited financially from his patents. Thomson also played the role of the Victorian projector, like Isambard Kingdom Brunel. Projects in the 21st century are no longer so obviously the work of one scientist/engineer/project manager/promoter, these roles having generally been split into specialisms.

I was intrigued to discover that Lindley apparently uses S.P. Thompson’s 1910 biography of Kelvin as his primary source, not mentioning at all the two volume Energy and Empire by Crosbie Smith and M. Norton Wise published in 1989.

Degrees Kelvin provides a useful entry into physics and technology in the 19th century, I am now curious about the rise of electricity and marine compasses!

Book review: Finding Longitude by Richard Dunn, Rebekah Higgitt

Much of my reading comes via twitter in the form of recommendations from historians of science; in this case I am reading a book co-authored by one of those historians: Finding Longitude by Richard Dunn (@lordoflongitude) and Rebekah Higgitt (@beckyfh).

I must admit I held off buying Finding Longitude for a while since it appeared to be an exhibition brochure – maybe not so good if you haven’t attended the exhibition. It turns out to be freestanding and perfectly formed. This is definitely the most sumptuous book I’ve bought in quite some time; I’m glad I got the hardcover version rather than the Kindle edition.

The many photographs throughout the book are absolutely gorgeous: the instruments and clocks, the log books, and artwork from the time. You can get a flavour from the images online here.

To give some context to the book, knowing your location on earth is a matter of determining two parameters: latitude and longitude:

  • latitude is your location in the North-South direction between the equator and either of the earth’s poles, it is easily determined by the height of the sun or stars above the horizon, and we shall speak no more of it here.
  • longitude is the second piece of information required to specify one’s position on the surface of the earth, and is a measure of your location East–West relative to a reference meridian such as Greenwich. The earth turns at a fixed rate, and as it does the sun appears to move through the sky. You can use this behaviour to fix a local noon: the time at which the sun reaches its highest point in the sky. If, when you measure your local noon, you can also determine what time it is at some reference point – Greenwich, for example – then you can find your longitude from the difference between the two times: the earth turns 360° in 24 hours, so each hour of difference corresponds to 15° of longitude.

Knowing where you are on earth by measurement of these parameters is particularly important for sailors engaged in long distance trade or fighting. It has therefore long been an interest of governments.

The British were a bit late to the party in prizes for determining the longitude: the first of them had been offered by Philip II of Spain in 1567, and there had been activity in the area since then, primarily amongst the Spanish and Dutch. Towards the end of the 17th century the British and French got in on the act, starting with the formation of the Royal Society and the Académie des sciences respectively.

One stimulus for the creation of a British prize for determining the longitude was the deaths of 1,600 British sailors from Admiral Sir Cloudesley Shovell’s fleet off the Isles of Scilly in 1707. They died on the rocks in a storm, as a result of not knowing where they were until it was too late. As an aside, the surviving log books from Shovell’s fleet show that for latitude (the easier quantity to measure), measurements of the sun gave a 25-mile spread in location, and dead reckoning a 75-mile spread.

The Longitude Act was signed into law in 1714; it offered a prize of £20,000 to whoever produced a practicable method for determining the longitude at sea, and there was something of an air that the problem was about to be solved. The Board of Longitude was to judge the prize. The known competing techniques at the time were timekeeping by mechanical means, two astronomical methods (the lunar-distance method and the satellites of Jupiter) and dead reckoning. In practice these techniques were used in combination: mechanical timekeepers are simpler to use than the astronomical methods, but they needed checking against the astronomical gold standard, which was the only way to reset a stopped clock. Dead reckoning (finding your location by knowing how fast you’d gone in what direction) was quick and simple, and worked in all weathers. Even with a mechanical timekeeper, astronomical observations were required to measure the local time, and that didn’t work in thick cloud.
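Dead reckoning, as described above, is just repeatedly advancing a known position by speed, heading and elapsed time. A minimal sketch (my own, using a flat-earth approximation adequate for short runs; real navigators used traverse tables to the same effect):

```python
import math

def dead_reckon(lat_deg, lon_deg, speed_knots, heading_deg, hours):
    """Advance a position by speed and heading.

    Heading is in degrees clockwise from north. One nautical mile is one
    minute (1/60 degree) of latitude; a minute of longitude shrinks with
    the cosine of the latitude.
    """
    distance_nm = speed_knots * hours
    d_lat = distance_nm * math.cos(math.radians(heading_deg)) / 60.0
    d_lon = (distance_nm * math.sin(math.radians(heading_deg)) / 60.0
             / math.cos(math.radians(lat_deg)))
    return lat_deg + d_lat, lon_deg + d_lon

# Six hours at 8 knots due east from 50 N, 10 W:
lat, lon = dead_reckon(50.0, -10.0, 8.0, 90.0, 6.0)
# about 50 N, 8.76 W
```

The weakness is obvious from the code: errors in the speed and heading estimates accumulate with every leg, which is why dead reckoning had to be reset against astronomical fixes whenever the weather allowed.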

There’s no point in sailors knowing exactly where they are if maps do not describe exactly where the places they are going, or trying to avoid, actually lie. Furthermore, the lunar-distance method of finding longitude required detailed tables of astronomical data which needed updating regularly. So alongside the activities of the longitude projectors, the state mechanisms for compiling charts and making astronomical tables were built up.

John Harrison and his timepieces are the most famous part of the longitude story. Harrison produced a series of clocks and watches between 1730 and 1760, and in return received moderate funding over the period from the Board of Longitude; you can see the payment record in this blog post here. Harrison felt hard done by, since his final watches met the required precision but the Board of Longitude was reluctant to pay the full prize. Yet watches meeting the technical specification in terms of precision were far from a complete solution: despite his (begrudging) efforts, they could not be reliably reproduced even by the most talented clockmakers.

After Harrison’s final award several others made clocks based on his designs, and these were tested in a variety of expeditions in the latter half of the 18th century (such as Cook’s to Tahiti in 1769). The naval expedition carrying hydrographers, astronomers, naturalists and artists became something of a craze (see also Darwin’s trip on the Beagle). As well as clocks, men such as Jesse Ramsden were mass-producing improved instruments for navigational measurements, such as octants and sextants.

The use of chronometers to determine the longitude was not fully embedded into the Royal Navy until well into the 19th century, with the East India Company running a little ahead of it, having chronometers throughout its fleet by 1810.

Finding Longitude is a good illustration of providing the full context for the adoption of a technology. It’s the most beautiful book I’ve read in a while, and it doesn’t stint on detail.

Book review: The Shock of the Old by David Edgerton

The Shock of the Old by David Edgerton is a history of technology in the 20th century.

A central motivation of the book is that, according to the author, other histories of technology are wrong in that they focus overly on the dates and places of invention and pay little attention to the subsequent dissemination and use of technologies.

The book is divided thematically covering significance, time, production, maintenance, nations, war, killing and invention. Significance reports on the quantitative, economic significance of a technology, something on which there is surprisingly little data.

A recurring theme is the persistence of things we might consider to have been replaced by new technology – the horse, for example. It’s perhaps not surprising that a huge number of horses were used during the First World War, but the Germans, masters of the mechanised blitzkrieg, used 625,000 horses in 1941 when they invaded the Soviet Union. Nor is this the end of the story of animal power: Cuba, as a result of sanctions and the fall of the Soviet Union, was using nearly 400,000 oxen by the end of the 1990s.

The same goes for battleships and military aircraft. When Britain and Argentina fought over the Falkland Islands in 1982, the Belgrano, originally commissioned into the US Navy in 1939, was sunk by 21-inch Mk 8 torpedoes originally designed in 1927! The Falklands airstrips were bombed from Britain by Vulcan bombers refuelled by Victor in-flight tankers, both originally built in the 1950s. The reason for this is that persistent technology actually does its job pretty well: the cost of replacing it for marginal benefit is too high, and maintenance and repair mean that to a degree these technological artefacts have been almost entirely rebuilt.

The time chapter expands on the idea that the introduction of technologies to different places is not simply a case of timeshifting; it depends on the local context. We find, for example, horse-drawn carts constructed from the parts of cars, and corrugated iron and asbestos-cement as the materials of choice for construction in the new slums of the developing world. Edgerton refers to these as ‘creole’ technologies – old technologies which have been repurposed into a new life.

In terms of technology and economic growth, it has really been mass production which has led directly and obviously to economic growth, particularly in the 30 years after the Second World War known as the ‘long boom’. And whilst there was a boom in new technologies, all around the world the oldest technology, agriculture, was also experiencing a boom in productivity, overshadowed by the new things.

As usual with such a book I picked up some useful facts to deploy at the dinner table:

  • The German V-2 rocket killed more people in its production than it did in its use.
  • The inventor of the Aga cooking range was a Nobel prize winning physicist.

For a scientist this book makes for an uncomfortable read in places, since we come to the topic with preconceived ideas and positions which are not necessarily grounded in the best of historical methods. For instance, Edgerton highlights that R&D spend just doesn’t correlate with economic growth, and that to a large degree the nation of invention is not the nation which benefits from an invention.

Perhaps most damning in the eyes of scientists, their bête noire, Simon Jenkins, has supplied a cover quote:

The Shock of the Old is a book I can use. I can take it in two hands and bash it over the heads of every techno-nerd, computer geek and neophiliac futurologist I meet!

It’s a mistake to think of all scientists and computer geeks as being neophiliac. One of my colleagues works using an IBM Model M keyboard which we recently established was older than our intern; he also prefers the Vim editor, based on technology born in the 1970s. In the laboratory, the favoured computer language for scientific computation is still often FORTRAN, invented in the 1950s.

Thinking back over the other books I’ve read on the history of technology – for example A Computer Called LEO, Fire & Steam, The Subterranean Railway, Empire of the Clouds, The Idea Factory and The Backroom Boys – it is true to say they have very much focussed on single technologies or places, but to my mind they have generally been pragmatic about the impact of their chosen subject on society. The authors have each had a definite passion for their topic, leading to regret for what might have been: a thriving British aircraft industry, computer industry and so forth. But they don’t provide the litany of dates and inventions of which Edgerton accuses them.

Despite this, The Shock of the Old is readable; the author knows his field and provides a different viewpoint on the history of technology – more overarching, not so besotted. I’ll certainly be looking out for more of his books.