Dr Administrator

Author's posts

Book review: Falling Upwards by Richard Holmes

fallingupwardsI read Richard Holmes book The Age of Wonder some time ago, in it he made a brief mention of balloons in the 18th century. It pricked my curiosity, so when I saw his book Falling Upwards, all about balloons, I picked it up.

The chapters of Falling Upwards cover a series of key points in the development of ballooning, typically hydrogen balloons from the last couple of decades of the 18th century to the early years of the 20th century. One of the early stories is a flight from my own home city, Chester. Thomas Baldwin recorded his flight in Airopaidia: Containing the Narrative of a Balloon Excursion from Chester, the eighth of September, 1785. The book does not have the air of a rigorous history of ballooning, it introduces technical aspects but not systematically. It is impressionistic to a degree, and as a result a rather pleasant read. For Holmes the artistic and social impact of balloons are as important as the technical.

In the beginning there was some confusion as to the purposes to which a balloon might be put, early suggestions included an aid to fast messengers who would stay on the ground to provide but use a small balloon to give them “10 league boots”, there were similar suggestions for helping heavy goods vehicles.

In practice for much of the period covered balloons were used mainly for entertainment – both for pleasure trips but also aerial displays involving acrobatics and fireworks. Balloons were also used for military surveillance.  Holmes provides chapters on their use in the American Civil War by the Union side (and very marginally by the Confederates). And in the Franco-Prussian war they were used to break the Prussian siege of Paris (or at least bend it). The impression gained though is that they were something like novelty items for surveillance. By the time of the American Civil War in the 1860’s it wasn’t routine or obvious that one must use balloon surveillance, it wasn’t a well established technique. This was likely a limitation of both the balloons themselves and the infrastructure required to get them in the air.

Balloons gave little real utility themselves, except in exceptional circumstances, but they made a link to heavier-than-air flight. They took man into the air, and showed the possibilities but for practical purposes generally didn’t deliver – largely due to their unpredictability. To a large extent you have little control of where you will land in a balloon once you have gone up. Note, for example, that balloons were used to break the Prussian siege of Paris in the outbound direction only. A city the size of Paris is too small a target to hit, even for highly motivated fliers.

Nadar (pseudonym of Gaspard-Félix Tournachon), who lived in Paris, was one of the big promoters of just about anything. He fought a copyright battle with his brother over his, adopted, signature. Ballooning was one of his passions, he inspired Jules Verne to starting writing science fiction. His balloon, Le Géant, launched in 1863 was something of a culmination in ballooning – it was enormous – 60 metres high but served little purpose other than to highlight the limitations of the form – as was Nadar’s intent.

From a scientific point of view Falling Upwards covers James Glaisher and Henry Coxwell’s flights in the mid-nineteenth century. I was impressed by Glaisher’s perseverance in taking manual observations at a rate of one every 9 seconds throughout a 90 minute flight. Glaisher had been appointed by the British Association for the Advancement of Science to do his work, he was Superintendent for Meteorology and Magnetism at the Royal Greenwich Observatory. With his pilot Henry Coxwell he made a record-breaking ascent to approximately 8,800 meters in 1862, a flight they were rather lucky to survive. Later in the 19th century other scientists were to start to identify the layers in the atmosphere. Discovering that it is only a thin shell – 5 miles or so thick which is suitable for life.

The final chapter is on the Salomon Andrée’s attempt to reach the North Pole by balloon, as with so many polar stories it ends in cold, lonely, perhaps avoidable death for Andrée and his two colleagues. Their story was discovered when the photos and journals were recovered from White Island in the Artic Circle, some 30 years after they died.

Falling Upwards is a rather conversational history. Once again I’m struck by the long periods for technology to reach fruition. It’s true that from a technology point of view that heavier-than-air flight is very different from ballooning. But it’s difficult to imagine doing the former without the later.

Of Matlab and Python

I’ve been a scientist and data analyst for nearly 25 years. Originally as an academic physicist, then as a research scientist in a large fast moving consumer goods company and now at a small technology company in Liverpool. In common to many scientists of my age I came to programming in the early eighties when a whole variety of home computers briefly flourished. My first formal training in programming was FORTRAN after which I have made my own way.

I came to Matlab in the late nineties, frustrated by the complexities of producing a smooth workflow with FORTRAN involving interaction, analysis and graphical output.

Matlab is widely used in academic circles and a number of industries because it provides a great deal of analytical power in a user-friendly environment. Its notation for handling matrix (array) calculations is slick. Its functionality is extended by a range of toolboxes, and there is a community of scientists sharing new functionality. It shares this feature set with systems such as IDL and PV-WAVE.

However, there are a number of issues with Matlab:

  • as a programming language it has the air of new things being botched onto a creaking frame. Support for unit testing is an afterthought, there is some integration of source control into the Matlab environment but it is with Source Safe. It doesn’t support namespaces. It doesn’t support common data structures such as dictionaries, lists and sets.
  • The toolbox ecosystem is heavily focused on scientific applications, generally in the physical sciences. So there is no support for natural language processing, for example, or building a web application based on the powerful analysis you can do elsewhere in the ecosystem;
  • the licensing is a nightmare. Once you’ve got core Matlab additional toolboxes containing really useful functionality (statistics, database connections, a “compiler”) are all at an additional cost. You can investigate pricing here. In my experience you often find yourself needing a toolbox for just a couple of functions. For an academic things are a bit rosier, universities get lower price licenses and the process by which this is achieved is opaque to end-users. As an industrial user, involved in the licensing process, it is as bad as line management and sticking needles in your eyes in the “not much fun thing to do” stakes;
  • running Matlab with network licenses means that your code may stop running part way through because you’ve made a call to a function to which you can’t currently get the license. It is difficult to describe the level of frustration and rage this brings. Now of course one answer is to buy individual licenses for all, or at least a significant surplus of network licenses. But tell that to the budget holder particularly when you wanted to run the analysis today. The alternative is to find one of the license holders of the required toolbox and discover if they are actually using it or whether they’ve gone off for a three hour meeting leaving Matlab open;
  • deployment to users who do not have Matlab is painful. They need to download a more than 500MB runtime, of exactly the right version and the likelihood is they will be installing it just for your code;

I started programming in Python at much the same time as I started on Matlab. At the time I scarcely used it for analysis but even then when I wanted to parse the HTML table of contents for Physical Review E, Python was the obvious choice. I have written scrapers in Matlab but it involved interfering with the Java underpinnings of the language.

Python has matured since my early use. It now has a really great system of libraries which can be installed pretty much trivially, they extend far beyond those offered by Matlab. And in my view they are of very good quality. Innovation like IPython notebooks take the Matlab interactive style of analysis and extend it to be natively web-based. If you want a great example of this, take a look at the examples provided by Matthew Russell for his book, Mining the Social Web.

Python is a modern language undergoing slow, considered improvement. That’s to say it doesn’t carry a legacy stretching back decades and changes are small, and directed towards providing a more consistent language. Its used by many software developers who provide a source of help, support and an impetus for an decent infrastructure.

Ubuntu users will find Python pre-installed. For Windows users, such as myself, there are a number of distributions which bundle up a whole bunch of libraries useful for scientists and sometimes an IDE. I like python(x,y). New libraries can generally be installed almost trivially using the pip package management system. I actually use Python in Ubuntu and Windows almost equally often. There are a small number of libraries which are a bit more tricky to install in Windows – experienced users turn to Christoph Gohlke’s fantastic collection of precompiled binaries.

In summary, Matlab brought much to data analysis for scientists but its time is past. An analysis environment built around Python brings wider functionality, a better coding infrastructure and freedom from licensing hell.

Inordinately fond of beetles… reloaded!

sciencemuseum_logo

This post was first published at ScraperWiki.

Some time ago, in the era before I joined ScraperWiki I had a play with the Science Museums object catalogue. You can see my previous blog post here. It was at a time when I was relatively inexperienced with the Python programming language and had no access to Tableau, the visualisation software. It’s a piece of work I like to talk about when meeting customers since it’s interesting and I don’t need to worry about commercial confidentiality.

The title comes from a quote by J.B.S. Haldane, who was asked what his studies in biology had told him about the Creator. His response was that, if He existed then he was “inordinately fond of beetles”.

The Science Museum catalogue comprises three CSV files containing information on objects, media and events. I’m going to focus on the object catalogue since it’s the biggest one by a large margin – 255,000 objects in a 137MB file. Each object has an ID number which often encodes the year in which the object was added to the collection; a title, some description, it often has an “item name” which is a description of the type of object, there is sometimes information on the date made, the maker, measurements and whether it represents part or all of an object. Finally, the objects are labelled according to which collection they come from and which broad group in that collection, the catalogue contains objects from the Science Museum, Nation Railway Museum and National Media Museum collections.

The problem with most of these fields is that they don’t appear to come from a controlled vocabulary.

Dusting off my 3 year old code I was pleased to discover that the SQL I had written to upload the CSV files into a database worked almost first time, bar a little character encoding. The Python code I’d used to clean the data, do some geocoding, analysis and visualisation was not in such a happy state. Or rather, having looked at it I was not in such a happy state. I appeared to have paid no attention to PEP-8, the Python style guide, no source control, no testing and I was clearly confused as to how to save a dictionary (I pickled it).

In the first iteration I eyeballed the data as a table and identified a whole bunch of stuff I thought I needed to tidy up. This time around I loaded everything into Tableau and visualised everything I could – typically as bar charts. This revealed that my previous clean up efforts were probably not necessary since the things I was tidying impacted a relatively small number of items. I needed to repeat the geocoding I had done. I used geocoding to clean up the place of manufacture field, which was encoded inconsistently. Using the Google API via a Python library I could normalise the place names and get their locations as latitude – longitude pairs to plot on a map. I also made sure I had a link back to the original place name description.

The first time around I was excited to discover the Many Eyes implementation of bubble charts, this time I now realise bubble charts are not so useful. As you can see below in these charts showing the number of items in each subgroup. In a sorted bar chart it is very obvious which subgroup is most common and the relative sizes of the subgroup. I’ve coloured the bars by the major collection to which they belong. Red is the Science Museum, Green is the National Rail Museum and Orange is the National Media Museum.

image

Less discerning members of ScraperWiki still liked the bubble charts.

image

We can see what’s in all these collections from the item name field. This is where we discover that the Science Museum is inordinately fond of bottles. The most common items in the collection are posters, mainly from the National Rail Museum but after that there are bottles, specimen bottles, specimen jars, shops rounds (also bottles), bottle, drug jars, and albarellos (also bottles). This is no doubt because bottles are typically made of durable materials like glass and ceramics, and they have been ubiquitous in many milieu, and they may contain many and various interesting things.

image

Finally I plotted the place made for objects in the collection, this works by grouping objects by location and then finding latitude and longitude for those group location. I then plot a disk sized by the number of items originating at that location. I filtered out items whose place made was simply “England” or “London” since these made enormous blobs that dominated the map.

 

image

 

You can see a live version of these visualisation, and more on Tableau Public.

It’s an interesting pattern that my first action on uploading any data like this to Tableau is to do bar chart frequency plots for each column in the data, this could probably be automated.

In summary, the Science Museum is full of bottles and posters, Tableau wins for initial visualisations of a large and complex dataset.

Book review: Big data by Viktor Mayer-Schönberger and Kenneth Cukier

BigData

This review was first published at ScraperWiki.

We hear a lot about “Big Data” at ScraperWiki. We’ve always been a bit bemused by the tag since it seems to be used indescriminately. Just what is big data and is there something special I should do with it? Is it even a uniform thing?

I’m giving a workshop on data science next week and one of the topics of interest for the attendees is “Big Data”, so I thought I should investigate in a little more depth what people mean by “Big Data”. Hence I have read Big Data by Viktor Mayer-Schönberger and Kenneth Cukier, subtitled “A Revolution That Will Transform How We Live, Work and Think” – chosen for the large number of reviews it has attracted on Amazon. The subtitle is a guide to their style and exuberance.

Their thesis is that we can define big data, in contrast to earlier “little data”, by three things:

  • It’s big but not necessarily that big, their definition for big is that n = all. That is to say that in some domain you take all of the data you can get hold of. They use as one example a study on bout fixing in sumo wrestling, based on  data on 64,000 bouts – which would fit comfortably into a spreadsheet. Other data sets discussed are larger such as credit card transaction data, mobile telephony data, Google’s search query data…;
  • Big data is messy, it is perhaps incomplete or poorly encoded. We may not have all the data we want from every event, it may be encoded using free text rather than strict categories and so forth;
  • Working with big data we must discard an enthusiasm for causality and replace it with correlation. Working with big data we shouldn’t mind too much if our results are just correlations rather than explanations (causation);
  • An implicit fourth element is that the analysis you are going to apply to your big data is some form of machine learning.

I have issues with each of these “novel” features:

Scientists have long collected datasets and done calculations that are at the limit (or beyond) their ability to process the data produced. Think protein x-ray crystallography, astronomical data for navigation, the CERN detectors etc etc. You can think of the decadal censuses run by countries such as the US and UK as n = all. Or the data fed to the early LEO computer to calculate the deliveries required for each of their hundreds of teashops. The difference today is that people and companies are able to effortlessly collect a larger quantity of data than ever before. They’re able to collect data without thinking about it first. The idea of n = all is not really a help. The straw man against which it is placed is the selection of a subset of data by sampling.

They say that big data is messy implying that what went before was not. One of the failings of the book is their disregard for those researchers that have gone before. According to them the new big data analysts are comfortable with messiness and uncertainty, unlike those fuddy-duddy statisticians! Small data is messy, scientists and statisticians have long dealt with messy and incomplete data.

The third of their features: we must be comfortable with correlation rather than demand causation. There are many circumstances where correlation is OK, such as when Amazon uses my previous browsing and purchase history to suggest new purchases but the area of machine learning / data mining has long struggled with messiness and causality.

This is not to say nothing has happened in the last 20 or so years regarding data. The ubiquity of computing devices, cheap storage and processing power and the introduction of frameworks like Hadoop are all significant innovations in the last 20 years. But they grow on things that went before, they are not a paradigm shift. Labelling something as ‘big data’, so ill-defined, provides no helpful insight as to how to deal with it.

The book could be described as the “What Google Did Next…” playbook. It opens with Google’s work on flu trends, passes through Google’s translation work and Google Books project. It includes examples from many other players but one gets the impression that it is Google they really like. They are patronising of Amazon for not making full use of the data they glean from their Kindle ebook ecosystem. They pay somewhat cursory attention to issues of data privacy and consent, and have the unusual idea of creating a cadre of algorithmists who would vet the probity of algorithms and applications in the manner of accountants doing audit or data protection officers.

So what is this book good for? It provides a nice range of examples of data analysis and some interesting stories regarding the use to which it has been put. It gives a fair overview of the value of data analysis and some of the risks it presents. It highlights that the term “big data” is used so broadly that it conveys little meaning. This confusion over what is meant by “Big Data” is reflected on the datascience@Berkeley blog which lists definitions of big data from 30 people in the field (here). Finally, it provides me with sufficient cover to make a supportable claim that I am a “Big Data scientist”!

To my mind, the best definition of big data that I’ve seen is that it is like teenage sex…

  • Everyone talks about it,
  • nobody really knows how to do it,
  • everyone thinks everyone else is doing it,
  • so everyone claims they are doing it too!

Book review: Greenwich Time and the Longitude by Derek Howse

greenwich_timeI am being used as a proxy reader! My colleague drj, impressed by my reviewing activities, asked me to read Greenwich Time and the Longitude by Derek Howse, so that he wouldn’t have to.

There was some risk here that Greenwich Time and the Longitude would overlap heavily with Finding Longitude which I have recently read. They clearly revolve around the same subjects and come from the same place: the National Maritime Museum at Greenwich. Happily the overlap is relatively minor. Following some brief preamble regarding the origins of latitude and longitude for specifying locations, Greenwich Time starts with the founding of the Royal Observatory at Greenwich.

The Observatory was set up under Charles II who personally ordered it’s creation in 1675, mindful of the importance of astronomy to navigation. The first Royal Astronomer was John Flamsteed. Accurate measurement of the locations of the moon and stars was a prerequisite for determining the longitude at sea both by lunar-distance and clock based means. Flamsteed’s first series of measurements was aimed at determining whether the earth rotated at a constant rate, something we take for granted but wasn’t necessarily the case.

Flamsteed is notorious for jealously guarding the measurements he made, and fell out with Isaac Newton over their early, unauthorised publication which Newton arranged. A detail I’d previously missed in this episode is that Flamsteed was not very well remunerated for his work, his £100 per annum salary had to cover the purchase of instruments as well as any skilled assistance he required which goes some way to explaining his possessiveness over the measurements he made. 

Greenwich Time covers the development of marine chronometers in the 18th century and the period of the Board of Longitude relatively quickly.

The next step is the distribution of time. Towards the middle of the 19th century three industries were feeling the need for precise timekeeping: telegraphy, the railways and the postal service. This is in addition to the requirements of marine navigators. The first time signal, in 1833, was distributed by the fall of a large painted zinc ball on the top of the Greenwich observatory. Thereafter, strikingly similar balls appeared on observatories around the world.

From 1852 the time signal was distributed by telegraphic means, and ultimately by radio. It was the radio time signal that ultimately brought an end to the publication of astronomical tables for navigation. Britain’s Nautical Almanac, started in 1767, stopped publishing them in 1907 – less than 10 years after the invention of radio.

With the fast distribution of time signals over large distances came the issue of the variation between local time (as defined by the sun and stars) and the standard time. The problem was particularly pressing in the United States which spanned multiple time zones. The culmination of this problem is the International Date Line, which passes through the Pacific. Here the day of the week changes on crossing the line, a problem discovered by the very first circumnavigators (Magellan’s expedition in 1522), identified when they reached travellers who had arrived from the opposite direction and disagreed on the day of the week. I must admit to being a bit impressed by this, I can imagine it’s easy to lose track of the days on such an expedition.

I found the descriptions of congresses to standardise the meridian and time systems across multiple nations in the 1880s rather dull.

One small thing of interest in these discussions: mariners used to measure the end of the day at noon, hence what we would call “Monday morning” a mariner would call “the end of Sunday”, unless he was at harbour – in which case he would use local time! It is from 18th century mariners that Jean Luc Picard appears to get his catchphrase “Make it so!”, this was the traditional response of a captain to the officer making the noon latitude measurement. The meridian congresses started the process of standardising the treatment of the day by “civilians”, mariners and astronomers.

The book finishes with a discussion of high precision timekeeping. This is where we discover that Flamsteed wasn’t entirely right when he measured the earth to rotate at a constant rate. The earth’s rotation is showing a long term decrease upon which are superimposed irregular variations and seasonal variations. And the length of the year is slowly changing too. Added to that, the poles drift by about 8 metres or so over time. It’s testament to our abilities that we can measure these imperfections but somehow sad that they exist.

The book has an appendix with some detail on various measurements.

Not as sumptuous a book as Finding Longitude it is an interesting read with a different focus. It has some overlap too with The History of Clocks and Watches by Eric Bruton.