Author's posts
Mar 05 2015
Book review: Engineering Empires by Ben Marsden and Crosbie Smith
Commonly I read biographies of dead white men in the field of science and technology. My next book is related but a bit different: Engineering Empires: A Cultural History of Technology in Nineteenth-Century Britain by Ben Marsden and Crosbie Smith. This is a more academic tome and, rather than focussing on a particular dead white man, it gathers many of them into a broader story. A large part of the book is about steam engines, with chapters on static steam engines, steamships and railways, but alongside these are chapters on telegraphy and on mapping and measurement.
The book starts with a chapter on mapping and measurement; there’s a lot of emphasis here on measuring the earth’s magnetic field. In the eighteenth and nineteenth centuries there was some hope that maps of magnetic field variation might help in determining the longitude. The subject makes a reprise later on in the discussion of steamships. The problem isn’t so much the steam but that steamships were typically iron-hulled, which throws compass measurements awry unless careful precautions are taken. This was important because steamships were promoted for their claimed superior safety over sailing vessels, but risked running aground on the reef of dodgy compass behaviour in inshore waters. The social context for this chapter is the rise of learned societies to promote such work; the British Association for the Advancement of Science is central here, and is a theme through the book. In earlier centuries the Royal Society had been more important.
The next three chapters cover steam power, first in the factory and the mine, then in boats and trains. Although James Watt plays a role in the development of steam power, the discussion here is broader, covering Ericsson’s caloric engine amongst many other things. Two themes of steam are the professionalisation of the steam engineer, and efficiency. “Professionalisation” in the sense that when businessmen made investments in these relatively capital-intensive devices they needed confidence in what they were buying into. A chap who appeared to have just knocked something up in his shed didn’t cut it. Students of physics will be painfully aware of thermodynamics and the theoretical efficiency of engines. The 19th century was when this field started, and it was of intense economic importance. For a static engine efficiency matters because it reduces running costs. For steamships efficiency is crucial: less coal for the same power means you don’t run out of steam mid-ocean!
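(A reminder of the textbook result the physicists have in mind, rather than anything from the book itself: the Carnot limit says that no heat engine working between a hot source at absolute temperature Thot and a cold sink at Tcold can do better than an efficiency of 1 - Tcold/Thot, which is a large part of why pushing up boiler temperatures and pressures was worth serious money.)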
Switching the emphasis of the book from people to broader themes casts the “heroes” in a new light. It becomes more obvious that Isambard Kingdom Brunel is a bit of an outlier, pushing technology to the limits and sometimes falling off the edge. The Great Eastern was a commercial disaster, only gaining a small redemption when it came to laying transatlantic telegraph cables. Success in this area came with the builders of more modest steamships dedicated to particular tasks such as the transatlantic mail and trips to China.
The book finishes with a chapter on telegraphy; my previous exposure to this was via Lord Kelvin, who had been involved in the first transatlantic electric telegraphs. The precursor to electric telegraphy was optical telegraphy, which had started to be used in France towards the end of the 18th century. Transmission speeds for optical telegraphy were surprisingly high: Paris to Toulon (on the Mediterranean coast), a distance of more than 800km, in 20 minutes. In Britain the telegraph took off when it was linked with the railways, which provided a secure, protected route along which to run the lines. Although the first inklings of electric telegraphy came in the mid-18th century it didn’t get going until 1840 or so, but by 1880 it was a globe-spanning network crossing the Atlantic and reaching the Far East overland. It’s interesting to see the mention of Julius Reuter and Associated Press back at the beginning of electric telegraphy; they are still important names now.
In both steamships and electric telegraphy Britain led the way because it had an Empire to run, and communication is important when you’re running an empire. Electric telegraphy was picked up quickly on the eastern seaboard of the US as well.
I must admit I was a bit put off by the introductory chapter of Engineering Empires, which seemed a bit heavy and spoke in historiographical jargon, but once underway I really enjoyed the book. I don’t know whether this was simply because I got used to the style or because the style changed. As proper historians, Marsden and Smith do not refer to scientists in the earlier years of the 19th century as such; they are “gentlemen of science” and later “men of science”. They sound a bit contemptuous of the “gentlemen of science”. The book is a bit austere and worthy-looking. Overall I much prefer this manner of presenting the wider context to a focus on a particular individual.
Feb 10 2015
Book review: Data Science at the Command Line by Jeroen Janssens
This review was first published at ScraperWiki.
In the mixed environment of ScraperWiki we make use of a broad variety of tools for data analysis. Data Science at the Command Line by Jeroen Janssens covers tools available at the Linux command line for doing data analysis tasks. The book is divided thematically into chapters on Obtaining, Scrubbing, Modeling, Interpreting Data with “intermezzo” chapters on parameterising shell scripts, using the Drake workflow tool and parallelisation using GNU Parallel.
The original motivation for the book was a desire to move away from purely GUI-based approaches to data analysis (I think he means Excel and the Windows ecosystem). This is a common desire for data analysts: GUIs are very good for a quick look-see, but once you start wanting to repeat an analysis, or even repeat a visualisation, they become more troublesome. And launching Excel just to remove a column of data seems a bit laborious. Windows does have its own command line, PowerShell, but it’s little used by data scientists. This book is about the Linux command line, and the examples are all available on a virtual machine populated with all of the tools discussed in the book.
The command line is at its strongest in the early steps of the data analysis process: getting data from places, carrying out relatively minor acts of tidying and answering the question “does my data look remotely how I expect it to look?”. Janssens introduces the battle-tested tools sed, awk, and cut, which we use around the office at ScraperWiki. He also introduces jq (the JSON parser); this is a more recent arrival but it’s great for poking around in JSON files as commonly delivered by web APIs. An addition I hadn’t seen before was csvkit, which provides a suite of tools for processing CSV at the command line; I particularly like the look of csvstat. csvkit is a Python tool and I can imagine using it directly in Python as a library.
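By way of illustration, this is the sort of one-liner I have in mind (my own sketch rather than an example from the book; the URL and file name are made up):
curl -s 'https://api.example.com/results.json' | jq '.results[].name'   # pull one field out of a JSON API response
csvstat example.csv   # summary statistics for each column of a CSV file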
The style of the book is to provide a stream of practical examples for different command line tools, and to illustrate their application when strung together. I must admit to finding shell commands deeply cryptic in their presentation, with chunks of options effectively looking like someone typing a strong password. Data Science at the Command Line is not an attempt to clear up the mystery of these options, more an indication that you can work great wonders on finding the right incantation.
Next up is the Rio tool for using R at the command line, principally to generate plots. I suspect this is about where I part company with Janssens on his quest to use the command line for all the things. Systems like R, ipython and the ipython notebook all offer a decent REPL (read-eval-print loop) which will convert seamlessly into an actual program. I find I use these REPLs for experimentation whilst I build a library of analysis functions for the job at hand. You can write an entire analysis program using the shell but that doesn’t mean you should!
Weka provides a nice example of smoothing the command line interface to an established package. Weka is a machine learning library written in Java; it is the code behind Data Mining: Practical Machine Learning Tools and Techniques. The edge to be smoothed is that the bare command line for Weka is somewhat involved, requiring a whole pile of boilerplate. Janssens demonstrates nicely how to do this, automatically generating autocompletion hints for the parts of Weka which are accessible from the command line.
The book starts by pitching the command line as a substitute for GUI-driven applications, which is something I can agree with to at least some degree. It finishes by proposing the command line as a replacement for a conventional programming language, with which I can’t agree. My tendency would be to move from the command line to Python fairly rapidly, perhaps using ipython or the ipython notebook as a stepping stone.
Data Science at the Command Line is definitely worth reading if not following religiously. It’s a showcase for what is possible rather than a reference book as to how exactly to do it.
Feb 09 2015
Book review: Remote Pairing by Joe Kutner
This review was first published at ScraperWiki.
Pair programming is an important part of the Agile process, but sometimes the programmers are not physically co-located. At ScraperWiki we have staff who do both scheduled and ad hoc remote working, so methods for working together remotely are important to us. As a result of a casual comment on Twitter, I picked up Remote Pairing by Joe Kutner, which covers just this subject.
Remote Pairing is a short volume, less than 100 pages. It starts with the motivation for pair programming and some presentation of the evidence for its effectiveness. It then goes on to cover some of the more social aspects of pairing – how do you tell your partner you need a “comfort break”? This theme makes a slight reprise in the final chapter with some case studies of remote pairing. And then it’s into the technical aspects.
The first systems mentioned are straightforward audio/visual packages including Skype and Google Hangouts. I’d not seen ScreenHero previously but it looks like it wouldn’t be an option for ScraperWiki since our developers work primarily in Ubuntu; ScreenHero only supports Windows and OS X currently. We use Skype regularly for customer calls, and Google Hangouts for our daily standup. For pairing we typically use appear.in which provides audio/visual connections and screensharing without the complexities of wrangling Google’s social ecosystem which come into play when we try to use Google Hangouts.
But these packages are not about shared interaction; for this Kutner starts with the vim/tmux combination. This is venerable technology built into Linux systems, or at least easily installable. Vim is the well-known editor; tmux allows a user to access multiple terminal sessions inside one terminal window. The combination allows programmers to work fully collaboratively on code: both partners can type into the same workspace (there’s a minimal sketch of the idea below). You might even want to use vim and tmux when you are standing next to one another. The next chapter covers proxy servers and tmate (a fork of tmux), which make the process of sharing a session easier by providing tunnels through the Cloud.
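The simplest version of this, assuming both programmers can ssh into the same machine as the same user, is just a named tmux session (my sketch of the idea, not a recipe from the book):
tmux new-session -s pairing   # the first programmer starts a named session
tmux attach-session -t pairing   # the second programmer, logged in over ssh, attaches to the same session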
Remote Pairing then goes on to cover interactive screensharing using vnc and NoMachine, these look like pretty portable systems. Along with the chapter on collaborating using plugins for IDEs this is something we have not used at ScraperWiki. Around the office none of us currently make use of full blown IDEs despite having used them in the past. Several of us use Sublime Text for which there is a commercial sharing product (floobits) but we don’t feel sufficiently motivated to try this out.
The chapter on “building a pairing server” seems a bit out of place to me; the content is quite generic. Perhaps because at ScraperWiki we have always written code in the Cloud we take it for granted. The scheme Kutner follows uses Vagrant and Puppet to configure servers in the Cloud (a sketch of the basic Vagrant workflow is below). This is a fairly effective scheme. We have been using Docker extensively, which is a slightly different thing, since a Docker container is not a virtual machine.
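For anyone who hasn’t met Vagrant, the basic workflow looks something like this (my own sketch, not the book’s exact configuration; the box name is just an example):
vagrant init ubuntu/trusty64   # writes a Vagrantfile describing the virtual machine
vagrant up   # boots the VM and runs any provisioning, such as Puppet manifests
vagrant ssh   # log in to the running machine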
Are we doing anything different in the office as a result of this book? Yes – we’ve got a good quality external microphone (a Blue Snowball), and it’s so good I’ve got one for myself. Managing audio is still something that seems a challenge for modern operating systems. To a human it seems obvious that if we’ve plugged in a headset and opened up Google Hangouts then we might want to talk to someone and that we might want to hear their voice too. To a computer this seems unimaginable. I’m looking to try out NoMachine when a suitable occasion arises.
Remote Pairing is a handy guide for those getting started with remote working, and it’s a useful summary for those wanting to see if they are missing any tricks.
Jan 30 2015
Git–notes
I’ve discovered that my blog is actually a good place to put things I need to remember; see, for example, my blog post on running Ubuntu in a VM on Windows 8.
In this spirit here are my notes on using git, the distributed version control system (DVCS). These are things I picked up around the office at ScraperWiki; I wrote something there about the scheme we use for Git. This is more a compendium of useful git commands.
I use Git on both Windows and Ubuntu and I have accounts with both GitHub and Bitbucket. I’ve configured ssh on my Windows and Ubuntu machines and use that for authentication. On Windows I interact with Git using Git Bash.
Installation
On installing Git I do the following setup, obviously using my own name and email:
git config --global user.name "John Doe"
git config --global user.email johndoe@example.com
git config --global core.editor vim
I can list my config settings using:
git config -l
Starting a repo
To start a new repo we do:
git init
These days I feel bereft if I’m not “pushing” my local repository to an online repository like GitHub or Bitbucket. To add a remote repository, create one using the service of your choice, which will probably then ask you to do:
git remote add origin [url]
Alternatively you can clone an existing repository into a subdirectory of your current directory with the name of the repo:
git clone [url]
This one clones into the current directory, making a mess if that’s not what you intended!
git clone [url] .
A variant, if you are using a repo with submodules in it:
git clone --recursive [url]
If you forgot to do the above on first cloning then you can do:
git submodule update --init
Adding and committing files
If you’ve started a new repository then you need to add some files to track:
git add [filename]
You don’t have to commit all the changes you’ve made since the last commit; you can select them using the -p option:
git add -p
And commit them to the repository with a commit command like:
git commit -m [message]
Alternatively you can write the commit message in your favoured editor, with the difference from the previous commit shown below it:
git commit -a -v
I tend to use a remote repository as a backup, so I regularly do:
git push origin HEAD
If someone else is working on the same repository as you then things get more complicated but that’s out of the scope of this post.
Undoing things
If you get your commit message wrong you can edit it with:
git commit --amend
If you change your mind about staging a file for commit:
git reset HEAD [filename]
If you change your mind about the modifications you have made to a file since the last commit then you can revert to the last commit using this **destructive** command:
git checkout -- [filename]
You should be careful doing that since it will obliterate any changes you’ve made to a file, even if you saved them from the editor.
Working out where you are
You can list files in the repo with:
git ls-tree --full-tree -r HEAD
The general command for seeing what is going on is:
git status
This tells you if you have made edits which have not been staged, which branch you are on, and which files are not being tracked. Whilst you are working you can see the difference from the previous commit using:
git diff
If you’ve already added files to commit then you need to do:
git diff --cached
You can see a list of all your changes using:
git log
This command gives you more information, in a more compact form:
git log --oneline --graph --decorate
This is a good way of seeing the status of your branch and the other branches in the repository. I have aliased this set of log options as:
git lg
To do this I added the following to my ~/.gitconfig file:
[alias]
    lg = log --oneline --graph --decorate
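Alternatively the same alias can be set without editing the file directly:
git config --global alias.lg "log --oneline --graph --decorate"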
Once you’ve committed a bunch of changes you might want to push them to a remote server. This pushes to the remote called origin, and HEAD ensures you push to your current branch. HEAD is Git’s shorthand for the latest commit on the current branch:
git push origin HEAD
Branches
The preceding commands are how you’d work using a single master branch, if you were working alone on something simple, for example. If you are working with other people or on something more complicated then you probably want to work on a branch; you can make a new branch by doing:
git checkout -b [branch name]
You can find out what other branches are available by doing:
git branch -v -a
Once you are on a branch you can commit changes, and push them onto your remote server, just as if you were on the master branch.
Merging and rebasing
The excitement comes when you want to merge your changes onto the master branch, or you want to get changes onto your own branch that someone else has made and pushed to the remote repository. The quick and dirty way to do this is using:
git pull
This does a fetch and merge all at the same time. The better way is to fetch the changes and then merge them:
git fetch --prune --all
git merge origin/master
If you are working with someone else then you may prefer to merge changes onto the master branch by making a pull request on GitHub or BitBucket.
Accepting Pull Requests from Forks
If someone makes a Pull Request based on their forked copy of a repo then you can download it for testing by doing:
git fetch origin pull/ID/head:BRANCHNAME
Jan 20 2015
Book review: Sextant by David Barrie
The longitude and navigation at sea have been a recurring theme over the last year of my reading. Sextant by David Barrie may be the last in the series. It is subtitled “A Voyage Guided by the Stars and the Men Who Mapped the World’s Oceans”.
Barrie’s book is something of a travelogue: each chapter starts with an extract from his diary of crossing the Atlantic in a small yacht as a (late) teenager in the early seventies. Here he learnt something of celestial navigation. The chapters themselves are a mixture of those on navigational techniques and those on significant voyages. Included in the latter are voyages such as those of Cook and Flinders, Bligh, various French explorers including Bougainville and La Pérouse, Fitzroy’s expeditions in the Beagle, and Shackleton’s expedition to the Antarctic. These are primarily voyages from the second half of the 18th century, exploring the Pacific coasts.
Celestial navigation relies on being able to measure the location of various bodies such as the sun, moon, Pole star and other stars. Here “location” means the angle between the body and some other point such as the horizon. Such measurements can be used to determine latitude and, in a rather more complex manner, longitude. Devices such as the back-staff and cross-staff were in use during the 16th century. During the latter half of the 17th century it became obvious that one method to determine the longitude would be to measure the location of the moon relative to the immobile background of stars, the so-called lunar distance method. To determine the longitude to the precision required by the Longitude Act of 1714 would require those measurements to be made to a high degree of accuracy.
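(To give a flavour, and this is my gloss rather than Barrie’s: the simplest case is the noon sight, where an observer with the sun due south measures its altitude above the horizon, and then latitude = 90° − altitude + the sun’s declination, the declination being looked up in printed tables. Longitude is far harder, which is what the lunar distance method and, later, the chronometer were for.)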
Newton invented a quadrant device somewhat similar to the sextant in the late 17th century, but the design was not published until 1742, after his death; in the meantime Hadley and Thomas Godfrey made independent inventions. The “quadrant” is actually an eighth of a circle segment but, thanks to its double-reflection optics, allows measurements up to 90 degrees. A sextant subtends a sixth of a circle and, similarly, allows measurements up to 120 degrees.
The sextant of the title was first made by John Bird in 1757, commissioned by a naval officer who had made the first tests on the lunar distance method for determining the longitude at sea using Tobias Mayer’s lunar distance tables.
Both quadrant and sextant are more sophisticated devices than their cross- and back-staff precursors. They comprise a graduated angular scale and optics to bring the target object and reference object together, and to prevent the user gazing at the sun with an unprotected eye. The design of the sextant has changed little since its invention. As a scientist who has worked with optics, I find they look like pieces of modern optical equipment in terms of their materials, finish and mechanisms.
Alongside the sextant the chronometer was the second essential piece of navigational equipment, used to provide the time at a reference location (such as Greenwich) to compare with local time to get the longitude. Chronometers took a while to become a reliable piece of equipment: at the end of the Beagle’s four-year voyage in 1830 only half of the 22 chronometers were still running well. Shackleton’s expedition in 1914 suffered even more, with the final stretch of their voyage to South Georgia relying on the last working one of 24 chronometers. Granted, his ship, the Endurance, had been broken up by ice and they had escaped to Elephant Island in a small, open boat! Note the large numbers of chronometers taken on these voyages of exploration.
Barrie is of the more subtle persuasion in the interpretation of the history of the chronometer. John Harrison certainly played a huge part in this story but his chronometers were exquisite, expensive, unique devices*. Larcum Kendall’s K1 chronometer was taken by Cook on his 1769 voyage. Kendall was paid a total of £500 for this chronometer, made as a demonstration that Harrison’s work could be repeated. This cost should be compared with the £2800 the navy paid for HMS Endeavour, the ship in which the voyage was made!
An amusing aside: when the Ordnance Survey located the Scilly Isles by triangulation in 1797 they discovered their location was 20 miles from that which had previously been assumed, meaning that prior to this measurement the location of Tahiti was better known, through the astronomical observations made by Cook’s mission, than that of the Scillies.
The risks the 18th century explorers ran are pretty mind-boggling. Even if the expedition was not lost entirely, as La Pérouse’s was, losing 25% of the crew was not exceptional. It’s reminiscent of the Apollo moon missions: thankfully casualties there were remarkably low, but the crews of the earlier missions had a pretty pragmatic view of the serious risks they were running.
This book is different from the others I have read on marine navigation, more relaxed and conversational but with more detail on the nitty-gritty of the process of marine navigation. Perhaps my next reading in this area will be the accounts of some of the French explorers of the late 18th century.
*In the parlance of modern server management Harrison’s chronometers were pets not cattle!