Git–notes

I’ve discovered that my blog is actually a good place to put things I need to remember; see, for example, my blog post on running Ubuntu in a VM on Windows 8.

In this spirit, here are my notes on using Git, the distributed version control system (DVCS). These are things I picked up around the office at ScraperWiki; I wrote something there about the scheme we use for Git. This post is more a compendium of useful Git commands.

I use Git on both Windows and Ubuntu and I have accounts with both GitHub and Bitbucket. I’ve configured ssh on my Windows and Ubuntu machines and use that for authentication. On Windows I interact with Git using Git Bash.

Installation

On installing Git I do the following setup, obviously using my own name and email:

git config --global user.name "John Doe"
git config --global user.email [email]
git config --global core.editor vim

I can list my config settings using:

git config -l
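To check a single setting rather than the whole list, `git config --get` can be used. The following is a minimal sketch, run in a throwaway repository so that the global settings are left alone; the name and email address are placeholders:

```shell
# Work in a throwaway repository so nothing global is changed
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# Without --global the setting is stored in this repo's .git/config only
git config user.name "Jane Example"        # placeholder identity
git config user.email "jane@example.com"   # placeholder address

# Read one value back
name=$(git config --get user.name)
echo "$name"

# List every setting together with the file it comes from
git config -l --show-origin
```

The `--show-origin` option is handy when a setting is not what you expect, since it shows whether the value came from the system, global or repository config file.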

Starting a repo

To start a new repo we do:

git init

These days I feel bereft if I’m not “pushing” my local repository to an online repository like GitHub or BitBucket. To add a remote repository create one using the service of your choice which will probably ask you to do:

git remote add origin [url]

Alternatively you can clone an existing repository into a subdirectory of your current directory with the name of the repo:

git clone [url]

This one clones into the current directory, making a mess if that’s not what you intended!

git clone [url] .

A variant, if you are using a repo with submodules in it:

git clone --recursive [url]

If you forgot to do the above on first cloning then you can do:

git submodule update --init
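The difference between the plain clone and a clone into a chosen directory can be tried out locally. This is a sketch only: an empty bare repository in a temporary directory stands in for the remote [url], and the directory names `myrepo`, `checkout` and `other-name` are made up for illustration.

```shell
# An empty bare repository stands in for the remote [url]
work=$(mktemp -d)
git init --quiet --bare "$work/myrepo.git"

mkdir "$work/checkout"
cd "$work/checkout"

# Default: clone into a subdirectory named after the repo ("myrepo")
# (Git warns that the repository is empty, which is fine here)
git clone --quiet "$work/myrepo.git"

# A second argument names the target directory explicitly
git clone --quiet "$work/myrepo.git" other-name

ls
```

Passing `.` as the second argument is just a special case of naming the target directory.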

Adding and committing files

If you’ve started a new repository then you need to add some files to track:

git add [filename]

You don’t have to commit all the changes you made since the last commit; you can select them using the -p option:

git add -p

And commit them to the repository with a commit command like:

git commit -m [message]

Alternatively you can add the commit message in your favoured editor with the difference from previous commit shown below:

git commit -a -v

I tend to use a remote repository as a backup so I regularly do:

git push origin HEAD

If someone else is working on the same repository as you then things get more complicated but that’s out of the scope of this post.
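Putting the steps so far together, the whole init, add, commit and push cycle can be sketched as follows. A bare repository in a temporary directory stands in for GitHub or Bitbucket, and the file name, commit message and identity are placeholders rather than anything from my real setup.

```shell
# A bare repository in a temp directory stands in for the remote service
work=$(mktemp -d)
git init --quiet --bare "$work/remote.git"

# Start a new local repo and point "origin" at the stand-in remote
git init --quiet "$work/project"
cd "$work/project"
git config user.name "Jane Example"        # placeholder identity
git config user.email "jane@example.com"
git remote add origin "$work/remote.git"

# Add a file, commit it, and push the current branch
echo "first draft" > notes.txt
git add notes.txt
git commit --quiet -m "Add notes"
git push --quiet origin HEAD

# The remote branch should now point at the same commit as the local one
branch=$(git rev-parse --abbrev-ref HEAD)
local_commit=$(git rev-parse HEAD)
remote_commit=$(git ls-remote origin "refs/heads/$branch" | cut -f1)
echo "$local_commit $remote_commit"
```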

Undoing things

If you get your commit message wrong you can edit it with:

git commit --amend

If you change your mind about staging a file for commit:

git reset HEAD [filename]

If you change your mind about the modifications you have made to a file since the last commit then you can revert to the last commit using this **destructive** command:

git checkout -- [filename]

You should be careful doing that since it will obliterate any changes you’ve made to a file, even if you saved them from the editor.
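The difference between unstaging a change and discarding it can be seen in a throwaway repository. This is a sketch with a made-up file name and placeholder identity:

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Jane Example"        # placeholder identity
git config user.email "jane@example.com"
echo "original" > notes.txt
git add notes.txt
git commit --quiet -m "Add notes"

# Edit the file and stage the change...
echo "scratch edit" > notes.txt
git add notes.txt

# ...then unstage it; the edit is still there in the working copy
git reset --quiet HEAD notes.txt
after_reset=$(cat notes.txt)

# Discard the edit itself; this is the step that cannot be undone
git checkout -- notes.txt
after_checkout=$(cat notes.txt)

echo "after reset: $after_reset / after checkout: $after_checkout"
```

Git 2.23 and later also offer `git restore --staged [filename]` and `git restore [filename]` for the same two jobs.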

Working out where you are

You can list files in the repo with:

git ls-tree --full-tree -r HEAD

The general command for seeing what is going on is:

git status

This tells you if you have made edits which have not been staged, which branch you are on and files which are not being tracked. Whilst you are working you can see the difference from the previous commit using:

git diff

If you’ve already added files to commit then you need to do:

git diff --cached
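To see how a change moves from one diff to the other when it is staged, here is a small sketch in a throwaway repository (file name and identity are placeholders); `--name-only` just keeps the output short:

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Jane Example"        # placeholder identity
git config user.email "jane@example.com"
echo "first" > notes.txt
git add notes.txt
git commit --quiet -m "Add notes"

# An unstaged edit shows up in plain "git diff"
echo "second" >> notes.txt
unstaged_before=$(git diff --name-only)

# Once staged, it disappears from "git diff" and appears in "git diff --cached"
git add notes.txt
unstaged_after=$(git diff --name-only)
staged=$(git diff --cached --name-only)

echo "before add: [$unstaged_before] after add: [$unstaged_after] staged: [$staged]"
```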

You can see a list of all your changes using:

git log

This command gives you more information, in a more compact form:

git log --oneline --graph --decorate

It is a good way of seeing the status of your branch and the other branches in the repository. I have aliased this set of log options as:

git lg

To do this I added the following to my ~/.gitconfig file:

[alias]
        lg = log --oneline --graph --decorate
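The same alias can be created from the command line instead of editing the file by hand. This sketch writes it to a throwaway repository’s own config; adding `--global` would write it to ~/.gitconfig instead:

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# Equivalent to adding the lg line to the [alias] section by hand
git config alias.lg "log --oneline --graph --decorate"

# Check what was stored
alias_value=$(git config --get alias.lg)
echo "$alias_value"
```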

Once you’ve committed a bunch of changes you might want to push them to a remote server. This pushes to the remote called origin, and HEAD ensures you push your current branch. HEAD is Git’s shorthand for the latest commit on the current branch:

git push origin HEAD

Branches

The preceding commands are how you’d work using a single master branch, if you were working alone on something simple, for example. If you are working with other people or on something more complicated then you probably want to work on a branch. You can make a new branch by doing:

git checkout -b [branch name]

You can find out what other branches are available by doing:

git branch -v -a

Once you are on a branch you can commit changes, and push them onto your remote server, just as if you were on the master branch.
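As a quick sketch of the branch commands in a throwaway repository (the branch name `experiment`, the file name and the identity are placeholders):

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Jane Example"        # placeholder identity
git config user.email "jane@example.com"
echo "original" > notes.txt
git add notes.txt
git commit --quiet -m "Add notes"

# Create a branch and switch to it in one step
git checkout --quiet -b experiment
current=$(git rev-parse --abbrev-ref HEAD)
echo "$current"

# List local and remote branches with their latest commits
git branch -v -a
```

Git 2.23 and later also accept `git switch -c [branch name]` for the create-and-switch step.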

Merging and rebasing

The excitement comes when you want to merge your changes onto the master branch or you want to get changes on your own branch made by someone else and pushed to the remote repository. The quick and dirty way to do this is using:

git pull

This does a fetch and merge all at the same time. The better way is to fetch the changes and then merge them:

git fetch --prune --all
git merge origin/master
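The fetch-then-merge sequence can be tried out without a second person. In this sketch a local bare repository stands in for the shared remote, and "alice" and "bob" are hypothetical collaborators with one working copy each; the branch name is forced to master so that it matches the commands above.

```shell
work=$(mktemp -d)
git init --quiet --bare "$work/remote.git"
git -C "$work/remote.git" symbolic-ref HEAD refs/heads/master

# Helper (made up for this example): commit in a clone under a placeholder identity
commit_as() {  # usage: commit_as <clone> <message>
    git -C "$work/$1" -c user.name="$1" -c user.email="$1@example.com" \
        commit --quiet -m "$2"
}

# Alice makes the first commit on master and pushes it
git init --quiet "$work/alice"
git -C "$work/alice" symbolic-ref HEAD refs/heads/master
git -C "$work/alice" remote add origin "$work/remote.git"
echo "line one" > "$work/alice/notes.txt"
git -C "$work/alice" add notes.txt
commit_as alice "Add notes"
git -C "$work/alice" push --quiet origin HEAD

# Bob clones, then Alice pushes a second commit
git clone --quiet "$work/remote.git" "$work/bob"
echo "line two" >> "$work/alice/notes.txt"
git -C "$work/alice" add notes.txt
commit_as alice "Extend notes"
git -C "$work/alice" push --quiet origin HEAD

# Bob fetches first, looks at what arrived, and only then merges
cd "$work/bob"
git fetch --quiet --prune --all
git log --oneline master..origin/master   # commits fetched but not yet merged
git merge --quiet origin/master

bob_head=$(git rev-parse HEAD)
alice_head=$(git -C "$work/alice" rev-parse HEAD)
echo "$bob_head $alice_head"
```

The point of splitting the pull into two steps is the gap in the middle: after the fetch you can inspect the incoming commits before anything touches your own branch.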

If you are working with someone else then you may prefer to merge changes onto the master branch by making a pull request on GitHub or BitBucket.

Accepting Pull Requests from Forks

If someone makes a Pull Request based on their forked copy of a repo then you can download it for testing by doing:

git fetch origin pull/ID/head:BRANCHNAME
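GitHub exposes each Pull Request on the remote as a ref named refs/pull/ID/head, which is what the command above fetches. The sketch below imitates this offline: a local bare repository plays the remote, and the pull request ref is created by hand, which is an assumption made purely for the demonstration (on GitHub the ref already exists). The ID 1 and the branch name pr-1 are made up.

```shell
work=$(mktemp -d)
git init --quiet --bare "$work/remote.git"
git init --quiet "$work/project"
cd "$work/project"
git config user.name "Jane Example"        # placeholder identity
git config user.email "jane@example.com"
git remote add origin "$work/remote.git"
echo "base" > notes.txt
git add notes.txt
git commit --quiet -m "Add notes"
git push --quiet origin HEAD

# Stand-in for the ref GitHub would create for pull request number 1
git push --quiet origin HEAD:refs/pull/1/head

# Fetch the pull request into a local branch called pr-1 and switch to it
git fetch --quiet origin pull/1/head:pr-1
git checkout --quiet pr-1
current=$(git rev-parse --abbrev-ref HEAD)
echo "$current"
```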

Book review: Sextant by David Barrie

Longitude and navigation at sea have been a recurring theme over the last year of my reading. Sextant by David Barrie may be the last in the series. It is subtitled “A Voyage Guided by the Stars and the Men Who Mapped the World’s Oceans”.

Barrie’s book is something of a travelogue: each chapter starts with an extract from his diary on crossing the Atlantic in a small yacht as a (late) teenager in the early seventies. Here he learnt something of celestial navigation. The chapters themselves are a mixture of those on navigational techniques and those on significant voyages. Included in the latter are voyages such as those of Cook and Flinders, Bligh, various French explorers including Bougainville and La Pérouse, Fitzroy’s expeditions in the Beagle and Shackleton’s expedition to the Antarctic. These are primarily voyages from the second half of the 18th century exploring the Pacific coasts.

Celestial navigation relies on being able to measure the location of various bodies such as the sun, moon, Pole star and other stars. Here “location” means the angle between the body and some other point such as the horizon. Such measurements can be used to determine latitude and, in a rather more complex manner, longitude. Devices such as the back-staff and cross-staff were in use during the 16th century. During the latter half of the 17th century it became obvious that one method to determine the longitude would be to measure the location of the moon relative to the immobile background of stars, the so-called lunar distance method. To determine the longitude to the precision required by the Longitude Act of 1714 would require those measurements to be made to a high degree of accuracy.

Newton invented a quadrant device somewhat similar to the sextant in the late 17th century but the design was not published until 1742, after his death; in the meantime Hadley and Thomas Godfrey made independent inventions. The frame of this reflecting quadrant spans an eighth of a circle but, because the angle is measured by reflection, it allows measurements up to 90 degrees. A sextant spans a sixth of a circle and allows measurements up to 120 degrees.

The sextant of the title was first made by John Bird in 1757, commissioned by a naval officer who had made the first tests on the lunar distance method for determining the longitude at sea using Tobias Mayer’s lunar distance tables.

Both quadrant and sextant are more sophisticated devices than their cross- and back-staff precursors. They comprise a graduated angular scale and optics to bring the target object and reference object together, and to prevent the user gazing at the sun with an unprotected eye. The design of the sextant has changed little since its invention. To me, as a scientist who has worked with optics, they look like pieces of modern optical equipment in terms of their materials, finish and mechanisms.

Alongside the sextant the chronometer was the second essential piece of navigational equipment, used to provide the time at a reference location (such as Greenwich) to compare to local time to get the longitude. Chronometers took a while to become a reliable piece of equipment: at the end of the Beagle’s four-year voyage in 1830 only half of the 22 chronometers were still running well. Shackleton’s mission in 1914 suffered even more, with the final stretch of their voyage to South Georgia using the last working of 24 chronometers. Granted his ship, the Endurance, had been broken up by ice and they had escaped to Elephant Island in a small, open boat! Note the large numbers of chronometers taken on these voyages of exploration.

Barrie is of the more subtle persuasion in the interpretation of the history of the chronometer. John Harrison certainly played a huge part in this story but his chronometers were exquisite, expensive, unique devices*. Larcum Kendall’s K1 chronometer was taken by Cook on his second voyage, which set out in 1772. Kendall was paid a total of £500 for this chronometer, made as a demonstration that Harrison’s work could be repeated. This cost should be compared to the sum of £2800 which the navy paid for HMS Endeavour, the ship in which Cook’s first voyage was made!

An amusing aside: when the Ordnance Survey located the Scilly Isles by triangulation in 1797 they discovered that their location was 20 miles from that which had previously been assumed. This means that, prior to their measurement, the location of Tahiti was better known, through the astronomical observations made by Cook’s mission.

The risks the 18th century explorers ran are pretty mind-boggling. Even if the expedition was not lost – such as that of La Pérouse – losing 25% of the crew was not exceptional. It’s reminiscent of the Apollo moon missions: thankfully casualties were remarkably low, but the crews of the earlier missions had a pretty pragmatic view of the serious risks they were running.

This book is different from the others I have read on marine navigation, more relaxed and conversational but with more detail on the nitty-gritty of the process of marine navigation. Perhaps my next reading in this area will be the accounts of some of the French explorers of the late 18th century.

*In the parlance of modern server management Harrison’s chronometers were pets not cattle!

Book review: Graph Databases by Ian Robinson, Jim Webber and Emil Eifrem

This review was first posted at ScraperWiki.

Regular readers will know I am on a bit of a graph binge at the moment. In computer science and mathematics graphs are collections of nodes joined by edges; they have all sorts of applications including the study of social networks and route finding. Having covered graph theory and visualisation, I now move on to graph databases. I started on this path with Seven Databases in Seven Weeks which introduces the Neo4j graph database.

And so to Graph Databases by Ian Robinson, Jim Webber and Emil Eifrem which, despite its general title, is really a book about Neo4j. This is no big deal since Neo4j is the leading open source graph database.

This is not just random reading, we’re working on an EU project, NewsReader, which makes significant use of RDF – a type of graph-shaped data. We’re also working on a project for a customer which involves traversing a hierarchy of several thousand nodes. This leads to some rather convoluted joining operations when done on a SQL database, a graph database might be better suited to the problem.

The book starts with some definitions, identifying the types of graph database (property graph, hypergraph, RDF). Neo4j uses property graphs where nodes and edges are distinct items and each can hold properties. In contrast RDF graphs are expressed as triples which encompass both edges and nodes. In hypergraphs multiple edges can be expressed as a single item. A second set of definitions concerns the types of graph processing system: graph databases and graph analytical engines. Neo4j is designed to provide good performance for database-like queries, acting as a backing store for a web application rather than an analytical engine to carry out offline calculations. There’s also an Appendix comparing NoSQL databases which feels like it should be part of the introduction.

A key feature of native graph databases, such as Neo4j, is “index-free adjacency”. The authors don’t seem to define this well early in the book but later on, whilst discussing the internals of Neo4j, it is all made clear: nodes and edges are stored as fixed length records with references to a list of nodes to which they are connected. This means it’s very fast to visit a node, and then iterate over all of its attached neighbours. The alternative, index-based lookups, may involve scanning a whole table to find all links to a particular node. It is in the area of traversing networks that Neo4j shines in performance terms compared to SQL.

As Robinson et al emphasise in motivating the use of graph databases, other types of NoSQL database and SQL databases are not built fundamentally around the idea of relationships between data except in quite a constrained sense. For SQL databases there is an overhead to carrying out join queries, which are SQL’s way of introducing relationships. As I hinted earlier, storing hierarchies in SQL databases leads to some nasty looking, slow queries. In practice SQL databases are denormalised for performance reasons to address these cases. Graph databases, on the other hand, are all about relationships.

Schema are an important concept in SQL databases, they are used to enforce constraints on a database i.e. “this thing must be a string” or “this thing must be in this set”. Neo4j describes itself as “schema optional”, the schema functionality seems relatively recently introduced and is not discussed in this book although it is alluded to. As someone with a small background in SQL the absence of schema in NoSQL databases is always the cause of some anxiety and distress.

A chapter on data modelling and the Cypher query language feels like the heart of the book. People say that Neo4j is “whiteboard friendly” in that if you can draw a relationship structure on a whiteboard then you can implement it in Neo4j without going through the rigmarole of making some normalised schema that doesn’t look like what you’ve drawn. This seems fair up to a point, your whiteboard scribbles do tend to be guided to a degree by what your target system is, and you can go wrong with your data model going from whiteboard to data model, even in Neo4j.

I imagine it is no accident that more recent query languages like Cypher and SPARQL look a bit like SQL. Although that said, Cypher relies on ASCII art to MATCH nodes wrapped in round brackets and edges (relationships) wrapped in square brackets with arrows -> indicating the direction of relationships:

MATCH (node1)-[rel:TYPE]->(node2)
RETURN rel.property

which is pretty un-SQL-like!

Graph Databases goes on to describe implementing an application using Neo4j. The example code in the book is in Java but there appears, in py2neo, to be a relatively mature Python client. The situation here seems to be in flux since searching the web brings up references to an older python-embedded library which is now deprecated. The book pre-dates Neo4j 2.0 which introduced some significant changes.

The book finishes with some examples from the real world and some demonstrations of popular graph theory analysis. I liked the real world examples of a social recommendation system, access control and parcel routing. The coverage of graph theory analysis was rather brief, and didn’t explicitly use Cypher, which would have made the presentation different from what you find in the usual graph theory textbooks.

Overall I have mixed feelings about this book: the introduction and overview sections are good, as is the part on Neo4j internals. It’s a rather slim volume, feels a bit disjointed and is not up to date with Neo4j 2.0 which has significant new functionality. Perhaps this is not the arena for a dead-tree publication – the Neo4j website has a comprehensive set of reference and tutorial material, and if you are happy with a purely electronic version then you can get Graph Databases for free.

Review of the year: 2014

Once again I look back on a year of blogging. You can see what I’ve been up to on the index page of this blog.

I get the feeling that my blog is just for me and a few students trying to fake having done their set reading. I regularly use my blog to remember how to fix my Ubuntu installation, and to help me remember what I’ve read.

A couple of posts this year broke that pattern.

Of Matlab and Python compared the older, proprietary way of doing scientific computing with Matlab to the rapidly growing, now mature, alternative of the Python ecosystem. I’ve used Matlab for 15 years or so as a scientist. At my new job, which is more open source and software developer oriented, I use Python. My blog post struck a chord with those burnt by licensing issues with Matlab. Basically, with Matlab you pay for a core license and then pay for toolboxes which add functionality (and sometimes you only use a small part of that functionality). It’s even more painful if you are managing networked licenses serving users across the world.

My second blog post with a larger readership was Feminism. This started with the unprofessional attire choice of a scientist on the Rosetta/Philae comet landing mission but turned into a wider, somewhat confessional post on feminism. In a nutshell: women routinely experience abuse and threat of which I believe men are almost entirely oblivious. 

As before my blogging energies have been split between my own blog here, and the ScraperWiki blog. My personal blogging is dominated by book reviews these days as, to be honest, is my blogging at ScraperWiki. I blog about data science books on the ScraperWiki blog  – typically books about software – and anything else on this blog. “Anything else” is usually broadly related to the history of science and technology.

This year has been quite eclectic. I read about the precursors to Darwin and his theory of evolution, macroeconomics, the Bell Laboratories, railways, parenthood, technology in society, finding the longitude (twice), Lord Kelvin, ballooning, Pompeii and I’ve just finished a book on Nevil Maskelyne – Astronomer Royal in the second half of the 18th century. I think my favourite of these was Finding Longitude by Richard Dunn and Rebekah Higgitt: not only is the content well written but it is beautifully presented.

Over on the ScraperWiki blog I reviewed a further 12 books, bingeing on graph theory and data mining. My favourites from the "work" set were Matthew A. Russell’s Mining the Social Web and Seven Databases in Seven Weeks. Mining the Social Web because it introduces a bunch of machine learning algorithms around interesting social data, and the examples are supplied as IPython notebooks run in a virtual machine. Seven Databases is different – it gives a whistle stop tour of various types of database but manages to give deep coverage quite quickly.

I continue to read a lot whilst not doing a huge amount of programming – as I observed last year. I did write a large chunk of the API to the EU NewsReader project we’re working on, which involved me learning SPARQL – a query language for the semantic web. Obviously to learn SPARQL I read a book, Learning SPARQL; I also had some help from colleagues on the project.

I had a lot of fun visualising the traffic and history of the London Underground, I did a second visualisation post on whether to walk between Underground stations in London.

Back on this blog I did some writing about technology, talking about my favourite text editor (Sublime Text), my experiences with Apple, Ubuntu and Windows operating systems, the dinky Asus T100 Transformer laptop, and replacing my hard drive with an SSD (much easier than I thought it would be). The Asus is sadly unused; it just doesn’t serve a useful purpose beside my tablet and ultrabook format laptop. The SSD is a revelation; it just makes everything feel quicker.

The telescope has been in the loft for much of the last year but I did a blog post on the Messier objects – nebulae and so forth, and I actually took an acceptable photo of the Orion nebula although this went unblogged.

Finally, the source of the photo at the top of the page, I visited San Sebastian for an EU project I’m working on. I only had my phone so the pictures aren’t that good.

Happy New Year!

Book review: Maskelyne – Astronomer Royal edited by Rebekah Higgitt

Over the years I’ve read a number of books around the Royal Observatory at Greenwich: books about finding the longitude or about people.

Maskelyne – Astronomer Royal edited by Rebekah Higgitt is unusual for me – it’s an edited volume of articles relating to Nevil Maskelyne by a range of authors rather than a single author work. Linking these articles are “Case Studies” written by Higgitt which provide background and coherence.

The collection includes articles on the evolution of Maskelyne’s reputation, Robert Waddington – who travelled with him on his St Helena trip, his role as a manager, the human computers used to calculate the tables in the Nautical Almanac, his interactions with clockmakers, his relationships with savants across Europe, his relationship with Joseph Banks, and his family life.

The Royal Observatory with its Astronomer Royal was founded by Charles II in 1675 with the goal of making astronomical observations to help with maritime navigation. The role gained importance in 1714 with the passing of the Longitude Act, which offered a prize to anyone who could present a practical method of finding the longitude at sea. The Astronomer Royal was one of the appointees to the Board of Longitude who judged applications. The observations and calculations done, and directed, from the Observatory were to form an important part of successful navigation at sea.

The post of Astronomer Royal was first held by John Flamsteed and then Edmund Halley. A persistent problem to the time of Maskelyne was the publication of the observations of the Astronomers Royal. Flamsteed and Newton notoriously fell out over such measurements. It seems very odd to modern eyes, but the early Astronomers Royal essentially saw the observations they made as their personal property, to be removed by executors on their death and thus lost to the nation. Furthermore, in the time of Maskelyne the Royal Observatory was not considered the pre-eminent observatory in Britain in terms of the quality of its instruments or observations.

Maskelyne’s appointment was to address these problems. He made the observations of the Observatory available to the Royal Society (the Visitors of the Observatory) on an annual basis and pushed for the publication of earlier observations. He made the making of observations a much more systematic affair, and he had a keen interest in the quality of the instruments used. Furthermore, he started the publication of the Nautical Almanac which provided sailors with a relatively quick method for calculating their longitude using the lunar distance method. He was keenly aware of the importance of providing accurate, reliable observational and calculated results.

He was appointed Astronomer Royal in 1765, not long after a trip to St Helena to make measurements of the first of a pair of Venus transits in 1761. To this he added a range of other activities which included testing the lunar distance method for finding longitude, the “going” of precision clocks over an extended period and Harrison’s H4 chronometer. In later years he was instrumental in coordinating a number of further scientific expeditions, doing things such as ensuring uniform instrumentation, providing detailed instructions for observers and giving voyages multiple scientific targets.

H4 is a primary reason for Maskelyne’s “notoriety”, in large part because of Dava Sobel’s book on finding the longitude where he is portrayed as the villain against the heroic clockmaker, John Harrison. By 1761 John Harrison had been working on the longitude problem by means of clocks for many years. Sobel’s presentation sees Maskelyne as a biased judge, favouring the lunar distance method for determining longitude and acting in his own interests against Harrison.

Professional historians of science have long felt that Maskelyne was hard done by in Sobel’s book. This book is not a rebuttal of Sobel’s but is written with the intention of bringing more information regarding Maskelyne to a general readership. It’s also stimulated by the availability of new material regarding Maskelyne.

Much of the book covers Maskelyne’s personal interactions with a range of people and groups. It details his exchanges with the “computers” who did the lengthy calculations which went into the Nautical Almanac, and his interactions with a whole range of clockmakers, whom he often recommended to others looking for precision timepieces for astronomical purposes. It also discusses his relationships with other savants across Europe and the Royal Society. His relationship with Joseph Banks garners a whole chapter. A proposition in one chapter is that such personal, rather than institutional, relationships were key to 18th century science; I can’t help feeling this is still the case.

The theme of these articles is that Maskelyne was a considerate and competent man, going out of his way to help and support those he worked with. To my mind his hallmark is bringing professionalism to the business of astronomy.

In common with Finding Longitude this book is beautifully produced, and despite the multitude of authors it hangs together nicely. It’s not really a biography of Maskelyne but perhaps better for that.