Book review: Seven Databases in Seven Weeks by Eric Redmond and Jim R. Wilson

This review was first published at ScraperWiki.

I came to databases a little late in life; as a physical scientist I didn’t have much call for them. Then a few years ago I discovered the wonders of relational databases and the power of SQL. The ScraperWiki platform strongly encourages you to save data to SQLite databases to integrate with its tools.

There is life beyond SQL databases, though, and much of it has evolved in the last few years. I wanted to learn more, and a plea on Twitter quickly brought me a recommendation for Seven Databases in Seven Weeks by Eric Redmond and Jim R. Wilson.

The book covers the key classes of database, starting with relational databases in the form of PostgreSQL. It then goes on to look at six further databases in the so-called NoSQL family – all relatively new compared to venerable relational databases. The six other databases fall into several classes: Riak and Redis are key-value stores, CouchDB and MongoDB are document databases, HBase is a columnar database and Neo4J is a graph database.

Relational databases are characterised by storage schemas involving multiple interlinked tables containing rows and columns; this layout is designed to minimise the repetition of data and to provide maximum queryability. Key-value stores hold just a key and a value, in the manner of a dictionary, but the “value” may be of a complex type; returning a value very fast given a key is the core strength of the key-value stores. The document stores MongoDB and CouchDB store JSON “documents” rather than rows. These documents can hold information in nested hierarchies which don’t all need to have the same structure; this allows maximum flexibility in the type of data to be stored, but at the cost of ease of querying.
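To make the distinction concrete, here is a minimal sketch in Python using only the standard library; the table, field names and data are invented for illustration:

```python
import sqlite3
import json

# Relational: a fixed schema of rows and columns in interlinked tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO people (name, city) VALUES (?, ?)", ("Ada", "London"))

# Document: a JSON document can nest structures, and two documents in the
# same collection need not share a schema.
doc = {
    "name": "Ada",
    "addresses": [{"city": "London", "type": "home"}],
    "interests": ["mathematics", "engines"],
}
print(json.dumps(doc, indent=2))

# Key-value: at its simplest a dictionary from key to (possibly complex)
# value; fast lookup by key is the whole interface.
store = {"person:1": doc}
print(store["person:1"]["name"])
```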

HBase fits into the Hadoop ecosystem, and the language used to describe it looks superficially like that used to describe tables in a relational database, but this is a bit misleading: HBase is designed to work with massive quantities of data but not necessarily to give the full querying flexibility of SQL. Neo4J is designed to store graph data – collections of nodes and edges – and comes with a query language particularly suited to querying (or walking) data so arranged. This seems very similar to the triplestores and SPARQL query language used in semantic web technologies.
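To give a flavour of walking a graph, here is a sketch using the official Neo4j Python driver; the local instance, credentials, node labels and data are all assumptions for illustration:

```python
from neo4j import GraphDatabase

# Connect to a hypothetical local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Cypher pattern-matches on the shape of the graph: here, people
    # exactly two "KNOWS" hops away from Alice (all names are made up).
    result = session.run(
        "MATCH (a:Person {name: $name})-[:KNOWS*2]->(b:Person) "
        "RETURN DISTINCT b.name AS name",
        name="Alice",
    )
    for record in result:
        print(record["name"])

driver.close()
```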

Relational databases are designed to give you ACID (Atomicity, Consistency, Isolation, Durability): essentially you shouldn’t be able to introduce inconsistent changes to the database, and it should always give you the same answer to the same query. The NoSQL databases described here have a subtly different core goal. Most of them are designed to work on the web and address CAP (Consistency, Availability, Partition tolerance); indeed several of them offer native REST interfaces over HTTP, which makes them very straightforward to integrate into web applications. CAP refers to the ability to return a consistent answer, from any instance of the database, in the face of network partitions – the assumption being that these databases may be stored in multiple locations on the web. The famous CAP theorem contends that you can have any two of consistency, availability and partition tolerance at any one time, but not all three together.
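CouchDB is a good example of a native REST interface: the database is driven entirely by HTTP verbs. A minimal sketch using the requests library, assuming a local CouchDB on its default port (the database and document names are made up):

```python
import requests

BASE = "http://localhost:5984"  # CouchDB's default port

# Creating a database and a document are both plain HTTP PUTs.
requests.put(f"{BASE}/scraped_data")
requests.put(
    f"{BASE}/scraped_data/dataset-001",
    json={"title": "Labour market statistics", "year": 2013},
)

# Fetching the document back is a GET returning JSON; CouchDB adds
# _id and _rev fields of its own.
doc = requests.get(f"{BASE}/scraped_data/dataset-001").json()
print(doc["title"], doc["_rev"])
```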

NoSQL databases are generally designed to scale horizontally – across more machines – rather than vertically onto ever bigger servers. Horizontal scaling takes two forms: replication, which copies the same database to multiple places to provide greater capacity to serve requests even in the face of network connectivity problems, and “sharding”, which provides the ability to store more data by fragmenting it so that some items are stored on one server and some on another.
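The idea behind sharding can be sketched in a few lines: hash the key to decide which server holds the item, so no central index is needed (the server names here are invented):

```python
import hashlib

SERVERS = ["db-a.example.com", "db-b.example.com", "db-c.example.com"]

def shard_for(key: str) -> str:
    # Use a stable hash (unlike Python's built-in hash(), which varies
    # between runs) so every client maps a given key to the same server.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SERVERS[digest % len(SERVERS)]

print(shard_for("dataset-001"))  # always the same server for this key
```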

I’m not a SQL expert by any means, but it’s telling that I learnt a huge amount about PostgreSQL in the forty or so pages devoted to it. I think this is because the focus was not on the SQL query language but rather on the infrastructure that PostgreSQL provides: for example, triggers, rules, plugins and specialised indexing for text search. I assume this style of coverage applies to the other databases. This book is not about the nitty-gritty of querying particular database types but rather about the different database systems.

The NoSQL databases generally support MapReduce-style queries. This is a scheme most closely associated with Big Data and the Hadoop ecosystem, but in this instance it is more a framework for writing queries which may be executed across a cluster of computers.
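The scheme is easy to sketch in plain Python with a word count, the traditional example; in a real system the map stage would run in parallel across the cluster:

```python
from collections import Counter
from functools import reduce

documents = ["the cat sat", "the dog sat", "the cat ran"]

# Map stage: each document is processed independently (so each could be
# handled by a different machine), giving partial counts.
partials = [Counter(doc.split()) for doc in documents]

# Reduce stage: the partial results are merged into the final answer.
totals = reduce(lambda a, b: a + b, partials)
print(totals)  # Counter({'the': 3, 'cat': 2, 'sat': 2, 'dog': 1, 'ran': 1})
```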

I’m on a bit of a graph theory binge at the moment so Neo4J was the most interesting to me.

As an older data scientist I have a certain fondness for things that have been around for a while, like FORTRAN and SQL databases, and I’ve looked with some disdain at these newfangled NoSQL things. To a degree this book has converted me, at least to the point where I look at ScraperWiki projects and think: “It might be better to use a * database for this piece of work”.

This is an excellent book, pitched at just the right level for my purposes; I’ll be looking out for more Pragmatic Programmers books in future.

Exploring the ONS

This post was first published at ScraperWiki.

The Office for National Statistics (ONS) is the United Kingdom statistical body charged by the government with the task of collecting and publishing statistics related to the economy, population and society of England and Wales at national, regional and local levels. The data is typically published in the form of Excel spreadsheets.

The ONS is working on opening up their data, and making it more accessible to users. We’ve been doing a bit of work to help with that. This is typical of a number of jobs we have done. A customer has a website containing content which they want to move/process/republish elsewhere. The current website might have been built by aggregation over a number of years, and the underlying structure of the Content Management System may not be available to them. In these circumstances making a survey of the pre-existing content is an obvious first step.

The index for the ONS reference tables and datasets can be found here. Each dataset has a title, a release date, and the type of dataset. There is also a URL to the dataset, and inside the title field there is an indication of the size of the file. We wrote a simple scraper to collect these pieces of information.

First up, we’ll look at the topics of the data released. There are a couple of routes into discovering these: one is to read the titles, which is OK as an approach, but the titles are quite wordy and sometimes it isn’t clear what they refer to. An alternative, in this case, is to look at the URLs to the documents. They look something like this:

http://www.ons.gov.uk/ons/rel/lms/labour-market-statistics/november-2013/table-unem03.xls

This can be quite revealing since, even if the website is not explicit about its structure, the URLs can reveal the structure the builder used. We process the URLs by splitting them at the slashes. The first part, http://www.ons.gov.uk/ons/rel, is common to all the URLs; subsequent parts we can use to define a hierarchy. In this case we will focus on the fourth part of the hierarchy – “labour-market-statistics” in the example above – which gives us a human-readable description of a topic. There are approximately 400 topics as defined by this metric, as opposed to 90 or so defined by the third level of the hierarchy. Using the fourth level of the hierarchy, the key areas of the website by number of documents are as follows (a sketch of the URL processing follows the list):

  • Labour market statistics
  • National population projections
  • Family spending
  • Subnational labour market statistics
  • Census
  • Annual survey of hours and earnings.
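The URL processing itself is only a few lines; a sketch (the second URL is invented to stand in for the full scraped list):

```python
from collections import Counter

urls = [
    "http://www.ons.gov.uk/ons/rel/lms/labour-market-statistics/november-2013/table-unem03.xls",
    "http://www.ons.gov.uk/ons/rel/npp/national-population-projections/2012-based/rft-table-a1.xls",
]

# Splitting on "/" puts the fourth level of the hierarchy (the part after
# /ons/rel/<third-level>/) at index 6: the human-readable topic.
topics = Counter(url.split("/")[6] for url in urls)
print(topics.most_common(20))
```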

We can visualise this as a treemap; here I am simply showing the top 20 areas by number of documents:

[Treemap: top 20 ONS topic areas by number of documents]

These 20 topics cover approximately two thirds of the total number of documents.

We can identify file types using the file extension in the URL, although this approach needs to be used a little cautiously since sometimes the extension doesn’t match the file type. Most of the files are Excel spreadsheets, although there are a few CSV and zip files – the zip files containing Excel spreadsheets. CSV appears to have been used for some of the older datasets. Most of the files are pretty small, less than 290kb, but there are a few of much larger sizes.
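Extracting the extension is straightforward with the standard library; a sketch (again with an invented second URL):

```python
from collections import Counter
from os.path import splitext
from urllib.parse import urlparse

urls = [
    "http://www.ons.gov.uk/ons/rel/lms/labour-market-statistics/november-2013/table-unem03.xls",
    "http://www.ons.gov.uk/ons/rel/lms/labour-market-statistics/november-2013/tables.zip",
]

# Take the extension from the URL's path - treat it only as a hint,
# since the extension doesn't always match the actual file type.
extensions = Counter(splitext(urlparse(url).path)[1].lower() for url in urls)
print(extensions)  # e.g. Counter({'.xls': 1, '.zip': 1})
```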

Finally, we can look at the release dates for the datasets. There are datasets from as far back as 1988 – in fact the dataset released in 1988 actually refers to data from 1984. Some data has been released regularly since about 2001, but from 2011 a wider range of data has been released on a regular basis. We can see the monthly pattern of data releases in this timeline for 2014, which is restricted to the top 20 topics identified above:

[Timeline: 2014 data releases for the top 20 topics]

This shows the big releases of labour market statistics, both national and regional, on the third Wednesday of each month. Other monthly releases include retail sales and producer prices data. And every week provisional figures on the registration of deaths in England and Wales are reported.

You can explore these data yourself using the Tableau workbook here.

The actual content of these spreadsheets is another story.

This survey approach to a website is handy for a range of applications, and the techniques used are quite general. We’ve used similar approaches to understand government and newspaper websites.

Asus T100 Transformer

I’ve gone and bought another toy!

The Asus T100 Transformer is a full Windows 8.1 machine in a 10" form factor which will "transform" from a dinky notebook format to a freestanding tablet – all of the gubbins are in the display. I paid £309 for the 32GB 2014 model which has a slightly more powerful processor than the 2013 model.

The T100 really is a proper Windows 8.1 machine, only tiny. It includes Microsoft Office, which works just as you would expect, and I installed Python(x,y), a moderately sized install which I’d expect to fail on a system which wasn’t genuine, full Windows. I’ve also installed Picasa, my favourite photo collection software, and that just works too.

The performance is pretty good for such a small package. Things got a bit laggy when I ran a 1920×1080 display over the mini-HDMI port, but not unusably so, and that seemed to be more a display driver problem than a processor problem.

The modern OS experience differs from what went before. I used my Microsoft ID when setting up, and as if by magic my personal settings appeared on the T100 – including my familiar desktop wallpaper and the few apps I had installed from the Windows app store. The same goes for Google Chrome – my default browser – once it knows who you are, all your settings appear as if by magic.

I wrote a while back when I got my Sony Vaio that it seemed like Windows 8 was designed for the tablet form factor. And it sort of is. But I have the same feeling moving from my (Android) Nexus 7 to a Windows 8 tablet as I do when I move from a Windows 8 machine to a Mac. The new place is all very nice, and I’m sure I’d settle in eventually but it’s not the same. Windows 8 is still trying to be a desktop OS and a touch OS, and that just doesn’t work very well.

The T100 hardware is OK: the display looks fine but the keyboard is a bit rattly. I would have liked a more prominent “Windows” button on the display part, in the style of an Android tablet; as it is there are three anonymous buttons on the display whose functionality I forget. Attaching and removing the keyboard kept me amused for a good half hour – the latch/unlatch mechanism is reassuringly sturdy.

For me this form factor doesn’t really fit. I have a Nexus 7 tablet, which is lighter than the T100, for reading Kindle books, Chromecasting to the TV (which I can’t do on the T100), browsing the internet and catching up on Twitter. I have a Sony Vaio T13 ultrabook which is more usable as a laptop with its 13” display but is only a bit heavier. I’ve discovered I don’t need something of intermediate size!

Interestingly, I have noted that I hold my 4 inch Nexus phone and 7 inch Nexus tablet at distances such that their displays seem the same size; to match this feat with the T100 I would need arms like a gibbon!

I can see the T100 working as a travel system for someone with a chunky laptop or desktop, or as a tablet. It’s nice to have a backup machine for work and home.

I’m intrigued by the idea of installing Ubuntu on this machine. I have Ubuntu in a virtual machine on my Sony Vaio – the process is described here – but it’s a bit fiddly launching Windows and then the VM, and the performance isn’t great. I found extensive instructions for installing Ubuntu on the T100 here; they look lengthy!

In summary: it’s impressive to get a Windows laptop in such a small form factor and for such a reasonable price, but it doesn’t really fit with my current devices except as a backup.

Book review: Pompeii by Mary Beard

For a change I have been reading about Roman history, in the form of Pompeii: The Life of a Roman Town by Mary Beard.

Mary Beard is a Cambridge classicist. I think it helps having seen her on TV, jabbing her finger at a piece of Roman graffiti, explaining what it meant and why it was important with obvious enthusiasm. For me it gave the book a personality.

I imagine I am not unusual in gaining my knowledge of Roman culture via some poorly remembered caricature presented in pre-16 history classes at school and films including the Life of Brian, Gladiator and Up Pompeii.

Pompeii is an ancient Italian town which was covered in a 4-6 metre blanket of ash by an eruption of nearby Vesuvius in 79 AD. Beneath the ash the town was relatively undamaged. It was rediscovered in 1599 but excavations only started in the mid 18th century. These revealed a very well-preserved town including much structure, artwork and the remains of the residents. The bodies of the fallen left voids in the ash which were reconstructed by filling them with plaster.

The book starts with a salutary reminder that Pompeii wasn’t a town frozen in normal times but one in extremis as it succumbed to a volcanic eruption. We can’t assume that the groups of bodies found, or the placement of artefacts, represent how they might have been found in normal daily life.

There are chapters on the history of the city, the streets, homes, painting, occupations, administration, various bodily pleasures (food, wine, sex and bathing), entertainment (theatre and gladiators) and temples.

I’ve tended to think of the Romans as a homogeneous blob who occupied a chunk of time and space. But this isn’t the case: the pre-Roman history of the town features writing in the Oscan language, and the Greek writer Strabo, working in the first century BC, wrote about a sequence of inhabitants: Oscans, Etruscans, Pelasgians and then Samnites – who also spoke Oscan.

Much of what we know of Pompeii seems to stem from the graffiti found all about the remains. It would be nice to learn a bit more about this evidence, since it seems important, and clearly something different is going on from what we find in modern homes and cities. If I look around homes I know today, none feature graffiti; granted, there is much writing on paper, but not on the walls.

From the depths of my memory I recall the naming of various rooms in the Roman bath house but it turns out these names may not have been in common usage amongst the Romans. Furthermore, the regimented progression from hottest to coldest bath may also be somewhat fanciful. Something I also didn’t appreciate was that the meanings of some words in ancient Latin are not known, or are uncertain. It’s obvious in retrospect that this might be the case but caveats on such things are rarely heard.

Beard emphasises that there has been a degree of “over-assumption” in the characterisation of the various buildings in Pompeii; on some reckonings there are huge numbers of bars and brothels. Anything with a counter and some storage jars gets labelled a bar, and anything with phallic imagery gets labelled a brothel – and the Pompeiians were very fond of phallic imagery. A more conservative treatment brings these numbers down enormously.

I am still mystified by garum, the fermented fish sauce apparently loved by many. It features moderately in the book, since the house of a local manufacturer is one of the better preserved ones, and one which features very explicit links to his trade. It sounds absolutely repulsive.

The degree of preservation in Pompeii is impressive; the scene that struck me most vividly was in the House of the Painters at Work. In this case the modern label for the house describes exactly what was going on; other houses are labelled with the names of dignitaries present when a house was uncovered, or after key objects found in the house. It is not known what the inhabitants called the houses, or even the streets; deliveries seem to have been directed by proximity to prominent buildings.

I enjoyed Pompeii; the style is readable and the book goes to some trouble to explain the uncertainty and subtlety involved in interpreting ancient remains.

Once again I regret buying a non-fiction book in ebook form. The book has many illustrations, including a set of colour plates, and I still find it clumsy looking at them in detail, or flicking backwards and forwards, in an ereader.

Book review: Graph Theory and Complex Networks by Maarten van Steen

This review was first published at ScraperWiki.

My last read, on the Gephi graph visualisation package, was a little disappointing but gave me an enthusiasm for graph theory. So I picked up one of the books that it recommended: Graph Theory and Complex Networks: An Introduction by Maarten van Steen to learn more. In this context a graph is a collection of vertices connected by edges; the edges may be directed or undirected. The road network is an example of a graph: the junctions between roads are vertices, the roads themselves are edges, and a one-way street is a directed edge – two-way streets are undirected.
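A minimal sketch of such a graph as a Python data structure (the junction names are invented):

```python
# Junctions are vertices; roads are directed edges. A two-way street is
# stored as an edge in each direction, a one-way street as a single edge.
road_network = {
    "A": {"B"},        # one-way: you can drive from A to B but not back
    "B": {"C"},        # B -> C and C -> B together form a two-way street
    "C": {"B", "A"},   # C -> A is another one-way street
}

print(road_network["C"])  # junctions reachable directly from C
```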

Why study graph theory?

Graph theory underpins a bunch of things like route finding, timetabling, map colouring, communications routing, sol-gel transitions, ecologies, parsing mathematical expressions and so forth. It’s been a staple of Computer Science undergraduate courses for a while and, more recently, there’s been something of a resurgence in the field, with systems on the web providing huge quantities of graph-shaped data both in terms of the underlying hardware networks and the activities of people – the social networks.

Sometimes the links between graph theory and an application are not so obvious. For example, project planning can be understood in terms of graph theory. A task can depend on another task, the tasks being two vertices in a graph; the edge between such vertices is directed, from one to the other, indicating dependency. To give a trivial example: you need a chicken to lay an egg. As a whole, a graph of tasks cannot contain loops (or cycles), since this would imply that a task depended on a task that could only be completed after it itself had been completed. To return to my example: if you need an egg in order to get a chicken to lay an egg then you’re in trouble! Generally, networks of tasks should be directed acyclic graphs (DAGs), i.e. they should not contain cycles.
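Checking a task graph for cycles is a nice small exercise; here is a sketch using Kahn’s algorithm, with dependencies as (prerequisite, task) pairs and the chicken-and-egg example from above:

```python
from collections import deque

def has_cycle(edges):
    """Kahn's algorithm: repeatedly remove vertices with no incoming
    edges; if any vertex remains, the graph contains a cycle."""
    vertices = {v for edge in edges for v in edge}
    indegree = {v: 0 for v in vertices}
    adjacent = {v: [] for v in vertices}
    for prerequisite, task in edges:
        adjacent[prerequisite].append(task)
        indegree[task] += 1
    queue = deque(v for v in vertices if indegree[v] == 0)
    removed = 0
    while queue:
        v = queue.popleft()
        removed += 1
        for w in adjacent[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)
    return removed < len(vertices)

print(has_cycle([("chicken", "egg")]))                      # False: a valid plan
print(has_cycle([("chicken", "egg"), ("egg", "chicken")]))  # True: you're in trouble
```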

The book’s target audience is 1st or 2nd year undergraduates with a moderate background in mathematics; it was developed for Computer Science undergraduates. The style is quite mathematical but fairly friendly. The author’s intention is to introduce the undergraduate to mathematical formalism, which I found useful, since mathematical symbols are difficult to search for and shorthands such as operator overloading even more so. This said, it is still an undergraduate text rather than a popular account – don’t expect an easy read or pretty illustrations.

The book divides into three chunks. The first provides the basic language for describing graphs, both words and equations. The second part covers theorems arising from some of the basic definitions, including the ideas of “walks” – traversals of a graph which take in all vertices – and “tours”, which take in all edges. This includes long-standing problems such as Dijkstra’s algorithm for route finding and the travelling salesman problem. Also included in this section are “trees” – networks with no cycles – where a cycle is a closed walk which visits each vertex just once.
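As an illustration of the route-finding material, here is a compact sketch of Dijkstra’s algorithm using a priority queue; the weighted graph at the end is invented:

```python
import heapq

def dijkstra(graph, start):
    """Shortest distance from start to every reachable vertex.
    graph maps a vertex to a list of (neighbour, edge_weight) pairs."""
    distances = {start: 0}
    heap = [(0, start)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > distances.get(v, float("inf")):
            continue  # stale queue entry; a shorter route was already found
        for w, weight in graph.get(v, []):
            new_d = d + weight
            if new_d < distances.get(w, float("inf")):
                distances[w] = new_d
                heapq.heappush(heap, (new_d, w))
    return distances

graph = {"A": [("B", 4), ("C", 1)], "C": [("B", 2)], "B": [("D", 5)]}
print(dijkstra(graph, "A"))  # {'A': 0, 'C': 1, 'B': 3, 'D': 8}
```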

The third section covers the analysis of graphs. This starts with metrics for measuring graphs, such as vertex degree distributions, distance statistics and clustering measures. I found this section rather brief, and poorly illustrated. However, it is followed by an introduction to various classes of complex networks, including the original random graphs, small-world networks and scale-free networks. What struck me about complex graphs is that they are each complex in their own way. Random, small-world and scale-free networks are all methods for constructing a network in order to try to represent a known real-world situation. Small-world networks arise from one of Stanley Milgram’s experiments: sending post across the US via social networks. The key feature is that there are clusters of people who know each other, but these clusters are linked by the odd “longer range” contact.
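All three classes can be generated and compared in a few lines with the networkx library (the sizes and parameters below are arbitrary choices):

```python
import networkx as nx

n = 1000  # arbitrary network size

random_g    = nx.erdos_renyi_graph(n, p=0.01)           # classic random graph
small_world = nx.watts_strogatz_graph(n, k=10, p=0.05)  # clusters plus long-range links
scale_free  = nx.barabasi_albert_graph(n, m=5)          # preferential attachment

for name, g in [("random", random_g), ("small-world", small_world),
                ("scale-free", scale_free)]:
    degrees = [d for _, d in g.degree()]
    print(name,
          "mean clustering:", round(nx.average_clustering(g), 3),
          "max degree:", max(degrees))
```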

The book finishes with some real world examples relating to the world wide web, peer-to-peer sharing algorithms and social networks. What struck me about social networks is that the vertices (people!) you identify as important can depend quite sensitively on the metric you use to measure importance.

I picked up Graph Theory after I’d been working with Gephi, wanting to learn more about the things that Gephi will measure for me. It serves that purpose pretty well. In addition I have a better feel for situations where the answer is “graph theory”. Furthermore, Gephi has a bunch of network generators to create random, small-world and scale-free networks so that you can try out what you’ve learned.