SomeBeans

Nov 15 2014

Landing on a comet

By SomeBeans in Science

There was a striking contrast in the office on Friday between the former practicing scientists and the developers, with an open data background, who were bemoaning the slowness with which results were being reported by the ESA team looking after the Rosetta orbiter and the Philae lander.

As I pointed out, many years ago I sat in an instrument hutch at the Rutherford Appleton Laboratory trying to work out what the hell was going on with my experiment downstairs, behind an interlocked door, being flooded with a beam of invisible neutrons. It was possible that I was discovering new and important science. But on the other hand it was also possible that the goniometer third from the left on my sample changer was up to its usual tricks and had failed to move when instructed. We couldn’t tell from the frankly inadequate early nineties video feed. The only way to find out was to wait for the experiment to finish and go and eyeball the damn thing.

Goniometer number 3 had failed, again.

Peter, who did his PhD at CERN, replied – “what he said!”

2 comments

Nov 12 2014

Book review: Seven databases in Seven Weeks by Eric Redmond and Jim R. Wilson

By SomeBeans in Book Reviews

This review was first published at Scraperwiki.

I came to databases a little late in life, as a physical scientist I didn’t have much call for them. Then a few years ago I discovered the wonders of relational databases and the power of SQL. The ScraperWiki platform strongly encourages you to save data to SQLite databases to integrate with its tools.

There is life beyond SQL databases much of it evolved in the last few years. I wanted to learn more and a plea on twitter quickly brought me a recommendation for Seven databases in Seven Weeks by Eric Redmond and Jim R. Wilson.

The book covers the key classes of database starting with relational databases in the form of PostgreSQL. It then goes on to look at six further databases in the so-called NoSQL family – all relatively new compared to venerable relational databases. The six other databases fall into several classes: Riak and Redis are key-value stores, CouchDB and MongoDB are document databases, HBase is a columnar database and Neo4J is a graph database.

Relational databases are characterised by storage schemas involving multiple interlinked tables containing rows and columns, this layout is designed to minimise the repetition of data and to provide maximum query-ability. Key-value stores only store a key and a value in the manner of a dictionary but the “value” may be of a complex type. A value can be returned very fast given a key – this is the core strength of the key-value stores. The document stores MongoDB and CouchDB store JSON “documents” rather than rows. These documents can store information in nested hierarchies which don’t necessarily need to all have the same structure this allows maximum flexibility in the type of data to be stored but at the cost of ease of query.

HBase fits into the Hadoop ecosystem, the language used to describe it looks superficially like that used to describe tables in a relational database but this is a bit misleading. HBase is designed to work with massive quantities of data but not necessarily give the full querying flexibility of SQL. Neo4J is designed to store graph data – collections of nodes and edges and comes with a query language particularly suited to querying (or walking) data so arranged. This seems very similar to triplestores and the SPARQL – used in semantic web technologies.

Relational databases are designed to give you ACID (Atomicity, Consistency, Isolation, Durability), essentially you shouldn’t be able to introduce inconsistent changes to the database and it should always give you the same answer to the same query. The NoSQL databases described here have a subtly different core goal. Most of them are designed to work on the web and address CAP (Consistency, Availability, Partition), indeed several of them offer native REST interfaces over HTTP which means they are very straightforward to integrate into web applications. CAP refers to the ability to return a consistent answer, from any instance of the database, in the face of network (or partition) problems. This assumes that these databases may be stored in multiple locations on the web. A famous theorem contends that you can have any two of Consistency, Availability and Partition resistance at any one time but not all three together.

NoSQL databases are variously designed to scale horizontally and vertically. Horizontal scaling means replicating the same database in multiple places to provide greater capacity to serve requests even with network connectivity problems. Vertically scaling by “sharding” provides the ability to store more data by fragmenting the data such that some items are stored on one server and some on another.

I’m not a SQL expert by any means but it’s telling that I learnt a huge amount about PostgreSQL in the forty or so pages on the database. I think this is because the focus was not on the SQL query language but rather on the infrastructure that PostgreSQL provides. For example, it discusses triggers, rules, plugins and specialised indexing for text search. I assume this style of coverage applies to the other databases. This book is not about the nitty-gritty of querying particular database types but rather about the different database systems.

The NoSQL databases generally support MapReduce style queries this is a scheme most closely associated with Big Data and the Hadoop ecosystem but in this instance it is more a framework for doing queries which maybe executed across a cluster of computers.

I’m on a bit of a graph theory binge at the moment so Neo4J was the most interesting to me.

As an older data scientist I have a certain fondness for things that have been around for a while, like FORTRAN and SQL databases, I’ve looked with some disdain at these newfangled NoSQL things. To a degree this book has converted me, at least to the point where I look at ScraperWiki projects and think – “It might be better to use a * database for this piece of work”.

This is an excellent book which was pitched at just the right level for my purposes, I’ll be looking for more Pragmatic Programmers books in future.

data science, scraperwiki

Oct 28 2014

Exploring the ONS

By SomeBeans in Technology

This post was first published at ScraperWiki.

The Office for National Statistics (ONS) is the United Kingdom statistical body charged by the government with the task of collecting and publishing statistics related to the economy, population and society of England and Wales at national, regional and local levels. The data is typically published in the form of Excel spreadsheets.

The ONS is working on opening up their data, and making it more accessible to users. We’ve been doing a bit of work to help with that. This is typical of a number of jobs we have done. A customer has a website containing content which they want to move/process/republish elsewhere. The current website might have been built by aggregation over a number of years, and the underlying structure of the Content Management System may not be available to them. In these circumstances making a survey of the pre-existing content is an obvious first step.

The index for the ONS reference tables and datasets can be found here. Each dataset has a title, a release date, and the type of dataset. There is also a URL to the dataset inside the title field there is an indication of the size of the file. We wrote a simple scraper to collect these pieces of information.

First up, we’ll looking at the topics of the data released. There are a couple of routes into discovering these, one is to read the titles, this is OK as an approach but the titles are quite wordy and sometimes it isn’t clear what they refer to. An alternative, in this case, is to look at the URLs to the documents.They look something like this:

http://www.ons.gov.uk/ons/rel/lms/labour-market-statistics/november-2013/table-unem03.xls

This can be quite revealing since even if the website is not explicit about its structure the URLs can reveal the structure the builder used. We process the URLs by splitting them at the backslashes. The first part http://www.ons.gov.uk/ons/rel is common to all the URLs. Subsequent parts we can use to define a hierarchy. In this case we will focus on the fourth part of the hierarchy – “labour-market-statistics” in this instance, this gives us a human readable description of a topic. There are approximately 400 topics as defined by this metric as opposed to 90 or so defined by the third level of the hierarchy. Using the fourth level of the hierarchy key areas of the website by numbers of documents are:

Labour market statistics
National population projections
Family spending
Subnational labour market statistics
Census
Annual survey of hours and earnings.

We can visualise this as a treemap, here I am simply showing the top 20 areas by number of documents:

These 20 topics cover approximately the two thirds of the total number of documents.

We can identify file types using the file extension in the URL, this approach needs to be used a little cautiously since sometimes the extension doesn’t match the file type. Most of the files are Excel spreadsheets although there are a few CSV and zip files, the zip files containing Excel spreadsheets. CSV appears to have been used for some of the older datasets. Most of the files are pretty small, less than 290kb but there are a few up to much larger sizes.

Finally we can look at the release dates for the datasets. There are datasets from as far back as 1988, in fact the data set released in 1988 actually refers to data from 1984. There are some data released regularly from about 2001 but from 2011 a wider range of data has been released on a regular basis. We can see the monthly pattern of data releases in this timeline for 2014 which is restricted to the top 20 topics identified above:

This shows the big releases of labour market statistics, both national and regional, on the third Wednesday of each month. Other monthly releases include retail sales and producer prices data. And every week provisional figures on the registration of deaths in England and Wales are reported.

You can explore these data yourself using the Tableau workbook here.

The actual content of these spreadsheets is another story.

This survey approach to a website is handy for a range of applications, and the techniques used are quite general. We’ve used similar approaches to understand government and newspaper websites.

data science, scraperwiki

Oct 27 2014

Asus T100 Transformer

By SomeBeans in Technology

I’ve gone and bought another toy!

The Asus T100 Transformer is a full Windows 8.1 machine in a 10" form factor which will "transform" from a dinky notebook format to a freestanding tablet – all of the gubbins are in the display. I paid £309 for the 32GB 2014 model which has a slightly more powerful processor than the 2013 model.

The T100 really is a proper Windows 8.1 machine, only tiny. It includes Microsoft Office which works just as you would expect, and I installed Python(x,y) which is a moderate size install which I’d expect to fail on a system which wasn’t genuine, full Windows. I’ve also installed Picasa, my favourite photo collection software and that just works too.

The performance is pretty good for such a small package, things got a bit laggy when I ran a 1920×1080 display over the mini-HDMI port but not unusably so and that seemed to be more a display drive problem than a processor problem.

The modern OS experience differs from what went before, I used my Microsoft ID when setting up, and as if by magic my personal settings appeared on the T100 – including my familiar desktop wallpaper and the few apps I installed from the Windows app store. The same goes for Google Chrome – my default browser – once it knows who you are all your settings appear as if by magic.

I wrote a while back when I got my Sony Vaio that it seemed like Windows 8 was designed for the tablet form factor. And it sort of is. But I have the same feeling moving from my (Android) Nexus 7 to a Windows 8 tablet as I do when I move from a Windows 8 machine to a Mac. The new place is all very nice, and I’m sure I’d settle in eventually but it’s not the same. Windows 8 is still trying to be a desktop OS and a touch OS, and that just doesn’t work very well.

The T100 hardware is OK, the display looks fine and the latch/unlatch mechanism feels sturdy but the keyboard is a bit rattly. I would have liked to have had a more prominent "Windows” button on the display part in the style of an Android tablet. As it is there are three anonymous buttons on the display whose functionality I forget. Attaching and removing the keyboard kept me amused for a good half hour, the mechanism is reassuringly sturdy.

For me this form factor doesn’t really fit. I have a Nexus 7 tablet, which is lighter than the T100 for reading Kindle books on, Chromecasting to the TV (which I can’t do on the T100), browsing the internet or catching up on twitter. I have a Sony Vaio T13 ultrabook which is more useable as a laptop with it’s 13” display but is only a bit heavier. I’ve discovered I don’t need something of intermediate size!

Interestingly I have noted that I hold my 4 inch Nexus phone and 7 inch Nexus tablet at a distance such that their displays seem the same size, to match this feat with the T100 I would need arms like a gibbon!

I can see the T100 working as a travel system for someone with a chunky laptop or desktop, or as a tablet. It’s nice to have a backup machine for work and home.

I’m intrigued by the idea of installing Ubuntu on this machine, I have it in a virtual machine on my Sony Vaio, the process is described here but it’s a bit fiddly launching Windows and then the VM and the performance isn’t great. I find extensive instructions for installing Ubuntu on the T100 here, they look lengthy!

In summary, impressive to get a Windows laptop in such a small form factor and for such a reasonable price but it doesn’t really fit with my current devices except as a backup.

Oct 23 2014

Book review: Pompeii by Mary Beard

By SomeBeans in Book Reviews

For a change I have been reading about Roman history, in the form of Pompeii: The Life of a Roman Town by Mary Beard.

Mary Beard is a Cambridge classicist. I think it helps having seen her on TV, jabbing her figure at a piece of Roman graffiti, explaining what it meant and why it was important with obvious enthusiasm. For me it gave the book a personality.

I imagine I am not unusual in gaining my knowledge of Roman culture via some poorly remembered caricature presented in pre-16 history classes at school and films including the Life of Brian, Gladiator and Up Pompeii.

Pompeii is an ancient Italian town which was covered in a 4-6 metre blanket of ash by an eruption of nearby Vesuvius in 79 AD. Beneath the ash the town was relatively undamaged. It was rediscovered in 1599 but excavations only started in the mid 18th century. These revealed a very well-preserved town including much structure, artwork and the remains of the residents. The bodies of the fallen left voids in the ash which were reconstructed by filling them with plaster.

The book starts with a salutatory reminder that Pompeii wasn’t a town frozen in normal times but one in extremis as it succumbed to a volcanic eruption. We can’t assume that the groups of bodies found or the placement of artefacts represent how they might have been found in normal daily life.

There are chapters on the history of the city, the streets, homes, painting, occupations, administration, various bodily pleasures (food, wine, sex and bathing), entertainment (theatre and gladiators) and temples.

I’ve tended to think of the Roman’s as a homogeneous blob who occupied a chunk of time and space. But this isn’t the case, the pre-Roman history of the town features writing in the Oscan language. The Greek writer Strabo, working in the first century BC wrote about a sequence of inhabitants: Oscans, Etruscans, Pelasgians and then Samnites – who also spoke Oscan.

Much of what we know of Pompeii seems to stem from the graffiti found all about the remains. It would be nice to learn a bit more about this evidence since it seems important, and clearly something different is going on from what we find in modern homes and cities. If I look around homes I know today then none feature graffiti, granted there is much writing on paper but not on the walls.

From the depths of my memory I recall the naming of various rooms in the Roman bath house but it turns out these names may not have been in common usage amongst the Romans. Furthermore, the regimented progression from hottest to coldest bath may also be somewhat fanciful. Something I also didn’t appreciate was that the meanings of some words in ancient Latin are not known, or are uncertain. It’s obvious in retrospect that this might be the case but caveats on such things are rarely heard.

Beard emphasises that there has been a degree of “over-assumption” in the characterisation of the various buildings in Pompeii. For instance on some reckonings there are huge numbers of bars and brothels. So for instance, anything with a counter and some storage jars gets labelled a bar. Anything with phallic imagery gets labelled a brothel, the Pompeiian’s were very fond of phallic imagery. A more conservative treatment brings these numbers down enormously.

I am still mystified by the garum, the fermented fish sauce apparently loved by many, it features moderately in the book since the house of a local manufacturer is one of the better preserved ones, and one which features very explicit links to his trade. It sounds absolutely repulsive.

The degree of preservation in Pompeii is impressive, the scene that struck me most vividly was in The House of Painters at Work. In this case the modern label for the house describes exactly what was going on, other houses are labelled with the names of dignitaries present when a house was uncovered, or after key objects found in the house. It is not known what the inhabitants called the houses, or even the streets. Deliveries seemed to go by proximity to prominent buildings.

I enjoyed Pompeii, the style is readable and it goes to some trouble to explain the uncertainty and subtlety in interpreting ancient remains.

Once again I regret buying a non-fiction book in ebook form, the book has many illustrations including a set of colour plates and I still find it clumsy looking at them in more detail or flicking backwards and forwards in an ereader.

I've worked as a scientist for the last 30 years, at various universities, a large home and personal care company, a startup in Liverpool called The Sensible Code Company (formerly ScraperWiki Ltd), GBG and now as a consultant in data science.

I write about:
* the books I have read, typically science and history (or both), partly as a reminder to myself and partly as a review;
* science, things I have done or things I find interesting;
* technology, programming and gadgets;
politics, and current affairs;
* ...and other stuff as it takes my fancy - holidays, photographs and things I want to remember.

Landing on a comet

Book review: Seven databases in Seven Weeks by Eric Redmond and Jim R. Wilson

Exploring the ONS

Asus T100 Transformer

Book review: Pompeii by Mary Beard

About

Recent Posts

Categories

Blog Archive

Goodreads

Gardening

History

Politics

Science

Writers

Landing on a comet

Book review: Seven databases in Seven Weeks by Eric Redmond and Jim R. Wilson

Exploring the ONS

Asus T100 Transformer

Book review: Pompeii by Mary Beard

About

Recent Posts

Tags

Categories

Blog Archive

Goodreads

Gardening

History

Politics

Science

Writers