Category: Technology

Programming, gadgets (reviews thereof) and computers

NewsReader – the developers story

newsreader

This post was first published at ScraperWiki.

ScraperWiki has been a partner in NewsReader, an EU Framework 7 research project, for the last couple of years. The aim of NewsReader is to give computers the power to “understand” the news; to extract from a myriad of news articles the underlying events which gave rise to those articles; the who, the where, the why and the what of those events. The project is comprised of academic researchers specialising in computational linguistics (VUA in Amsterdam, EHU in the Basque Country and FBK in Trento), Lexis Nexis – a major news aggregator, and a couple of small technology companies: ourselves at ScraperWiki and SynerScope – a Dutch startup specialising in the visualisation of complex networks.

Our role at ScraperWiki is in providing mechanisms to enable developers to exploit the NewsReader technology, and to feed news into the system. As part of this work we have developed a simple REST API which gives access to the KnowledgeStore, the system which underpins NewsReader. The native query language of the KnowledgeStore is SPARQL – the query language of the semantic web. The Simple API provides a set of predefined queries which are easier for end users to work with than raw SPARQL, and help us as service managers by providing a predictable set of optimised queries. If you want to know more technical detail then we’ve written a paper about it (here).

The Simple API has seen live action at a Hack Day on World Cup news which we held in London in the summer. Attendees were able to develop a range of applications which probed violence, money and corruption in the realm of the World Cup. I blogged about our previous Hack Day here and here. The Simple API, and the Hack Day helped us shake out some bugs and add features which will make it even better next time.

“Next time” is another Hack Day to be held in the Amsterdam on 21st January 2015, and London on the 30th January 2015. This time we have processed 6,000,000 articles relating to the car industry over the period 2005-2014. The motor industry is a trillion dollar a year business, so we can anticipate finding lots of valuable information in this horde.

From our previous experience the three things that NewsReader excels at are:

  1. Finding networks of interactions, identifying important players. For the World Cup Hack Day we at ScraperWiki were handicapped slightly by having no interest in football! But the NewsReader technology enabled us to quickly identify that “Sepp Blatter”, “Jack Warner” and “Mohammed bin Hammam” were important in world football. This is illustrated in this slightly cryptic visualisation made using Gephi:beckham_and_blatter
  2. Finding events of a particular type. the NewsReader technology carries out semantic role labeling: taking sentences and identifying what type of event is described in that sentence and what roles the participants took. This information is then aggregated and exposed using semantic web technology. In the World Cup Hack Day participants used this functionality to identify events involving violence, bribery, gambling, and other financial transactions;
  3. Establishing timelines. In the World Cup data we could track the events involving “Mohammed bin Hammam” through time and the type of events he was involved in. This enabled us to quickly navigate to pertinent news articles.Timeline

You can see fragments of code used to extract these data using the Simple API in these GitHub Gists (here and here), and dynamic visualisations illustrating these three features here and here.

The Simple API is up and running already, you can find it (here). It is self-documenting, simply visit the root URL and you’ll see query examples with optional and compulsory parameters. Be aware though: the Simple API is under active development, and the underlying data in the KnowledgeStore is being optimised for the Hack Days so it may not be available when you visit.

If you want to join our automotive Hack Day then you can sign up for the Amsterdam event (here) and the London event (here).

Exploring the ONS

This post was first published at ScraperWiki.

The Office for National Statistics (ONS) is the United Kingdom statistical body charged by the government with the task of collecting and publishing  statistics related to the economy, population and society of England and Wales at national, regional and local levels. The data is typically published in the form of Excel spreadsheets.

The ONS is working on opening up their data, and making it more accessible to users. We’ve been doing a bit of work to help with that. This is typical of a number of jobs we have done. A customer has a website containing content which they want to move/process/republish elsewhere. The current website might have been built by aggregation over a number of years, and the underlying structure of the Content Management System may not be available to them. In these circumstances making a survey of the pre-existing content is an obvious first step.

The index for the ONS reference tables and datasets can be found here. Each dataset has a title, a release date, and the type of dataset. There is also a URL to the dataset inside the title field there is an indication of the size of the file. We wrote a simple scraper to collect these pieces of information.

First up, we’ll looking at the topics of the data released. There are a couple of routes into discovering these, one is to read the titles, this is OK as an approach but the titles are quite wordy and sometimes it isn’t clear what they refer to. An alternative, in this case, is to look at the URLs to the documents.They look something like this:

http://www.ons.gov.uk/ons/rel/lms/labour-market-statistics/november-2013/table-unem03.xls

This can be quite revealing since even if the website is not explicit about its structure the URLs can reveal the structure the builder used. We process the URLs by splitting them at the backslashes. The first part http://www.ons.gov.uk/ons/rel is common to all the URLs. Subsequent parts we can use to define a hierarchy. In this case we will focus on the fourth part of the hierarchy – “labour-market-statistics” in this instance, this gives us a human readable description of a topic. There are approximately 400 topics as defined by this metric as opposed to 90 or so defined by the third level of the hierarchy. Using the fourth level of the hierarchy key areas of the website by numbers of documents are:

  • Labour market statistics
  • National population projections
  • Family spending
  • Subnational labour market statistics
  • Census
  • Annual survey of hours and earnings.

We can visualise this as a treemap, here I am simply showing the top 20 areas by number of documents:

ONS-treemap-(simple)-v2

These 20 topics cover approximately the two thirds of the total number of documents.

We can identify file types using the file extension in the URL, this approach needs to be used a little cautiously since sometimes the extension doesn’t match the file type. Most of the files are Excel spreadsheets although there are a few CSV and zip files, the zip files containing Excel spreadsheets. CSV appears to have been used for some of the older datasets. Most of the files are pretty small, less than 290kb but there are a few up to much larger sizes.

Finally we can look at the release dates for the datasets. There are datasets from as far back as 1988, in fact the data set released in 1988 actually refers to data from 1984. There are some data released regularly from about 2001 but from 2011 a wider range of data has been released on a regular basis. We can see the monthly pattern of data releases in this timeline for 2014 which is restricted to the top 20 topics identified above:

Timeline-(detail)-v2

This shows the big releases of labour market statistics, both national and regional, on the third Wednesday of each month. Other monthly releases include retail sales and producer prices data. And every week provisional figures on the registration of deaths in England and Wales are reported.

You can explore these data yourself using the Tableau workbook here.

The actual content of these spreadsheets is another story.

This survey approach to a website is handy for a range of applications, and the techniques used are quite general. We’ve used similar approaches to understand government and newspaper websites.

Asus T100 Transformer

t100_edition_10sI’ve gone and bought another toy!

The Asus T100 Transformer is a full Windows 8.1 machine in a 10" form factor which will "transform" from a dinky notebook format to a freestanding tablet – all of the gubbins are in the display. I paid £309 for the 32GB 2014 model which has a slightly more powerful processor than the 2013 model.

The T100 really is a proper Windows 8.1 machine, only tiny. It includes Microsoft Office which works just as you would expect, and I installed Python(x,y) which is a moderate size install which I’d expect to fail on a system which wasn’t genuine, full Windows. I’ve also installed Picasa, my favourite photo collection software and that just works too.

The performance is pretty good for such a small package, things got a bit laggy when I ran a 1920×1080 display over the mini-HDMI port but not unusably so and that seemed to be more a display drive problem than a processor problem.

The modern OS experience differs from what went before, I used my Microsoft ID when setting up, and as if by magic my personal settings appeared on the T100 – including my familiar desktop wallpaper and the few apps I installed from the Windows app store. The same goes for Google Chrome – my default browser – once it knows who you are all your settings appear as if by magic.

I wrote a while back when I got my Sony Vaio that it seemed like Windows 8 was designed for the tablet form factor. And it sort of is. But I have the same feeling moving from my (Android) Nexus 7 to a Windows 8 tablet as I do when I move from a Windows 8 machine to a Mac. The new place is all very nice, and I’m sure I’d settle in eventually but it’s not the same. Windows 8 is still trying to be a desktop OS and a touch OS, and that just doesn’t work very well.

The T100 hardware is OK, the display looks fine and the latch/unlatch mechanism feels sturdy but the keyboard is a bit rattly. I would have liked to have had a more prominent "Windows” button on the display part in the style of an Android tablet. As it is there are three anonymous buttons on the display whose functionality I forget. Attaching and removing the keyboard kept me amused for a good half hour, the mechanism is reassuringly sturdy.

For me this form factor doesn’t really fit. I have a Nexus 7 tablet, which is lighter than the T100 for reading Kindle books on, Chromecasting to the TV (which I can’t do on the T100), browsing the internet or catching up on twitter. I have a Sony Vaio T13 ultrabook which is more useable as a laptop with it’s 13” display but is only a bit heavier. I’ve discovered I don’t need something of intermediate size!

Interestingly I have noted that I hold my 4 inch Nexus phone and 7 inch Nexus tablet at a distance such that their displays seem the same size, to match this feat with the T100 I would need arms like a gibbon! 

I can see the T100 working as a travel system for someone with a chunky laptop or desktop, or as a tablet. It’s nice to have a backup machine for work and home.

I’m intrigued by the idea of installing Ubuntu on this machine, I have it in a virtual machine on my Sony Vaio, the process is described here but it’s a bit fiddly launching Windows and then the VM and the performance isn’t great. I find extensive instructions for installing Ubuntu on the T100 here, they look lengthy!

In summary, impressive to get a Windows laptop in such a small form factor and for such a reasonable price but it doesn’t really fit with my current devices except as a backup.

Of Matlab and Python

I’ve been a scientist and data analyst for nearly 25 years. Originally as an academic physicist, then as a research scientist in a large fast moving consumer goods company and now at a small technology company in Liverpool. In common to many scientists of my age I came to programming in the early eighties when a whole variety of home computers briefly flourished. My first formal training in programming was FORTRAN after which I have made my own way.

I came to Matlab in the late nineties, frustrated by the complexities of producing a smooth workflow with FORTRAN involving interaction, analysis and graphical output.

Matlab is widely used in academic circles and a number of industries because it provides a great deal of analytical power in a user-friendly environment. Its notation for handling matrix (array) calculations is slick. Its functionality is extended by a range of toolboxes, and there is a community of scientists sharing new functionality. It shares this feature set with systems such as IDL and PV-WAVE.

However, there are a number of issues with Matlab:

  • as a programming language it has the air of new things being botched onto a creaking frame. Support for unit testing is an afterthought, there is some integration of source control into the Matlab environment but it is with Source Safe. It doesn’t support namespaces. It doesn’t support common data structures such as dictionaries, lists and sets.
  • The toolbox ecosystem is heavily focused on scientific applications, generally in the physical sciences. So there is no support for natural language processing, for example, or building a web application based on the powerful analysis you can do elsewhere in the ecosystem;
  • the licensing is a nightmare. Once you’ve got core Matlab additional toolboxes containing really useful functionality (statistics, database connections, a “compiler”) are all at an additional cost. You can investigate pricing here. In my experience you often find yourself needing a toolbox for just a couple of functions. For an academic things are a bit rosier, universities get lower price licenses and the process by which this is achieved is opaque to end-users. As an industrial user, involved in the licensing process, it is as bad as line management and sticking needles in your eyes in the “not much fun thing to do” stakes;
  • running Matlab with network licenses means that your code may stop running part way through because you’ve made a call to a function to which you can’t currently get the license. It is difficult to describe the level of frustration and rage this brings. Now of course one answer is to buy individual licenses for all, or at least a significant surplus of network licenses. But tell that to the budget holder particularly when you wanted to run the analysis today. The alternative is to find one of the license holders of the required toolbox and discover if they are actually using it or whether they’ve gone off for a three hour meeting leaving Matlab open;
  • deployment to users who do not have Matlab is painful. They need to download a more than 500MB runtime, of exactly the right version and the likelihood is they will be installing it just for your code;

I started programming in Python at much the same time as I started on Matlab. At the time I scarcely used it for analysis but even then when I wanted to parse the HTML table of contents for Physical Review E, Python was the obvious choice. I have written scrapers in Matlab but it involved interfering with the Java underpinnings of the language.

Python has matured since my early use. It now has a really great system of libraries which can be installed pretty much trivially, they extend far beyond those offered by Matlab. And in my view they are of very good quality. Innovation like IPython notebooks take the Matlab interactive style of analysis and extend it to be natively web-based. If you want a great example of this, take a look at the examples provided by Matthew Russell for his book, Mining the Social Web.

Python is a modern language undergoing slow, considered improvement. That’s to say it doesn’t carry a legacy stretching back decades and changes are small, and directed towards providing a more consistent language. Its used by many software developers who provide a source of help, support and an impetus for an decent infrastructure.

Ubuntu users will find Python pre-installed. For Windows users, such as myself, there are a number of distributions which bundle up a whole bunch of libraries useful for scientists and sometimes an IDE. I like python(x,y). New libraries can generally be installed almost trivially using the pip package management system. I actually use Python in Ubuntu and Windows almost equally often. There are a small number of libraries which are a bit more tricky to install in Windows – experienced users turn to Christoph Gohlke’s fantastic collection of precompiled binaries.

In summary, Matlab brought much to data analysis for scientists but its time is past. An analysis environment built around Python brings wider functionality, a better coding infrastructure and freedom from licensing hell.

Inordinately fond of beetles… reloaded!

sciencemuseum_logo

This post was first published at ScraperWiki.

Some time ago, in the era before I joined ScraperWiki I had a play with the Science Museums object catalogue. You can see my previous blog post here. It was at a time when I was relatively inexperienced with the Python programming language and had no access to Tableau, the visualisation software. It’s a piece of work I like to talk about when meeting customers since it’s interesting and I don’t need to worry about commercial confidentiality.

The title comes from a quote by J.B.S. Haldane, who was asked what his studies in biology had told him about the Creator. His response was that, if He existed then he was “inordinately fond of beetles”.

The Science Museum catalogue comprises three CSV files containing information on objects, media and events. I’m going to focus on the object catalogue since it’s the biggest one by a large margin – 255,000 objects in a 137MB file. Each object has an ID number which often encodes the year in which the object was added to the collection; a title, some description, it often has an “item name” which is a description of the type of object, there is sometimes information on the date made, the maker, measurements and whether it represents part or all of an object. Finally, the objects are labelled according to which collection they come from and which broad group in that collection, the catalogue contains objects from the Science Museum, Nation Railway Museum and National Media Museum collections.

The problem with most of these fields is that they don’t appear to come from a controlled vocabulary.

Dusting off my 3 year old code I was pleased to discover that the SQL I had written to upload the CSV files into a database worked almost first time, bar a little character encoding. The Python code I’d used to clean the data, do some geocoding, analysis and visualisation was not in such a happy state. Or rather, having looked at it I was not in such a happy state. I appeared to have paid no attention to PEP-8, the Python style guide, no source control, no testing and I was clearly confused as to how to save a dictionary (I pickled it).

In the first iteration I eyeballed the data as a table and identified a whole bunch of stuff I thought I needed to tidy up. This time around I loaded everything into Tableau and visualised everything I could – typically as bar charts. This revealed that my previous clean up efforts were probably not necessary since the things I was tidying impacted a relatively small number of items. I needed to repeat the geocoding I had done. I used geocoding to clean up the place of manufacture field, which was encoded inconsistently. Using the Google API via a Python library I could normalise the place names and get their locations as latitude – longitude pairs to plot on a map. I also made sure I had a link back to the original place name description.

The first time around I was excited to discover the Many Eyes implementation of bubble charts, this time I now realise bubble charts are not so useful. As you can see below in these charts showing the number of items in each subgroup. In a sorted bar chart it is very obvious which subgroup is most common and the relative sizes of the subgroup. I’ve coloured the bars by the major collection to which they belong. Red is the Science Museum, Green is the National Rail Museum and Orange is the National Media Museum.

image

Less discerning members of ScraperWiki still liked the bubble charts.

image

We can see what’s in all these collections from the item name field. This is where we discover that the Science Museum is inordinately fond of bottles. The most common items in the collection are posters, mainly from the National Rail Museum but after that there are bottles, specimen bottles, specimen jars, shops rounds (also bottles), bottle, drug jars, and albarellos (also bottles). This is no doubt because bottles are typically made of durable materials like glass and ceramics, and they have been ubiquitous in many milieu, and they may contain many and various interesting things.

image

Finally I plotted the place made for objects in the collection, this works by grouping objects by location and then finding latitude and longitude for those group location. I then plot a disk sized by the number of items originating at that location. I filtered out items whose place made was simply “England” or “London” since these made enormous blobs that dominated the map.

 

image

 

You can see a live version of these visualisation, and more on Tableau Public.

It’s an interesting pattern that my first action on uploading any data like this to Tableau is to do bar chart frequency plots for each column in the data, this could probably be automated.

In summary, the Science Museum is full of bottles and posters, Tableau wins for initial visualisations of a large and complex dataset.