Dr Administrator

Author's posts

NewsReader – the developers story

newsreader

This post was first published at ScraperWiki.

ScraperWiki has been a partner in NewsReader, an EU Framework 7 research project, for the last couple of years. The aim of NewsReader is to give computers the power to “understand” the news; to extract from a myriad of news articles the underlying events which gave rise to those articles; the who, the where, the why and the what of those events. The project is comprised of academic researchers specialising in computational linguistics (VUA in Amsterdam, EHU in the Basque Country and FBK in Trento), Lexis Nexis – a major news aggregator, and a couple of small technology companies: ourselves at ScraperWiki and SynerScope – a Dutch startup specialising in the visualisation of complex networks.

Our role at ScraperWiki is in providing mechanisms to enable developers to exploit the NewsReader technology, and to feed news into the system. As part of this work we have developed a simple REST API which gives access to the KnowledgeStore, the system which underpins NewsReader. The native query language of the KnowledgeStore is SPARQL – the query language of the semantic web. The Simple API provides a set of predefined queries which are easier for end users to work with than raw SPARQL, and help us as service managers by providing a predictable set of optimised queries. If you want to know more technical detail then we’ve written a paper about it (here).

The Simple API has seen live action at a Hack Day on World Cup news which we held in London in the summer. Attendees were able to develop a range of applications which probed violence, money and corruption in the realm of the World Cup. I blogged about our previous Hack Day here and here. The Simple API, and the Hack Day helped us shake out some bugs and add features which will make it even better next time.

“Next time” is another Hack Day to be held in the Amsterdam on 21st January 2015, and London on the 30th January 2015. This time we have processed 6,000,000 articles relating to the car industry over the period 2005-2014. The motor industry is a trillion dollar a year business, so we can anticipate finding lots of valuable information in this horde.

From our previous experience the three things that NewsReader excels at are:

  1. Finding networks of interactions, identifying important players. For the World Cup Hack Day we at ScraperWiki were handicapped slightly by having no interest in football! But the NewsReader technology enabled us to quickly identify that “Sepp Blatter”, “Jack Warner” and “Mohammed bin Hammam” were important in world football. This is illustrated in this slightly cryptic visualisation made using Gephi:beckham_and_blatter
  2. Finding events of a particular type. the NewsReader technology carries out semantic role labeling: taking sentences and identifying what type of event is described in that sentence and what roles the participants took. This information is then aggregated and exposed using semantic web technology. In the World Cup Hack Day participants used this functionality to identify events involving violence, bribery, gambling, and other financial transactions;
  3. Establishing timelines. In the World Cup data we could track the events involving “Mohammed bin Hammam” through time and the type of events he was involved in. This enabled us to quickly navigate to pertinent news articles.Timeline

You can see fragments of code used to extract these data using the Simple API in these GitHub Gists (here and here), and dynamic visualisations illustrating these three features here and here.

The Simple API is up and running already, you can find it (here). It is self-documenting, simply visit the root URL and you’ll see query examples with optional and compulsory parameters. Be aware though: the Simple API is under active development, and the underlying data in the KnowledgeStore is being optimised for the Hack Days so it may not be available when you visit.

If you want to join our automotive Hack Day then you can sign up for the Amsterdam event (here) and the London event (here).

Book review: Linked by Albert-László Barabási

This review was first posted at ScraperWiki.
linkedI am on a bit of a graph theory binge, it started with an attempt to learn about Gephi, the graph visualisation software, which developed into reading a proper grown up book on graph theory. I then learnt a little more about practicalities on reading Seven Databases in Seven Weeks, which included a section on Neo4J – a graph database. Now I move on to Linked by Albert-László Barabási, this is a popular account of the rise of the analysis of complex networks in the late nineties. A short subtitle used on earlier editions was “The New Science of Networks”. The rather lengthy subtitle on this edition is “How Everything Is Connected to Everything Else and What It Means for Business, Science, and Everyday Life”.

In mathematical terms a graph is an abstract collection of nodes linked by edges. My social network is a graph comprised of people, the nodes, and their interactions such as friendships, which are the edges. The internet is a graph, comprising routers at the nodes and the links between them are edges. “Network” is a less formal term often used synonymously with graph, “complex” is more a matter of taste but it implies large and with a structure which cannot be trivially described i.e. each node has four edges is not a complex network.
The models used for the complex networks discussed in this book are the descendants of the random networks first constructed by Erdős and Rényi. They imagined a simple scheme whereby nodes in a network were randomly connected with some fixed probability. This generates a particular type of random network which do not replicate real-world networks such as social networks or the internet. The innovations introduced by Barabási and others are in the measurement of real world networks and new methods of construction which produce small-world and scale-free network models. Small-world networks are characterised by clusters of tightly interconnected nodes with a few links between those clusters, they describe social networks. Scale-free networks contain nodes with any number of connections but where nodes with larger numbers of connections are less common than those with a small number. For example on the web there are many web pages (nodes) with a few links (edges) but there exist some web pages with thousands and thousands of links, and all values in between.
I’ve long been aware of Barabási’s work, dating back to my time as an academic where I worked in the area of soft condensed matter. The study of complex networks was becoming a thing at the time, and of all the areas of physics soft condensed matter was closest to it. Barabási’s work was one of the sparks that set the area going. The connection with physics is around so-called power laws which are found in a wide range of physical systems. The networks that Barabási is so interested in show power law behaviour in the number of connections a node has. This has implications for a wide range of properties of the system such as robustness to the removal of nodes, transport properties and so forth. The book starts with some historical vignettes on the origins of graph theory, with Euler and the bridges of Königsberg problem. It then goes on to discuss various complex networks with some coverage of the origins of their study and the work that Barabási has done in the area. As such it is a pretty personal review. Barabási also recounts some of the history of “six degrees of separation”, the idea that everyone is linked to everyone else by only six links. This idea had its traceable origins back in the early years of the 20th century in Budapest.
Graph theory has been around for a long while, and the study of random networks for 50 years or so. Why the sudden surge in interest? It boils down to a couple of factors, the first is the internet which provides a complex network of physical connections on which a further complex network of connections sit in the form of the web. The graph structure of this infrastructure is relatively easy to explore using automatic tools, you can build a map of millions of nodes with relative ease compared to networks in the “real” world. Furthermore, this complex network infrastructure and the rise of automated experiments has improved our ability to explore and disseminate information on physical networks. For example, the network of chemical interactions in a cell, the network of actors in movies, our social interactions, the spread of disease and so forth. In the past getting such detailed information on large networks was tiresome and the distribution mechanisms for such data slow and inconvenient.
For a book written a few short years ago, Linked can feel strangely dated. It discusses Apple’s failure in the handheld computing market with the Newton palm top device, and the success of Palm with their subsequent range. Names of long forgotten internet companies float by, although even at the time of writing Google was beginning its dominance.
If you are new to graph theory and want an unchallenging introduction then Linked is a good place to start. It’s readable and has a whole load of interesting examples of scale free networks in the wild. Whilst not the whole of graph theory, this is where interesting new things are happening.

Feminism

A couple of days ago everyone on twitter (and off) was very excited: ESA landed a probe called Philae on a comet shaped like a duck. I was going to write about the appropriateness or otherwise of ESA project scientist  Matt Taylor’s shirt – it featured quite a few scantily-clad ladies.

On the face of it the story should have been: man wears offensive shirt on TV, people point out that it’s offensive, man removes shirt, man apologises. That is what happened, Matt Taylor seems a nice enough chap who made a mistake which he rectified and gave, what I’d consider, a proper apology.

The moment has passed, better writers than me have written a lot about the incident, but it has highlighted a theme.

The women that said the shirt was offensive received a torrent of abuse, including threats of sexual violence, and the men who did exactly the same thing didn’t. Friends on twitter did experiments where one (male) tweeted exactly the same thing as his (female) partner and got completely different responses: immediate abuse, continuing over 48 hours in the case of the woman, very little for the man. I’ve been moderately vocal and received pretty much nothing in terms of abuse, certainly in the first instance. It’s all very well saying that people should report abuse then move on, or that the threats are empty. But overwhelmingly it is women being threatened, not men. Twitter’s reporting mechanisms are restrictive and they appear unconcerned. And a threat is empty until it isn’t, and then it’s too late.

Several women I know simply don’t comment on “contentious” issues online because they know what response they’ll get. And this happens again and again and again and again and again and again.

Over the last few years on twitter I’ve come to realise that women lead different lives to me, they experience a whole bunch of things that I’ve never even contemplated as a risk. Since I joined I’ve learnt of:

  • the woman stuck in a pub toilet with men outside threatening to gang rape her;
  • the woman who cycles to work in London who gets groped and catcalled on a regular basis;
  • the women who never finished their PhDs because their male supervisors considered them to be sexual prey;
  • the women in science communication who were never quite sure whether whether they were published on merit or because the editor of that website had designs on them;
  • the women who don’t go on scientific field trips because basically they are too dangerous;
  • the women at conferences who think carefully about getting into a lift alone with a man;
  • the woman that won’t walk along the canal towpath in broad daylight;
  • the woman who wants her named removed from a football ground if they re-employ a convicted and unrepentant rapist and gets rape threats in return;
  • the woman who was sexually assaulted on a train;
  • the women who said it would be nice to have women on banknotes, and were threatened with rape;
  • the woman who supported immigration on Question Time and received abuse, and a bomb threat;
  • the woman who was going to give a talk about the portrayal of women in computer games but was cancelled because of the death threats made against her and the audience;
  • the woman who has suffered domestic violence;
  • the women who were groped by a senior party official, who never showed any remorse when uncovered;
  • the women who doesn’t wear headphones in the street;
  • the woman who gets followed on the London Underground;

Some of these are high profile public incidents, others are not but they are all women doing ordinary, unexceptional things. They’re spread over a number of years, and I follow a fair number of people. But nevertheless, regardless of public statistics, they are something that never impinged on me in the past.

I didn’t like the term feminist because it always brought to mind those women that told me everything men did was wrong but now I realise I was wrong. The feminists are the people that speak up and say “That thing you are doing is wrong“, the women in that group are attacked mercilessly in a way the men aren’t. I allowed my impressions of those women to be dominated by their attackers.

Apologies for being so slow on the uptake, I’m trying to do better in future.

Landing on a comet

There was a striking contrast in the office on Friday between the former practicing scientists and the developers, with an open data background, who were bemoaning the slowness with which results were being reported by the ESA team looking after the Rosetta orbiter and the Philae lander.

As I pointed out, many years ago I sat in an instrument hutch at the Rutherford Appleton Laboratory trying to work out what the hell was going on with my experiment downstairs, behind an interlocked door, being flooded with a beam of invisible neutrons. It was possible that I was discovering new and important science. But on the other hand it was also possible that the goniometer third from the left on my sample changer was up to its usual tricks and had failed to move when instructed. We couldn’t tell from the frankly inadequate early nineties video feed. The only way to find out was to wait for the experiment to finish and go and eyeball the damn thing.

Goniometer number 3 had failed, again.

Peter, who did his PhD at CERN, replied – “what he said!”

Book review: Seven databases in Seven Weeks by Eric Redmond and Jim R. Wilson

sevendatabases

This review was first published at Scraperwiki.

I came to databases a little late in life, as a physical scientist I didn’t have much call for them. Then a few years ago I discovered the wonders of relational databases and the power of SQL. The ScraperWiki platform strongly encourages you to save data to SQLite databases to integrate with its tools.

There is life beyond SQL databases much of it evolved in the last few years. I wanted to learn more and a plea on twitter quickly brought me a recommendation for Seven databases in Seven Weeks by Eric Redmond and Jim R. Wilson.

The book covers the key classes of database starting with relational databases in the form of PostgreSQL. It then goes on to look at six further databases in the so-called NoSQL family – all relatively new compared to venerable relational databases. The six other databases fall into several classes: Riak and Redis are key-value stores, CouchDB and MongoDB are document databases, HBase is a columnar database and Neo4J is a graph database.

Relational databases are characterised by storage schemas involving multiple interlinked tables containing rows and columns, this layout is designed to minimise the repetition of data and to provide maximum query-ability. Key-value stores only store a key and a value in the manner of a dictionary but the “value” may be of a complex type. A value can be returned very fast given a key – this is the core strength of the key-value stores. The document stores MongoDB and CouchDB store JSON “documents” rather than rows. These documents can store information in nested hierarchies which don’t necessarily need to all have the same structure this allows maximum flexibility in the type of data to be stored but at the cost of ease of query.

HBase fits into the Hadoop ecosystem, the language used to describe it looks superficially like that used to describe tables in a relational database but this is a bit misleading. HBase is designed to work with massive quantities of data but not necessarily give the full querying flexibility of SQL. Neo4J is designed to store graph data – collections of nodes and edges and comes with a query language particularly suited to querying (or walking) data so arranged. This seems very similar to triplestores and the SPARQL – used in semantic web technologies.

Relational databases are designed to give you ACID (Atomicity, Consistency, Isolation, Durability), essentially you shouldn’t be able to introduce inconsistent changes to the database and it should always give you the same answer to the same query. The NoSQL databases described here have a subtly different core goal. Most of them are designed to work on the web and address CAP (Consistency, Availability, Partition), indeed several of them offer native REST interfaces over HTTP which means they are very straightforward to integrate into web applications. CAP refers to the ability to return a consistent answer, from any instance of the database, in the face of network (or partition) problems. This assumes that these databases may be stored in multiple locations on the web. A famous theorem contends that you can have any two of Consistency, Availability and Partition resistance at any one time but not all three together.

NoSQL databases are variously designed to scale horizontally and vertically. Horizontal scaling means replicating the same database in multiple places to provide greater capacity to serve requests even with network connectivity problems. Vertically scaling by “sharding” provides the ability to store more data by fragmenting the data such that some items are stored on one server and some on another.

I’m not a SQL expert by any means but it’s telling that I learnt a huge amount about PostgreSQL in the forty or so pages on the database. I think this is because the focus was not on the SQL query language but rather on the infrastructure that PostgreSQL provides. For example, it discusses triggers, rules, plugins and specialised indexing for text search. I assume this style of coverage applies to the other databases. This book is not about the nitty-gritty of querying particular database types but rather about the different database systems.

The NoSQL databases generally support MapReduce style queries this is a scheme most closely associated with Big Data and the Hadoop ecosystem but in this instance it is more a framework for doing queries which maybe executed across a cluster of computers.

I’m on a bit of a graph theory binge at the moment so Neo4J was the most interesting to me.

As an older data scientist I have a certain fondness for things that have been around for a while, like FORTRAN and SQL databases, I’ve looked with some disdain at these newfangled NoSQL things. To a degree this book has converted me, at least to the point where I look at ScraperWiki projects and think – “It might be better to use a * database for this piece of work”.

This is an excellent book which was pitched at just the right level for my purposes, I’ll be looking for more Pragmatic Programmers books in future.