Dr Administrator

Author's posts

Face ReKognition

This post was first published at ScraperWiki. The ReKognition API has now been withdrawn.

I’ve previously written about social media and the popularity of our Twitter Search and Followers tools. But how can we make Twitter data more useful to our customers? Analysing the profile pictures of Twitter accounts seemed like an interesting thing to do, since they are often the faces of the account holders and a face can tell you a number of things about a person, such as their gender, age and race. This type of demographic information is useful for marketing, and for understanding who your product appeals to. It could also be a way of tying together public social media accounts, since people like me use the same image across multiple accounts.

Compact digital cameras have offered face recognition for a while, and on my PC, Picasa churns through my photos identifying people in them. I’ve been doing image analysis for a long time, although never before on faces. My first effort at face recognition involved using the OpenCV library. OpenCV provides a whole suite of image analysis functions which do far more than just detect faces. However, getting it installed and working with the Python bindings on a PC was a bit fiddly, the documentation was poor, and so were the built-in face analysis capabilities.

Fast forward a few months, and I spotted that someone had cast the ReKognition API over the images that the British Library had recently released, a dataset I’ve been poking around at too. The ReKognition API takes an image URL and a list of characteristics in which you are interested. These include gender, race, age, emotion, whether or not you are wearing glasses or, oddly, whether you have your mouth open. Besides this summary information it returns a list of feature locations (i.e. locations in the image of eyes, mouth, nose and so forth). It’s straightforward to use.

But who should be the first targets for my image analysis? Obviously, the ScraperWiki team! The pictures are quite small but ReKognition identified that I was a “Happy, white, male, age 46 with no glasses on and my mouth shut”. Age 46 is a bit harsh – I’m actually 39 in my profile picture. A second target came out as “Happy, Indian, male, age 24.7, with glasses on and mouth shut”. This was fairly accurate: Zarino was 25 when the photo was taken, he is male and has his glasses on, but he is not Indian. Two (male) members of the team have still not forgiven ReKognition for describing them as female, particularly the one described as a 14 year old.

Fun as it was, this doesn’t really count as an evaluation of the technology. I investigated further by feeding in the photos of a whole load of famous people. The results of this are shown in the chart below. The horizontal axis is someone’s actual age, the vertical axis shows their age predicted by ReKognition. If the predictions were correct the points representing the celebrities would fall on the solid line. The dotted line shows a linear regression fit to the data. The equation of the line, y = 0.673x (I constrained it to pass through zero), tells us that the age is consistently under-predicted by a third, or perhaps celebrities look younger than they really are! The R^2 parameter tells us how good the fit is: a value of 0.7591 is not too bad.

[Chart: age predicted by ReKognition against actual age for famous people, with the linear regression fit]
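For anyone wanting to repeat this sort of fit, here is a minimal sketch of a least-squares line constrained to pass through zero. The ages are made up purely for illustration (they are not the celebrity data), the function names are my own, and the R^2 definition used, one minus the residual sum of squares over the total sum of squares about the mean, is an assumption: other tools may define R^2 differently for a fit with no intercept.

```python
def fit_through_origin(actual, predicted):
    """Least-squares slope for predicted = slope * actual (no intercept)."""
    sxy = sum(x * y for x, y in zip(actual, predicted))
    sxx = sum(x * x for x in actual)
    return sxy / sxx


def r_squared(actual, predicted, slope):
    """1 - SS_res / SS_tot, with SS_tot taken about the mean of the predictions."""
    mean_y = sum(predicted) / len(predicted)
    ss_res = sum((y - slope * x) ** 2 for x, y in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in predicted)
    return 1 - ss_res / ss_tot


# Made-up (actual age, detected age) pairs, purely for illustration.
actual_ages = [20, 30, 40, 50, 60]
detected_ages = [14, 21, 26, 34, 40]

slope = fit_through_origin(actual_ages, detected_ages)
fit_quality = r_squared(actual_ages, detected_ages, slope)
print(f"y = {slope:.3f}x, R^2 = {fit_quality:.3f}")
```

A slope below 1 means the detected ages sit below the actual ages on average, which is the pattern described above.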

I also tried out ReKognition on a couple of class photos – taken at reunions, graduations and so forth. My thinking here was that I would get a cohort of people aged within a year of each other. These actually worked quite well; for older groups of people I got a standard deviation of only 5 years across a group of, typically, 10 people. A primary school class came out at 16 ± 9 years, which wasn’t quite so good. I suspect the performance here is related to the fact that such group photos are taken relatively carefully and the lighting and setup for each face in the photo is, by its nature, the same.

Looking across these experiments: ReKognition is pretty good at finding faces in photos, and at not finding faces where there are none (about 90% accurate). It’s fairly good with gender (getting it right about 80% of the time, typically struggling a bit with younger children), and it detects glasses pretty well. I don’t feel I tested it well on race. On age, results are variable: for the ScraperWiki set the R^2 value for linear regression between actual and detected ages is about 0.5, whilst for famous people it is about 0.75. In both cases it tends to under-estimate age, and it has never given an age above 55 despite being fed several more mature celebrities and grandparents. So on age, it definitely tells you something and under certain circumstances it can be quite accurate. Don’t forget the images we’re looking at are completely unconstrained; they’re not passport photos.

Finally, I applied face recognition to Twitter followers for the ScraperWiki account, and my personal account. The Summarise This Data tool on the ScraperWiki Platform provides a quick overview of the data added by face recognition.

[Image: Summarise This Data overview of the face recognition data]

It turns out that a little over 50% of the followers of both accounts have a picture of a human face as their profile picture. It’s clear the algorithm makes the odd error, mis-identifying things that are not human faces as faces (including the back of a London Taxi Cab). There’s also the odd sketch or cartoon of a face rather than a photo, and some accounts have pictures of famous people who are obviously not the account holder. Roughly a third of the followers of either account are identified as wearing glasses, and three quarters of them look happy. Average ages in both cases were 30. The breakdown in terms of race is 70:13:11:7 White:Asian:Indian:Black. Finally, my followers are approximately 45% female, and those of ScraperWiki are about 30% female.

We’re now geared up to apply this to lists of Twitter followers – are you interested in learning more about your followers? Then send us an email and we’ll be in touch.

Book review: Hadoop in Action by Chuck Lam

This review was first published at ScraperWiki.

Hadoop in Action by Chuck Lam provides a brief, fairly technical introduction to the Hadoop Big Data ecosystem. Hadoop is an open source implementation of the MapReduce framework originally developed by Google to process huge quantities of web search data. The name MapReduce refers to dividing up jobs amongst multiple processors (“Mapping”) and then recombining results to provide an answer to the problem (“Reducing”). Hadoop allows users to process large quantities of data by distributing it across a network of relatively cheap computers. Each computer in the network has a portion of the data to process, and at the end the partial results are combined to give the final result. Hadoop provides the infrastructure to enable this. In a sense it is a distributed operating system which provides fundamental services to applications such as Hive and Pig.
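To make the map and reduce steps concrete, here is a toy word count in plain Python. It is only a sketch of the idea running in a single process: it does not use Hadoop, the function names are my own, and the two text chunks simply stand in for the portions of data that separate machines would hold.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all the values seen for one key."""
    return key, sum(values)


# Each chunk stands in for the portion of data held by one machine.
chunks = ["big data big cluster", "cheap cluster big result"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

The point of the split is that each call to map_phase needs only its own chunk, so the calls could run on different machines; only the grouped results have to be brought together for the reduce step.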

At ScraperWiki we’ve had many philosophical discussions about the meaning of Big Data. What size is Big Data? Is it one million lines? Is it one billion lines? Should we express it in terms of gigabytes and terabytes rather than lines?

For many, Big Data is data that requires you use Hadoop or similar to process.

Our view is that Big Data is data big enough to break the tools or hardware you commonly use, so for many of our customers this is a software limit based on Microsoft Excel. Technically Excel can handle a million or so lines but practically speaking life gets uncomfortable as you go above a few tens of thousands of rows.

The largest dataset a customer has come to us with so far is the UK MOT test data – results of the roadworthiness test for every vehicle on the road in the UK over three years old. This dataset is about 100 million lines and 4 gigabytes per year; it’s available back to 2005, giving a total of approximately 1 billion lines and 32GB. A single year can be readily analysed on an Intel i7 laptop with 8GB RAM, MySQL and Tableau. By readily I mean that some indexing jobs can take up to an hour, but once indexed most queries take 10 – 20 minutes at most.

At ScraperWiki a number of us have backgrounds in the physical sciences where we’ve been interested in computationally intensive simulations involving clusters of commodity hardware which pre-date Hadoop (Beowulf), or big data from the Large Hadron Collider. Physical scientists have long been interested in parallel computing where the amount of data to move around is small but the amount of calculation to do is large. The point here is that parallel computing is possible for a subset of problems where a task can be divided up into smaller chunks of data and processing to be distributed amongst multiple processors or machines. In the case of photorealistic computer graphics rendering this might be frames of video, or a portion of a whole scene. Software like Matlab, Fortran and computer graphics renderers can parallelise certain operations with relative ease. The difficulty has always been turning your big computing problem into one of those “certain operations”. The Large Hadron Collider is an example more suited to the Hadoop style approach: the data flows are enormous but the calculations performed on that data are, comparatively, less troublesome.

Hadoop in Action spends a significant amount of time discussing the core Hadoop system and MapReduce processing framework. I must admit to finding this part rather dull. I perked up when we reached Pig, described as a data processing language, and Hive, a SQL-like system. One gets the impression the Pig system was built around a naming convention pushed too far: the Pig command line is called Grunt and the language used by Pig is Pig Latin. Pig and Hive look like systems where I could sit down and run some queries with a language that looks like my old friend, SQL.

The book finishes with some case studies: an image conversion problem, machine learning and data processing at China Mobile, the StumbleUpon social bookmarking system and search for IBM’s intranet. In the latter three cases users were migrating from SQL based systems running on monolithic hardware. To give an idea of scale: China Mobile collect terabytes of data per day across 100s of millions of customers, the IBM intranet has something like 100 million pages and 16 million documents, and StumbleUpon has 25 million users clicking their Stumble buttons about 1 billion times in a month.

Overall, a handy introduction to Hadoop although perhaps oddly pitched – it’s probably too technical for most managers, not technical enough for system administrators and with insufficient applications for data scientists. If you want to get hands-on experience of Hadoop, then the Hortonworks Sandbox provides a pre-packaged virtual machine with a web interface for you to try out the various technologies.

If you want us to help you get value out of your big data or even Big Data, please get in touch!

Book review: Python for Data Analysis by Wes McKinney

This review was first published at ScraperWiki.

As well as developing scrapers and a data platform, at ScraperWiki we also do data analysis. Some of this is just because we’re interested, other times it’s because clients don’t have the tools or the time to do the analysis they want themselves. Often the problem is with the size of the data. Excel is the universal solvent for data analysis problems – go look at any survey of data scientists. But Excel has its limitations. There is the technical limit of something like a million rows maximum size, but well before this size Excel becomes a pain to use.

There is another path – the programming route. As a physical scientist of moderate age I’ve followed these two data analysis paths in parallel. Excel for the quick look see and some presentation. Programming for bigger tasks, tasks I want to do repeatedly and types of data Excel simply can’t handle – like image data. For me the programming path started with FORTRAN and the NAG libraries, from which I moved into Matlab. FORTRAN is pure, traditional programming born in the days when you had to light your own computing fire. Matlab and competitors like Mathematica, R and IDL follow a slightly different path. At their core they are specialist programming languages but they come embedded in graphical environments which can be used interactively. You type code at a prompt and stuff happens, plots pop up and so forth. You can capture this interaction and put it into scripts/programs, or simply write programs from scratch.

Outside the physical sciences, data analysis often means databases. Physical scientists are largely interested in numbers; other scientists and business analysts are often interested in a mixture of numbers and categorical things. For example, in analysing the performance of a drug you may be interested in the dose (i.e. a number) but also in categorical features of the patient such as gender and their symptoms. Databases, and analysis packages such as R and SAS, are better suited to this type of data. Business analysts appear to move from Excel to Tableau as their data get bigger and more complex. Tableau gives easy visualisation of database shaped data. It provides connectors to many different databases. My workflow at ScraperWiki is often Python to SQL database to Tableau.

Python for Data Analysis by Wes McKinney draws these threads together. The book is partly about the range of tools which make Python an alternative to systems like R, Matlab and their ilk and partly a guide to McKinney’s own contribution to this area: the pandas library. Pandas brings R-like dataframes and database-like operations to Python. It helps keep all your data analysis needs in one big Python-y tent. Dataframes are 2-dimensional tables of data whose rows and columns have indexes which can be numeric but are typically text. The pandas library provides a great deal of functionality to process Dataframes, in particular enabling filtering and grouping calculations which are reminiscent of the SQL database workflow. The indexes can be hierarchical. As well as the 2-dimensional Dataframe, pandas also provides the 1-dimensional Series and 3-dimensional Panel data structures.
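As a flavour of the filtering and grouping described here, a small sketch along the lines of the drug trial example follows. The table, its column names and its values are invented for illustration; only the DataFrame, set_index and groupby calls are pandas itself, and the example assumes pandas is installed.

```python
import pandas as pd

# Hypothetical drug-trial style records: one number, one category.
trial = pd.DataFrame(
    {
        "patient": ["a", "b", "c", "d"],
        "gender": ["F", "M", "F", "M"],
        "dose_mg": [10, 20, 30, 40],
    }
).set_index("patient")  # a text index on the rows

# Filtering, like SQL's WHERE clause.
high_dose = trial[trial["dose_mg"] >= 20]

# Grouping, like SQL's GROUP BY with an aggregate.
mean_dose_by_gender = trial.groupby("gender")["dose_mg"].mean()

print(high_dose)
print(mean_dose_by_gender)
```

The result of the groupby is a Series indexed by the category values, so the text labels carry through the calculation rather than being lost as they would be in a bare array.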

I’ve already been using pandas in the Python part of my workflow. It’s excellent for importing data, and simplifies the process of reshaping data for upload to a SQL database and onwards to visualisation in Tableau. I’m also finding it can be used to help replace some of the more exploratory analysis I do in Tableau and SQL.

Outside of pandas the key technologies McKinney introduces are the IPython interactive console and the NumPy library. I mentioned the IPython notebook in my previous book review. IPython gives Python the interactive analysis capabilities of systems like Matlab. The NumPy library is a high performance library providing simple multi-dimensional arrays, comforting those who grew up with a FORTRAN background.

Why switch from commercial offerings like Matlab to the Python ecosystem? Partly it’s cost: the pricing model for Matlab has a moderately expensive core (i.e. $1000) with further functionality in moderately expensive toolboxes (more $1000s). Furthermore, the most painful and complex thing I did at my previous (very large) employer was represent users in the contractual interactions between my company and Mathworks to license Matlab and its associated toolboxes for hundreds of employees spread across the globe. These days Python offers me a wider range of high quality toolboxes, and at its core it’s a respectable programming language with all the features and tooling that brings. If my code doesn’t run it’s because I wrote it wrong, not because my colleague in Shanghai has grabbed the last remaining network license for a key toolbox. R still offers statistical analysis with greater gravitas and some really nice, publication quality plotting but it does not have the air of a general purpose programming language.

The parts of Python for Data Analysis which I found most interesting, and engaging, were the examples of pandas code in “live” usage. Early in the book this includes analysis of first names for babies in the US over time, with later examples from the financial sector – in which the author worked. Much of the rest is very heavy on code snippets, which distracts from a straightforward reading of the book. In some senses Mining the Social Web has really spoiled me – I now expect a book like this to come with an IPython Notebook!

Trainspotting

I am something of a trainspotter.

That’s not to say I have ever stood at the end of a platform writing down the numbers of the trains that go by, rather that I have an interest in things of a railway nature. So obviously I was very excited to get the opportunity to go to Paris on the train.

I’ve been to Paris on the train before. Fifteen years or so ago HappyMouffetard and I travelled from Cambridge to Paris for the odd weekend. In fact that’s where HappyMouffetard picked up her twitter handle. In those days the terminus for the Eurostar was at Waterloo, so the trip meant crossing London from Kings Cross where the Cambridge train came in. Once on the Eurostar you pottered through Kent to the Dover end of the tunnel at what seemed like barely more than walking pace. After passing through the tunnel to France the train accelerated for a while before the guard told us we were travelling at some unimaginable velocity. He sounded a bit smug. The Eurostar would then whine rapidly through northern France to arrive at Gare du Nord.

Things have changed. Now the Eurostar terminus is at St Pancras which is next door to Kings Cross and a short step down the road from Euston, the station I arrive at from Chester. St Pancras International is a rather fine station, particularly when compared to the competition: airports. Not only does it offer a long bank of charging points but also free Wifi! The trip to Dover is transformed, the train plunges underground for the first few miles but then whizzes along at positively unBritish speeds to the Channel. A little over two hours after leaving London, you are in Paris. Pick the right trains and there are just two scheduled stops between Chester and Paris (at Crewe and in London)!

This makes the whole journey rather more of a practical proposition, even if you are travelling from northern England. Chester to London is currently a little over two hours travel time, it would take me an hour and a bit to reach Manchester airport. Check-in for Eurostar is an hour or so, and then a couple of hours to Paris and you end up at Gare du Nord in the centre of Paris rather than Charles de Gaulle Airport – some distance away. Once at Gare du Nord you walk straight off the train onto the street. Similarly on my return trip, I walk straight off the train and I’m on the platform at Euston in 15 minutes.

I’ve rarely found airports relaxing, they seem hellholes of “duty-free” shopping, stressed travellers, over-crowding, bad food, building works to insert more shopping opportunities, suffused with baseline low-level dread that the implausibility of powered flight invokes. The only exceptions I’ve found are when I’ve been able to travel business class and take refuge in the business lounge. In fact, I’m not bothered about the business class flying experience – it’s the lounge I’d pay for myself! And once you’re on the plane it’s cheek by jowl with your fellow man, and air hostesses trying to force plastic food upon you, hand luggage woes as there is insufficient space for the hand luggage everyone now carries since you get gouged for hold luggage.

Cost-wise things aren’t so happy: a train to London is expensive unless you travel “off peak”, a small window in the middle of the day and later in the evening.

In summary, from Chester to Paris:

Flying: 1 hour 30 minutes + 2 hours check-in + 1 hour 30 minutes flight + 1 hour to Paris + airport hell = nasty 6 hours

Train: 2 hours + 1 hour check-in + 2 hours + shiny nice things = nice 5 hours

Book review: Mining the Social Web by Matthew A. Russell

This review was first published at ScraperWiki.

The Twitter search and follower tools are amongst the most popular on the ScraperWiki platform so we are looking to provide more value in this area. To this end I’ve been reading “Mining the Social Web” by Matthew A. Russell.

In the first instance the book looks like a run through the APIs for various social media services (Twitter, Facebook, LinkedIn, Google+, GitHub etc) but after the first couple of chapters on Twitter and Facebook it becomes obvious that it is more subtle than that. Each chapter also includes material on a data mining technique; for Twitter it is simply counting things. The Facebook chapter introduces graph analysis, a theme extended in the chapter on GitHub. Google+ is used as a framework to introduce term frequency-inverse document frequency (TF-IDF), an information retrieval technique and a basic, but effective, way to process natural language. Web page scraping is used as a means to introduce some more ideas about natural language processing and summarisation. Mining mailboxes uses a subset of the Enron mail corpus to introduce MongoDB as a document storage system. The final chapter is a Twitter cookbook which includes lots of short recipes for simple Twitter related activities but no further analysis. The coverage of each topic isn’t deep but it is practical – introducing the key libraries to do tasks. And it’s alive with suggestions for further work, and references to help with that.
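For readers who haven’t met TF-IDF, here is a minimal sketch of one common formulation: the term’s share of a document, multiplied by the log of how rare the term is across the corpus. This is not the code from the book, the function name and toy corpus are my own, and real implementations differ in details such as smoothing and tokenisation.

```python
import math


def tf_idf(term, document, corpus):
    """Score a term in one document against a corpus of tokenised documents."""
    # Term frequency: how much of this document is the term.
    tf = document.count(term) / len(document)
    # Inverse document frequency: rarer across the corpus means a higher weight.
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0
    idf = math.log(len(corpus) / containing)
    return tf * idf


# A toy corpus; real posts would need proper tokenisation.
corpus = [
    "the train to paris".split(),
    "the book about hadoop".split(),
    "the book about pandas".split(),
]

common = tf_idf("the", corpus[0], corpus)
distinctive = tf_idf("paris", corpus[0], corpus)
print(common, distinctive)
```

A word that appears in every document scores zero however often it is used, whilst a word found in only one document scores highly there, which is why the technique is useful for picking out what a document is about.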

The examples in the book are provided as IPython Notebooks which are supplied, along with a Notebook server on a virtual machine, from a GitHub repository. IPython notebooks are interactive Python sessions run through a browser interface. Content is divided into cells which can either be code or simple descriptive text. A code cell can be executed and the output from the code appears in an output cell. These notebooks are a really nice way to present example code since the code has some context. The virtual machine approach is also a great innovation since configuring Python libraries and the IPython server itself, in a platform agnostic manner, is really difficult and this solution bypasses most of those problems. The system makes it incredibly easy to run the example code for yourself, almost too easy in fact: I found myself clicking blindly through some of the example code. Potentially the book could have been presented simply as an IPython notebook. This is likely not economically practical, but it would be nice to collect the links to further reading there, where they would be more usable. The GitHub repository also provides a great place for interaction with the author: I filed a couple of issues regarding setting the system up and he responded unfailingly quickly – as he did for many other readers. Also I discovered incidentally, through being subscribed to the repository, that one of the people I follow on Twitter (and a guest blogger here) was also reading the book. An interesting example of the social web in action!

Mining the Social Web covers some material I had not come across in my earlier machine learning/data mining reading. There are a couple of chapters containing material on graph theory using data from Facebook and GitHub. As a benefit of reading about the same material in different places, Russell highlights that clustering and de-duplication are facets of the same subject.

I read with interest the section on using a MongoDB database as a store for tweets and other data in the form of JSON objects. Currently I am bemused by MongoDB. The ScraperWiki platform uses it to store user profile information, and I have occasional recourse to try to look things up there. I’ve struggled to see the benefit of MongoDB over a SQL database, particularly having watched two of my colleagues spend a morning working out how to do what would be a simple SQL join in MongoDB. Mining the Social Web has made me wonder about giving MongoDB another chance.
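To illustrate the kind of join I have in mind, the sketch below does the same lookup two ways: once with a SQL JOIN in an in-memory SQLite database, and once by hand over lists of JSON-like records. The users and tweets are invented, and the second half uses plain Python dictionaries as a stand-in for a document store rather than MongoDB itself, so it says nothing about how MongoDB would express the query.

```python
import sqlite3

# Hypothetical users and tweets, invented for this example.
users = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
tweets = [
    {"user_id": 1, "text": "trains"},
    {"user_id": 2, "text": "scrapers"},
    {"user_id": 1, "text": "pandas"},
]

# Relational route: the database does the join.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, name TEXT)")
db.execute("CREATE TABLE tweets (user_id INTEGER, text TEXT)")
db.executemany("INSERT INTO users VALUES (:id, :name)", users)
db.executemany("INSERT INTO tweets VALUES (:user_id, :text)", tweets)
sql_rows = db.execute(
    "SELECT users.name, tweets.text FROM tweets "
    "JOIN users ON users.id = tweets.user_id "
    "ORDER BY tweets.rowid"
).fetchall()

# Document route: build a lookup and stitch the records together by hand.
names_by_id = {user["id"]: user["name"] for user in users}
doc_rows = [(names_by_id[t["user_id"]], t["text"]) for t in tweets]

print(sql_rows == doc_rows)
```

With SQL the relationship lives in the query; with separate collections of documents it has to live in your own code, or be avoided by embedding one record inside the other.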

The penultimate chapter is a discussion of the semantic web, introducing both microformats and RDF technology, although the discussion is much less concrete than earlier chapters. Microformats are HTML elements which hold semantic information about a page using an agreed schema; to give an example, the geo microformat encodes geographic information. In the absence of such a microformat, geographic information such as latitude and longitude could be encoded in pretty much any way, making it necessary to either use custom scrapers on a page by page basis or complex heuristics to infer the presence of such information. RDF is one of the underpinning technologies for the semantic web: a shorthand for a worldwide web marked up such that machines can understand the meaning of webpages. This touches on the EU Newsreader project on which we are collaborators, and which seeks to generate this type of semantic mark up for news articles using natural language processing.
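To show why an agreed schema helps a scraper, here is a small sketch that pulls coordinates out of a page fragment marked up in the style of the geo microformat, using only Python’s built-in HTML parser. The fragment and the approximate coordinates are made up, the GeoParser class is my own, and as a simplification it looks only for the latitude and longitude class names without checking they sit inside a geo element.

```python
from html.parser import HTMLParser


class GeoParser(HTMLParser):
    """Collect values from elements with class latitude or longitude."""

    def __init__(self):
        super().__init__()
        self.current = None  # field we are currently inside, if any
        self.coords = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for field in ("latitude", "longitude"):
            if field in classes:
                self.current = field

    def handle_data(self, data):
        # Take the first non-blank text after a matching start tag.
        if self.current and data.strip():
            self.coords[self.current] = float(data)
            self.current = None


# Hypothetical page fragment marked up in the geo microformat style.
html = (
    '<p>Chester: <span class="geo">'
    '<span class="latitude">53.19</span>, '
    '<span class="longitude">-2.89</span>'
    "</span></p>"
)

parser = GeoParser()
parser.feed(html)
print(parser.coords)
```

Because the class names are agreed in advance, the same few lines work on any page that uses them, with no page by page customisation.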

Overall, definitely worth reading. We’re interested in extending our tools for social media and with this book in hand I’m confident we can do it and be aware of more possibilities.