Tag: data science

Book review: The Signal and the Noise by Nate Silver


This review was first published at ScraperWiki.

Nate Silver first came to my attention during the 2008 presidential election in the US. He correctly predicted the November result in 49 of 50 states, missing only Indiana, where Barack Obama won by just a single percentage point. This is part of a wider career in prediction: aside from a job at KPMG (which he found rather dull), he has been a professional poker player and has run a baseball statistics website.

His book The Signal and the Noise: The Art and Science of Prediction looks at prediction in a range of fields: economics, disease, chess, baseball, poker, climate, weather, earthquakes and politics. It is both a review and a manifesto; a manifesto for making better predictions.

The opening chapter is on the catastrophic miscalculation of the default rate for collateralized debt obligations (CDOs), which led in large part to the recent financial crash. The theme for this chapter is risk and uncertainty. In this instance, risk means the predicted default rate for these financial instruments and uncertainty means the uncertainty in those predictions. For CDOs the problem was that the estimates of risk were way off, and there was no recognition of the uncertainty in those risk estimates.

This theme of unrecognised uncertainty returns for the prediction of macroeconomic indicators such as GDP growth and unemployment. Here again forecasts are made, and practitioners in the field know these estimates are subject to considerable uncertainty, but the market for prediction, and simple human frailty mean that these uncertainties are ignored. I’ve written elsewhere on both the measurement and prediction of GDP. In brief, both sides of the equation are so fraught with uncertainty that you might as well toss a coin to predict GDP!

The psychology of prediction returns with a discussion of political punditry and the idea of “foxes” and “hedgehogs”. In this context “hedgehogs” are people with one Big Idea who make their predictions based on their Big Idea and are unshiftable from it. In the political arena the “Big Idea” may be ideology, but as a scientist I can see that science can be afflicted in the same way. To a man with a hammer, everything is a nail. “Foxes”, on the other hand, are more eclectic and combine a range of data and ideas to predict; as a result they are more successful.

In some senses, the presidential prediction is a catching-chickens-in-a-coop exercise. There is quite a bit of data to use, and your fellow political pundits, being “hedgehogs”, are typically misled by their own prejudices, so all you, the “fox”, need to do is calculate in a fairly rational manner and you’re there. Silver returns to the idea of diversity in prediction, combining multiple models or data of multiple types, in his manifesto for better prediction.

There are several chapters on scientific prediction, looking at the predictions of earthquakes, the weather, climate and disease. There is not an overarching theme across these chapters. The point about earthquakes is that precise prediction about where, when and how big appears to be impossible. At short range, the weather is predictable but, beyond a few days, seasonal mean predictions and “the same as today” predictions are as good as the best computer simulations.

Other interesting points raised by the weather chapter are the ideas of judging predictions by accuracy, honesty and economic value. Honesty in this sense means “is this the best model I can make at this point?”. The interesting thing about weather forecasts in the US is that the National Weather Service makes the results of its simulations available to all. Big national value-adders such as the Weather Channel produce forecasts with a “wet bias”, systematically overstating the chance of rain when the true chance is low. This is because customers prefer to be told that there is, say, a 20% chance of rain and have it turn out dry, than to be told the actual chance of rain (say 5%) and have it rain. This “wet bias” gives customers what they want.

Finally, there are chapters on prediction in games: poker, baseball, basketball and chess. The benefit of these systems is the large amount of data available; the US in particular has a thriving “gaming statistics” culture. The problems are closed, in the sense that each game has a set of known rules. And finally, there is ample opportunity for testing predictions against outcomes. This final factor is important in improving prediction.

In technical terms, Silver’s call to arms is for the Bayesian approach to predictions. With this approach, an attempt is made to incorporate prior beliefs into prediction through the use of an estimate of the prior probability and Bayes’ Theorem. Silver exemplifies this with a discussion of prediction in poker. Essentially, poker is about predicting your opponents’ hands and their subsequent actions based on those cards. To be successful, poker players must do this repeatedly and intuitively. However, luck still plays a large part in poker and only the very best players make a profit long term.
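As a minimal sketch of the sort of Bayesian update involved (the numbers here are illustrative, not taken from the book): suppose you want the probability that an opponent holds a strong hand, given that they have raised.

```python
# Illustrative Bayesian update using Bayes' Theorem.
# The probabilities below are made-up numbers for demonstration only.

p_strong = 0.10                # prior: fraction of dealt hands that are "strong"
p_raise_given_strong = 0.80    # likelihood: how often a strong hand raises
p_raise_given_weak = 0.15      # how often a weak hand raises (bluffs, etc.)

# Total probability of seeing a raise, from the law of total probability
p_raise = (p_raise_given_strong * p_strong
           + p_raise_given_weak * (1 - p_strong))

# Posterior: probability the hand is strong, given that a raise was observed
p_strong_given_raise = p_raise_given_strong * p_strong / p_raise

print(f"P(strong | raise) = {p_strong_given_raise:.2f}")  # about 0.37
```

The raise shifts the probability of a strong hand from 10% to roughly 37%; repeat that kind of update over every bet and you have the intuitive calculation Silver describes good poker players making.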

The book is structured around a series of interviews with workers in the field, and partly as a result of this it reads rather nicely – it’s not just a dry exposition of facts, but has some human elements too. Though it’s no replacement for a technical reference, it’s worth reading as a survey of prediction, as well as for the human interest stories.

The Third Way

Operating Systems were the great religious divide of our age.

A little over a year ago I was writing about my experiences setting up my Sony Vaio Windows 8 laptop to run Ubuntu on a virtual machine. Today I am exploring the Third Way – I’m writing this on a MacBook Air. This is the result of a client requirement: I’m working with the Government Digital Service who are heavily Mac oriented.

I think this makes me a secularist in computing terms.

My impressions so far:

Things got off to a slightly rocky start: the MacBook I’m using is a hand-me-down from a departing colleague. We took the joint decision to start from scratch on this machine and, as a result of some cavalier disk erasing, we ended up with a non-booting MacBook. In theory we should have been able to do a reinstall over the internet; in practice this didn’t work. So off I marched to our local Apple Store to get things sorted, the first time I’d entered such an emporium. I was to leave disappointed: it turned out I needed to make an appointment for a “Genius” to triage my laptop, the next appointment was a week hence, and I couldn’t leave the laptop behind for a “Genius Triage”. Alternatively, I could call Apple Care.

As you may guess this Genius language gets my goat! My mate Zarino was an Apple College Rep – should they have called him a Jihadi? Could you work non-ironically with a job title of Genius?

Somewhat bizarrely, marching the Air to the Apple Store and back fixed the problem, and an hour or so later I had a machine with an operating system. Perhaps it received a special essence from the mothership. On successfully booting, my first action was to configure my terminal. For the uninitiated, the terminal is the thing that looks like computing from the early 80s – you type in commands at a prompt and are rewarded with more words in return. The reason for this odd choice was the intended usage: this MacBook is for coding, so next up was installing Sublime Text. I now have an environment for coding which superficially looks like the terminal/editor combination I use on Windows and Ubuntu!

It’s worth noting that on the MacBook the bash terminal I am using is a native part of the operating system, as it is for the Ubuntu VM; on Windows the bash terminal is bolted on to make various open source tools work.

Physically the machine is beautiful. My Vaio is quite pretty but compared to the Air it is fat and heavy. The Air has no hard disk indicator light; indeed, it has no hard disk, rather a 256GB SSD, which means it boots really fast. 256GB is a bit small for me these days; with a title of data scientist I tend to stick big datasets on my laptop.

So far I’ve been getting used to using cmd+c and cmd+v to copy and paste, having overwritten things repeatedly with “v” after instinctively doing the Windows-style ctrl+v. I’m getting used to the @ and ” keys being in the wrong place, and to the menu bar for applications always appearing at the top of the screen, not the top of the application window. Fortunately I can configure the trackpad to simulate a two-button mouse, rather than the default one-button scheme. I find the Apple menu bar at the top a bit too small and austere, and the Dock at the bottom is a bit cartoony. The Notes application is a travesty, a little faux notebook, although I notice in OS X Mavericks it is more business-like.

For work I don’t anticipate any great problems in working entirely on a Mac: we use Google Apps for email and make extensive use of Google Docs, and we use online services like Trello, GitHub and Pivotal in place of client-side applications. Most of the coding I do is in Python. The only no-go area is Tableau, which is currently only available on Windows.

I’ve never liked the OS wars; perhaps it was a transitional thing. I grew up in a time when there was a plethora of home computers. I’ve written programs on the TRS-80, Commodore VIC-20, Amstrad CPC464 and Sinclair ZX81, and been aware of many more. At work I’ve used DEC Alphas, VAX/VMS and also PCs and Macs. Latterly everything is on the web, so the OS is just a platform for a browser.

I’m thinking of strapping the Air and the Vaio back to back to make a triple booting machine!

Messier and messier

Regular readers with a good memory will recall I bought a telescope about 18 months ago. I bemoaned the fact that I bought it in late Spring, since it meant it got dark rather late. I will note here that astronomy is generally incompatible with a small child who might wake you up in the middle of the night, requiring attention and early nights.

Since then I’ve taken pictures of the sun, the moon, Jupiter, Saturn and, as a side project, wide-angle photos of the Milky Way and star trails (telescope not required). Each of these brought their own challenges, and awe. The sun because it’s surprisingly difficult to find the thing in your viewfinder with the serious filter required to stop you blinding yourself when you do find it. The moon because it’s just beautiful and fills the field of view, rippling through the “seeing”, or thermal turbulence of the atmosphere. Jupiter because of its Galilean moons, first observed by Galileo in 1610. Saturn because of its tiny ears; I saw Saturn on my first night of proper viewing. As the tiny image of Saturn floated across my field of view I was hopping up and down with excitement like a child.

I’ve had a bit of a hiatus in the astrophotography over the past year but I’m ready to get back into it.

My next targets for astrophotography are the Deep Sky Objects (DSOs): these are largish, faint things, as opposed to planets, which are smallish, bright things. My accidental wide-angle photos clued me in to the possibilities here. I’d been trying to photograph constellations, which turned out to be a bit dull; at the end of the session I put the sensitivity of my camera right up, increased the exposure time, and suddenly the Milky Way appeared! Even in rural Wales it was only just visible to the naked eye.

Now I’m keen to explore more of these faint objects. The place to start is the Messier Catalogue of objects. This was compiled by Charles Messier and Pierre Méchain in the latter half of the 18th century. You may recognise the name Méchain: he was one of the two Frenchmen who surveyed France on the cusp of the Revolution to define a value for the metre. Ken Alder’s book The Measure of All Things describes their adventures.

Messier and Méchain weren’t interested in the deep sky objects themselves; they were interested in comets, and compiled the list in order not to be distracted from their studies by other, non-comety objects. The list comprises star clusters, nebulae and galaxies. I must admit to being a bit dismissive of star clusters. The Messier list is by no means exhaustive, and the observations were all made in France with a small telescope, so there are no objects from the Southern skies. But they are ideal for amateur astronomers in the Northern hemisphere, since the high-tech professional telescope of the 18th century is matched by the consumer telescope of the 21st.

I’ve known of the Messier objects since I was a child, but I have no intuition as to where they are, or how bright and how big they are. So to get me started I found some numbers and made some plots.

The first plot shows where the objects are in the sky. They are labelled, somewhat fitfully, with their Messier number and common name. Their locations are shown by declination (how far away from the celestial equator an object is, towards the North Pole) and right ascension (how far around it is along a line of celestial latitude). I’ve added the moon to the plot in a fixed position close to the top left. As you can see, the majority of the objects are North of the celestial equator. The size of the symbols indicates the relative size of the objects. The moon is shown to the same scale and we can see that a number of the objects are larger than the moon; these are often star clusters, but galaxies such as Andromeda (the big purple blob on the right) and the Triangulum Galaxy are also bigger than the moon, as is the Orion nebula.

Position
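For anyone wanting to reproduce this sort of chart, a minimal matplotlib sketch is below; it assumes the Messier data is in a CSV file with columns for name, right ascension, declination and apparent size – the file and column names are hypothetical, not something provided with the original post.

```python
# Sketch of a position plot for the Messier objects.
# Assumes a hypothetical CSV with columns: name, ra_hours, dec_degrees, size_arcmin
import matplotlib.pyplot as plt
import pandas as pd

messier = pd.read_csv("messier.csv")

fig, ax = plt.subplots(figsize=(10, 5))

# Marker area gives a relative (not exact) indication of apparent size
ax.scatter(messier["ra_hours"], messier["dec_degrees"],
           s=messier["size_arcmin"], alpha=0.5)

# The full moon is about 31 arcminutes across; plot it at a fixed position
# near the top left for comparison
ax.scatter([1], [80], s=31, color="grey")
ax.annotate("Moon", (1, 80))

ax.set_xlabel("Right ascension (hours)")
ax.set_ylabel("Declination (degrees)")
ax.set_title("Messier objects: symbol size indicates apparent size")
plt.show()
```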

So why aren’t we as familiar with these objects as we are with the moon? The second plot shows how bright the Messier objects are and their size. The horizontal axis shows their apparent size – it’s a linear scale, so that an object twice as far from the vertical axis is twice as big. Note that these are apparent sizes: some things appear larger than others because they are closer. The vertical axis shows the apparent brightness; in astronomy brightness is measured in units of “magnitude”, a logarithmic scale in which smaller (more negative) numbers are brighter and each step of five magnitudes is a factor of 100 in brightness. This means that with the sun at roughly magnitude –26 and the moon at roughly magnitude –13, the sun is over 100,000 times brighter than the moon. The Messier objects are all much dimmer than Venus, Jupiter and Mercury, and generally dimmer than Saturn.

Size-Magnitude
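The magnitude arithmetic is worth spelling out, since the scale runs backwards and is logarithmic. A quick calculation using rounded magnitudes and the rule that five magnitudes corresponds to a factor of 100 in brightness:

```python
# Convert a difference in astronomical magnitudes into a brightness ratio.
# By definition, a difference of 5 magnitudes is a factor of 100 in brightness,
# and smaller (more negative) magnitudes are brighter.

def brightness_ratio(mag_bright, mag_dim):
    """How many times brighter the first object is than the second."""
    return 100 ** ((mag_dim - mag_bright) / 5)

print(brightness_ratio(-26, -13))   # sun vs full moon: roughly 160,000
print(brightness_ratio(-13, 3.4))   # full moon vs the Andromeda galaxy (M31)
```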

 

So the Messier objects are often bigger but dimmer than things I have already photographed. But wait: the moon fills the field of view of my telescope. And not only that, my telescope has a focal ratio of f/10 – a measure of its light gathering power. This is actually rather “slow” for a camera lens; my “fastest” lens is f/1.4, which represents a roughly 50-fold larger light gathering power.

For these two reasons I have ordered a new lens for my camera, a Samyang 500mm f/6.3. This is going to give me a bigger field of view than my telescope, which has a focal length of 1250mm, and also more light gathering power – my new lens should have more than double the light gathering power!
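These light-gathering comparisons follow from the fact that, for the same sensor and exposure time, the light collected scales as the inverse square of the focal ratio. Checking the numbers quoted above:

```python
# Relative light-gathering power for the same sensor and exposure time
# scales as 1 / (focal ratio)^2.

def light_gain(f_slow, f_fast):
    """How many times more light the faster (smaller f-number) optic gathers."""
    return (f_slow / f_fast) ** 2

print(light_gain(10, 1.4))   # f/10 telescope vs f/1.4 lens: about 51x
print(light_gain(10, 6.3))   # f/10 telescope vs the 500mm f/6.3 lens: about 2.5x
```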

Watch this space for the results of my new purchase!

Sublime


Sublime Text

Coders can be obsessive about their text editors, dividing into relatively good-natured camps. It is text editors, not development environments, over which they obsess, and the great schism is between the followers of vim and those of Emacs. The line between text editor and development environment can be a bit fuzzy: a development environment is designed to help you do all the things required to make working software (writing, testing, compiling, linking, debugging, organising projects and libraries), whilst a text editor is designed to edit text. But sometimes text editors get mission creep.

vim and emacs are both editors with a long pedigree on Unix systems. vim‘s parent, vi, came into being in 1976, with vim itself being born in 1991; vim stands for “Vi Improved”. Emacs was also born in 1976. Glancing at the emacs Wikipedia page I see there are elements of religiosity in the conflict between them.

To users of OS X and Windows, vim and emacs look and feel, frankly, bizarre. They came into being when windowed GUI interfaces didn’t exist. In basic mode they offer a large blank screen with no icons or even text menu items; there is a status line and a command line at the bottom of the screen. Users interact by issuing keyboard commands: they are interfaces with only keyboard shortcuts. It’s said that the best way to generate a random string of characters is to put a class of naive computer science undergraduates down in front of vim and tell them to save the file and exit the program! In fact, to demonstrate the point, I’ve just trapped myself in emacs whilst trying to take a screenshot.


vim, image by Hermann Uwe


emacs, image by David Mundy

vim and emacs are both incredibly extensible, they’re written by coders for coders. As a measure of their flexibility: you can get twitter clients which run inside them.

I’ve used both emacs and vim but not warmed to either of them. I find them ugly to look at and confusing, and I don’t sit in front of an editor for enough of the day to make remembering keyboard shortcuts a comfortable experience. I’ve used the Matlab, Visual Studio and Spyder IDEs but never felt impassioned enough to write a blog post about them. I had a bad experience with Eclipse, which led to one of my more valued Stack Overflow answers.

But now I’ve discovered Sublime Text.

Sublime Text is very beautiful, particularly beside vim and emacs. I like the little inset in the top right of my screen which shows the file I’m working on from an eagle’s perspective, and the nice rounded tabs. The colour scheme is subtle and muted, and I can get a panoply of variants on the theme. At Unilever we used to talk about trying to delight consumers with our products – Sublime Text does this. My only wish is that it went the way of Google Chrome and got rid of the Windows title bar at the top.

Not only this: as with emacs and vim, I can customise Sublime Text with code, or use packages other people have written, in my favoured language, Python.

I use Sublime Text mainly to code in Python, using a Git Bash prompt to run code and to check it into source control. At the moment I have the following packages installed:

  • Package Control – for some reason the thing that makes it easy to add new packages to Sublime Text comes as a separate package which you need to install manually;
  • PEP8 Autoformat – languages have style guides: soft guidelines to ensure consistent use of whitespace, capitalisation and so forth. Some people get very uptight about style. PEP8 is the Python style guide, and PEP8 Autoformat allows you to effortlessly conform to the style guide and so avoid friction with your colleagues (see the sketch after this list);
  • Cheat Sheets – I can’t remember how to do anything, cheat sheets built into the editor make it easy to find things, and you can add your own cheat sheets too;
  • Markdown Preview – Markdown is a way of writing HTML without all the pointy brackets; this package helps you view the output of your Markdown;
  • SublimeRope – a handy package that tells you when your code won’t run and helps with autocompletion. Much better than cryptic error messages when you try to run faulty code. I suspect this is the most useful one so far.
  • Git and GitGutter – integrating Git source control into the editor. Git provides all the Git commands on a menu whilst GitGutter adds markers in the margin (or gutter) showing the revision status. These work nicely on Ubuntu but I haven’t worked out how to configure them on Windows.
  • SublimeREPL – brings a Python prompt into the editor. There are some configuration subtleties here when working with virtual environments.
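As an example of the sort of thing PEP8 Autoformat tidies up, here is a contrived before-and-after snippet (not from any real project):

```python
# Before: legal Python, but the spacing falls foul of PEP8
def total_cost( price,quantity ):
    return price*quantity


# After automatic reformatting: whitespace normalised, behaviour unchanged
def total_cost(price, quantity):
    return price * quantity
```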

I know I’ve only touched the surface of Sublime Text but unlike other editors I want to learn more!

Face ReKognition


This post was first published at ScraperWiki. The ReKognition API has now been withdrawn.

I’ve previously written about social media and the popularity of our Twitter Search and Followers tools. But how can we make Twitter data more useful to our customers? Analysing the profile pictures of Twitter accounts seemed like an interesting thing to do, since they are often the face of the account holder, and a face can tell you a number of things about a person, such as their gender, age and race. This type of demographic information is useful for marketing and understanding who your product appeals to. It could also be a way of tying together public social media accounts, since people like me use the same image across multiple accounts.

Compact digital cameras have offered face recognition for a while, and on my PC, Picasa churns through my photos identifying the people in them. I’ve been doing image analysis for a long time, although never before on faces. My first effort at face recognition involved the OpenCV library. OpenCV provides a whole suite of image analysis functions which do far more than just detect faces. However, getting it installed and working with the Python bindings on a PC was a bit fiddly, the documentation was poor and the built-in face analysis capabilities were limited.
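For reference, a minimal face-detection script with the OpenCV Python bindings and one of the bundled Haar cascades looks roughly like the sketch below. The image file names are placeholders, and the cascade path assumes a reasonably recent opencv-python install; this is the flavour of the code involved, not the exact script I used at the time.

```python
# Minimal face detection with OpenCV's Python bindings and a bundled Haar cascade.
# The image file names are placeholders.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("profile_picture.jpg")
grey = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Returns one (x, y, width, height) rectangle per detected face
faces = cascade.detectMultiScale(grey, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("faces_marked.jpg", image)
print(f"Found {len(faces)} face(s)")
```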

Fast forward a few months, and I spotted that someone had cast the ReKognition API over the images that the British Library had recently released, a dataset I’ve been poking around in too. The ReKognition API takes an image URL and a list of characteristics in which you are interested. These include gender, race, age, emotion, whether or not you are wearing glasses or, oddly, whether you have your mouth open. Besides this summary information it returns a list of feature locations (i.e. the locations in the image of the eyes, mouth, nose and so forth). It’s straightforward to use.
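The API has since been withdrawn, so the snippet below is only a sketch of the general pattern – an HTTP POST with an image URL and the list of jobs you want run. The endpoint, parameter names and response fields are hypothetical reconstructions, not the real interface.

```python
# Hypothetical sketch of a ReKognition-style request; the endpoint, parameter
# names and response fields below are illustrative only - the real API has
# been withdrawn and its exact interface is not reproduced here.
import requests

response = requests.post(
    "https://api.example.com/rekognition",          # placeholder endpoint
    data={
        "api_key": "YOUR_KEY",                      # placeholder credentials
        "api_secret": "YOUR_SECRET",
        "jobs": "face_gender_age_emotion_glasses",  # characteristics requested
        "urls": "http://example.com/profile.jpg",   # image to analyse
    },
)

result = response.json()
for face in result.get("face_detection", []):       # illustrative field names
    print(face.get("sex"), face.get("age"), face.get("glasses"))
```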

But who should be the first targets for my image analysis? Obviously, the ScraperWiki team! The pictures are quite small but ReKognition identified me as “Happy, white, male, age 46 with no glasses on and my mouth shut”. Age 46 is a bit harsh – I’m actually 39 in my profile picture. A second target came out as “Happy, Indian, male, age 24.7, with glasses on and mouth shut”. This was fairly accurate: Zarino was 25 when the photo was taken, he is male and has his glasses on, but he is not Indian. Two (male) members of the team have still not forgiven ReKognition for describing them as female, particularly the one described as a 14 year old.

Fun as it was, this doesn’t really count as an evaluation of the technology, so I investigated further by feeding in the photos of a whole load of famous people. The results are shown in the chart below. The horizontal axis is someone’s actual age, the vertical axis shows their age as predicted by ReKognition. If the predictions were correct the points representing the celebrities would fall on the solid line. The dotted line shows a linear regression fit to the data. The equation of the line, y = 0.673x (I constrained it to pass through zero), tells us that age is consistently under-predicted by about a third – or perhaps celebrities look younger than they really are! The R² parameter tells us how good the fit is: a value of 0.7591 is not too bad.
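For anyone wanting to reproduce this sort of fit, a through-origin least-squares regression is only a few lines of numpy. The age arrays below are placeholders rather than the actual celebrity data.

```python
# Through-origin least-squares fit of predicted age against actual age.
# The arrays are placeholders standing in for the real celebrity data.
import numpy as np

actual_age = np.array([25, 30, 40, 50, 60, 70], dtype=float)
predicted_age = np.array([18, 22, 28, 33, 40, 45], dtype=float)

# For y = slope * x with no intercept, the least-squares slope is sum(xy) / sum(x^2)
slope = np.sum(actual_age * predicted_age) / np.sum(actual_age ** 2)

# A simple coefficient of determination (conventions differ slightly for
# fits without an intercept)
residuals = predicted_age - slope * actual_age
r_squared = 1 - np.sum(residuals ** 2) / np.sum((predicted_age - predicted_age.mean()) ** 2)

print(f"slope = {slope:.3f}, R^2 = {r_squared:.3f}")
```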

ReKognitionFacePeopleChart

I also tried out ReKognition on a couple of class photos – taken at reunions, graduations and so forth. My thinking here was that I would get a cohort of people aged within a year of each other. These actually worked quite well; for older groups of people I got a standard deviation of only 5 years across a group of, typically, 10 people. A primary school class came out at 16±9 years, which wasn’t quite so good. I suspect the performance here is related to the fact that such group photos are taken relatively carefully, and the lighting and setup for each face in the photo is, by its nature, the same.

Looking across these experiments: ReKognition is pretty good at finding faces in photos, and at not finding faces where there are none (about 90% accurate). It’s fairly good with gender (getting it right about 80% of the time, typically struggling a bit with younger children), and it detects glasses pretty well. I don’t feel I tested it well on race. On age the results are variable: for the ScraperWiki set the R² value for the linear regression between actual and detected ages is about 0.5, whilst for famous people it is about 0.75. In both cases it tends to under-estimate age and has never given an age above 55, despite being fed several more mature celebrities and grandparents. So on age it definitely tells you something, and under certain circumstances it can be quite accurate. Don’t forget the images we’re looking at are completely unconstrained; they’re not passport photos.

Finally, I applied face recognition to Twitter followers for the ScraperWiki account, and my personal account. The Summarise This Data tool on the ScraperWiki Platform provides a quick overview of the data added by face recognition.

face_recognition_data

It turns out that a little over 50% of the followers of both accounts have a picture of a human face as their profile picture. It’s clear the algorithm makes the odd error, mis-identifying things that are not human faces as faces (including the back of a London taxi cab). There’s also the odd sketch or cartoon of a face rather than a photo, and some accounts have pictures of famous people rather than of the account holder. Roughly a third of the followers of either account are identified as wearing glasses, and three quarters of them look happy. Average ages in both cases were 30. The breakdown in terms of race is 70:13:11:7 White:Asian:Indian:Black. Finally, my followers are approximately 45% female, and those of ScraperWiki are about 30% female.

We’re now geared up to apply this to lists of Twitter followers – are you interested in learning more about your followers? Then send us an email and we’ll be in touch.