The Third Way

Operating Systems were the great religious divide of our age.

A little over a year ago I was writing about my experiences setting up my Sony Vaio Windows 8 laptop to run Ubuntu on a virtual machine. Today I am exploring the Third Way – I’m writing this on a MacBook Air. This is the result of a client requirement: I’m working with the Government Digital Service who are heavily Mac oriented.

I think this makes me a secularist in computing terms.

My impressions so far:

Things got off to a slightly rocky start: the MacBook I’m using is a hand-me-down from a departing colleague. We took the joint decision to start from scratch on this machine, and as a result of some cavalier disk erasing we ended up with a non-booting MacBook. In theory we should have been able to do a reinstall over the internet; in practice this didn’t work. So off I marched to our local Apple Store to get things sorted – the first time I’d entered such an emporium. I was to leave disappointed: it turned out I needed to make an appointment for a “Genius” to triage my laptop, the next appointment was a week hence, and I couldn’t leave the laptop behind for a “Genius Triage”. Alternatively, I could call Apple Care.

As you may guess this Genius language gets my goat! My mate Zarino was an Apple College Rep – should they have called him a Jihadi? Could you work non-ironically with a job title of Genius?

Somewhat bizarrely, marching the Air to the Apple Store and back fixed the problem, and an hour or so later I had a machine with an operating system. Perhaps it received a special essence from the mothership. On successfully booting, my first action was to configure my terminal. For the uninitiated, the terminal is the thing that looks like computing from the early 80s – you type in commands at a prompt and are rewarded with more words in return. The reason for this odd choice was the intended usage: this MacBook is for coding, so next up was installing Sublime Text. I now have an environment for coding which superficially looks like the terminal/editor combination I use in Windows and Ubuntu!

It’s worth noting that on the MacBook the bash terminal I am using is a native part of the operating system, as it is for the Ubuntu VM; on Windows the bash terminal is bolted on to make various open source tools work.

Physically the machine is beautiful. My Vaio is quite pretty, but compared to the Air it is fat and heavy. The Air has no hard disk indicator light; it has no hard disk, rather a 256GB SSD, which means it boots really fast. 256GB is a bit small for me these days – with a title of data scientist I tend to stick big datasets on my laptop.

So far I’ve been getting used to using cmd+c and cmd+v to copy and paste, having repeatedly overwritten things with “v” after doing the Windows ctrl+v. I’m getting used to the @ and ” keys being in the wrong place, and to the menu bar for applications always appearing at the top of the screen, not the top of the application window. Fortunately I can configure the trackpad to simulate a two-button mouse, rather than the default one-button scheme. I find the Apple menu bar at the top a bit too small and austere, and the Dock at the bottom is a bit cartoony. The Notes application is a travesty – a little faux notebook – although I notice in OS X Mavericks it is more business-like.

For work I don’t anticipate any great problems in working entirely on a Mac: we use Google Apps for email and make extensive use of Google Docs, and we use online services like Trello, GitHub and Pivotal in place of client side applications. Most of the coding I do is in Python. The only no-go area is Tableau, which is currently only available on Windows.

I’ve never liked the OS wars; perhaps it was a transitional thing. I grew up in a time when there was a plethora of home computers. I’ve written programs on TRS-80s, the Commodore VIC-20, Amstrad CPC464s and the Sinclair ZX81, and been aware of many more. At work I’ve used DEC Alphas, VAX/VMS and also PCs and Macs. Latterly everything is on the web, so the OS is just a platform for a browser.

I’m thinking of strapping the Air and the Vaio back to back to make a triple booting machine!

Messier and messier

Regular readers with a good memory will recall I bought a telescope about 18 months ago. I bemoaned the fact that I bought it in late Spring, since it meant it got dark rather late. I will note here that astronomy is generally incompatible with a small child who might wake you up in the middle of the night, requiring attention and early nights.

Since then I’ve taken pictures of the sun, the moon, Jupiter and Saturn, and as a side project I also took wide angle photos of the Milky Way and star trails (telescope not required). Each of these brought their own challenges, and awe. The sun because it’s surprisingly difficult to find the thing in your viewfinder with the serious filter required to stop you blinding yourself when you do find it. The moon because it’s just beautiful and fills the field of view, rippling through the “seeing” or thermal turbulence of the atmosphere. Jupiter because of its Galilean moons, first observed by Galileo in 1610. Saturn because of its tiny ears; I saw Saturn on my first night of proper viewing. As the tiny image of Saturn floated across my field of view I was hopping up and down with excitement like a child.

I’ve had a bit of a hiatus in the astrophotography over the past year but I’m ready to get back into it.

My next targets for astrophotography are the Deep Sky Objects (DSOs); these are largish, faint things, as opposed to planets, which are smallish, bright things. My accidental wide-angle photos clued me in to the possibilities here. I’d been trying to photograph constellations, which turn out to be a bit dull; at the end of the session I put the sensitivity of my camera right up and increased the exposure time, and suddenly the Milky Way appeared! Even in rural Wales it was only just visible to the naked eye.

Now I’m keen to explore more of these faint objects. The place to start is the Messier Catalogue of objects, compiled by Charles Messier and Pierre Méchain in the latter half of the 18th century. You may recognise the name Méchain: he was one of the two Frenchmen who surveyed France on the cusp of the Revolution to define a value for the metre. Ken Alder’s book, The Measure of All Things, describes their adventures.

Messier and Méchain weren’t interested in the deep sky objects themselves – they were interested in comets, and compiled the list in order not to be distracted from their studies by other, non-comety objects. The list comprises star clusters, nebulae and galaxies. I must admit to being a bit dismissive of star clusters. The Messier list is by no means exhaustive: the observations were all made in France with a small telescope, so there are no objects from the Southern skies. But they are ideal for amateur astronomers in the Northern hemisphere, since the high tech, professional telescope of the 18th century is matched by the consumer telescope of the 21st.

I’ve known of the Messier objects since I was a child, but I have no intuition as to where they are, or how bright and how big they are. So to get me started I found some numbers and made some plots.

The first plot shows where the objects are in the sky. They are labelled, somewhat fitfully, with their Messier number and common name. Their locations are shown by declination – how far away from the celestial equator an object is, towards the North Pole – and right ascension – how far around it is along a line of celestial latitude. I’ve added the moon to the plot in a fixed position close to the top left. As you can see, the majority of the objects are North of the celestial equator. The size of the symbols indicates the relative size of the objects. The moon is shown to the same scale, and we can see that a number of the objects are larger than the moon; these are often star clusters, but galaxies such as Andromeda – the big purple blob on the right – and the Triangulum Galaxy are also bigger than the moon. As is the Orion nebula.

Position
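An aside on the coordinates: right ascension is conventionally quoted in hours, minutes and seconds – 24 hours make a full circle – so before plotting it needs converting to degrees. A minimal sketch of the conversion:

```python
def ra_to_degrees(hours, minutes=0.0, seconds=0.0):
    # 24 hours of right ascension span the full 360 degrees,
    # so one hour corresponds to 15 degrees.
    return (hours + minutes / 60.0 + seconds / 3600.0) * 15.0

# The Andromeda Galaxy (M31) sits at roughly RA 0h 42m 44s
print(ra_to_degrees(0, 42, 44))  # about 10.68 degrees
```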

So why aren’t we as familiar with these objects as we are with the moon? The second plot shows how bright the Messier objects are and their size. The horizontal axis shows their apparent size – it’s a linear scale, so an object twice as far from the vertical axis is twice as big. Note that these are apparent sizes: some things appear larger than others because they are closer. The vertical axis shows the apparent brightness; in astronomy brightness is measured in units of “magnitude”, which is a logarithmic scale. This means that although the sun is roughly magnitude –26 and the moon is roughly magnitude –13, the sun is over 100,000 times brighter than the moon. The Messier objects are all much dimmer than Venus, Jupiter and Mercury and generally dimmer than Saturn.
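The magnitude arithmetic is easy to check: a difference of five magnitudes is defined as a factor of exactly 100 in brightness, so the ratio between two objects is 100 raised to a fifth of the magnitude gap.

```python
def brightness_ratio(m_bright, m_dim):
    # Five magnitudes = a factor of exactly 100 in brightness,
    # so the ratio is 100 ** (magnitude gap / 5).
    return 100 ** ((m_dim - m_bright) / 5.0)

# Sun (~ magnitude -26) versus full moon (~ magnitude -13)
print(round(brightness_ratio(-26, -13)))  # roughly 160,000
```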

Size-Magnitude

So the Messier objects are often bigger, but dimmer, than things I have already photographed. But wait – the moon fills the field of view of my telescope. And not only that, my telescope has a focal ratio of f/10, a measure of its light gathering power. This is actually rather “slow” for a camera lens; my “fastest” lens is f/1.4, which represents a 50-fold larger light gathering power.

For these two reasons I have ordered a new lens for my camera, a Samyang 500mm f/6.3. This is going to give me a bigger field of view than my telescope, which has a focal length of 1250mm, and also more light gathering power – the new lens should have more than double the light gathering power of the telescope!
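The light gathering comparison is just the inverse square of the focal ratio – a quick sketch, using the numbers above:

```python
def light_gathering_ratio(f_slow, f_fast):
    # Image brightness scales as the inverse square of the focal ratio,
    # so a "faster" (smaller) f-number gathers more light per unit area.
    return (f_slow / f_fast) ** 2

print(round(light_gathering_ratio(10, 1.4)))     # ~51: the f/1.4 lens vs the f/10 telescope
print(round(light_gathering_ratio(10, 6.3), 1))  # ~2.5: the new f/6.3 lens vs the telescope
```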

Watch this space for the results of my new purchase!

Sublime

Sublime Text

Coders can be obsessive about their text editors, dividing into relatively good natured camps. It is text editors, not development environments, over which they obsess, and the great schism is between the followers of vim and those of Emacs. The line between text editor and development environment can be a bit fuzzy. A development environment is designed to help you do all the things required to make working software (writing, testing, compiling, linking, debugging, organising projects and libraries), whilst a text editor is designed to edit text. But sometimes text editors get mission creep.

vim and emacs are both editors with a long pedigree on Unix systems. vim’s parent, vi, came into being in 1976, with vim being born in 1991; vim stands for “Vi Improved”. Emacs was also born in 1976. Glancing at the emacs Wikipedia page, I see there are elements of religiosity in the conflict between them.

To users of OS X and Windows, vim and emacs look and feel, frankly, bizarre. They came into being before windowed GUI interfaces existed. In basic mode they offer a large blank screen with no icons or even text menu items, just a status line and a command line at the bottom of the screen. Users interact by issuing keyboard commands: they are interfaces with only keyboard shortcuts. It’s said that the best way to generate a random string of characters is to put a class of naive computer science undergraduates down in front of vim and tell them to save the file and exit the program! In fact, to demonstrate the point, I’ve just trapped myself in emacs whilst trying to take a screenshot.


vim, image by Hermann Uwe


emacs, image by David Mundy

vim and emacs are both incredibly extensible, they’re written by coders for coders. As a measure of their flexibility: you can get twitter clients which run inside them.

I’ve used both emacs and vim but warmed to neither: I find them ugly to look at and confusing, and I don’t sit in front of an editor enough of the day to make remembering keyboard shortcuts a comfortable experience. I’ve used the Matlab, Visual Studio and Spyder IDEs but never felt impassioned enough to write a blog post about them. I had a bad experience with Eclipse, which led to one of my more valued Stack Overflow answers.

But now I’ve discovered Sublime Text.

Sublime Text is very beautiful, particularly beside vim and emacs. I like the little inset in the top right of my screen which shows the file I’m working on from an eagle’s perspective, and the nice rounded tabs. The colour scheme is subtle and muted, and I can get a panoply of variants on the theme. At Unilever we used to talk about trying to delight consumers with our products – Sublime Text does this. My only wish is that it went the way of Google Chrome and got rid of the Windows bar at the top.

Not only this: as with emacs and vim, I can customise Sublime Text with code, or use packages other people have written, in my favoured language, Python.

I use Sublime Text mainly to code in Python, using a Git Bash prompt to run code and to check it into source control. At the moment I have the following packages installed:

  • Package Control – for some reason the thing that makes it easy to add new packages to Sublime Text comes as a separate package which you need to install manually;
  • PEP8 Autoformat – languages have style guides: soft guidelines to ensure consistent use of whitespace, capitalisation and so forth. Some people get very uptight about style. PEP8 is the Python style guide, and PEP8 Autoformat allows you to effortlessly conform to it and so avoid friction with your colleagues;
  • Cheat Sheets – I can’t remember how to do anything, so cheat sheets built into the editor make it easy to find things, and you can add your own cheat sheets too;
  • Markdown Preview – Markdown is a way of writing HTML without all the pointy brackets; this package helps you view the output of your Markdown;
  • SublimeRope – a handy package that tells you when your code won’t run and helps with autocompletion. Much better than cryptic error messages when you try to run faulty code. I suspect this is the most useful one so far;
  • Git and GitGutter – integrating Git source control into the editor. Git provides all the Git commands on a menu, whilst GitGutter adds markers in the margin (or gutter) showing the revision status. These work nicely on Ubuntu but I haven’t worked out how to configure them on Windows;
  • SublimeREPL – brings a Python prompt into the editor. There are some configuration subtleties here when working with virtual environments.

I know I’ve only touched the surface of Sublime Text but unlike other editors I want to learn more!

Face ReKognition


This post was first published at ScraperWiki. The ReKognition API has now been withdrawn.

I’ve previously written about social media and the popularity of our Twitter Search and Followers tools. But how can we make Twitter data more useful to our customers? Analysing the profile pictures of Twitter accounts seemed like an interesting thing to do, since they are often the faces of the account holders, and a face can tell you a number of things about a person, such as their gender, age and race. This type of demographic information is useful for marketing and for understanding who your product appeals to. It could also be a way of tying together public social media accounts, since people like me use the same image across multiple accounts.

Compact digital cameras have offered face recognition for a while, and on my PC, Picasa churns through my photos identifying the people in them. I’ve been doing image analysis for a long time, although never before on faces. My first effort at face recognition involved the OpenCV library. OpenCV provides a whole suite of image analysis functions which do far more than just detect faces. However, getting it installed and working with the Python bindings on a PC was a bit fiddly, the documentation was poor and the built-in face analysis capabilities were limited.

Fast forward a few months, and I spotted that someone had cast the ReKognition API over the images that the British Library had recently released – a dataset I’ve been poking around at too. The ReKognition API takes an image URL and a list of characteristics in which you are interested. These include gender, race, age, emotion, whether or not you are wearing glasses or, oddly, whether you have your mouth open. Besides this summary information it returns a list of feature locations (i.e. the locations in the image of the eyes, mouth, nose and so forth). It’s straightforward to use.
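Since the API has now been withdrawn I can’t show a live call, but the shape of a request was roughly as below. Note the parameter names here are illustrative assumptions from memory, not the documented interface:

```python
# Illustrative only: the ReKognition API is withdrawn, and these
# parameter names are assumptions, not the documented interface.
def build_face_request(image_url, attributes):
    # An image URL plus the list of characteristics we want analysed.
    return {
        "url": image_url,
        "jobs": "face_" + "_".join(attributes),
    }

params = build_face_request(
    "http://example.com/profile.jpg",
    ["gender", "age", "emotion", "glass"],
)
print(params["jobs"])  # face_gender_age_emotion_glass
```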

But who should be the first targets for my image analysis? Obviously, the ScraperWiki team! The pictures are quite small, but ReKognition identified me as a “Happy, white, male, age 46 with no glasses on and my mouth shut”. Age 46 is a bit harsh – I’m actually 39 in my profile picture. A second target came out as “Happy, Indian, male, age 24.7, with glasses on and mouth shut”. This was fairly accurate: Zarino was 25 when the photo was taken, he is male and has his glasses on, but he is not Indian. Two (male) members of the team have still not forgiven ReKognition for describing them as female, particularly the one described as a 14 year old.

Fun as it was, this doesn’t really count as an evaluation of the technology, so I investigated further by feeding in the photos of a whole load of famous people. The results are shown in the chart below. The horizontal axis is someone’s actual age; the vertical axis shows their age as predicted by ReKognition. If the predictions were correct, the points representing the celebrities would fall on the solid line. The dotted line shows a linear regression fit to the data. The equation of the line, y = 0.673x (I constrained it to pass through zero), tells us that age is consistently under-predicted by a third – or perhaps celebrities look younger than they really are! The R2 parameter tells us how good the fit is: a value of 0.7591 is not too bad.

ReKognitionFacePeopleChart

I also tried out ReKognition on a couple of class photos – taken at reunions, graduations and so forth. My thinking here being that I would get a cohort of people aged within a year of each other. These actually worked quite well; for older groups of people I got a standard deviation of only 5 years across a group of, typically, 10 people. A primary school class came out at 16+/-9 years, which wasn’t quite so good. I suspect the performance here is related to the fact that such group photos are taken relatively carefully and the lighting and setup for each face in the photo is, by its nature, the same.

Looking across these experiments: ReKognition is pretty good at finding faces in photos, and at not finding faces where there are none (about 90% accurate). It’s fairly good with gender (getting it right about 80% of the time, typically struggling a bit with younger children), and it detects glasses pretty well. I don’t feel I tested it well on race. On age the results are variable: for the ScraperWiki set the R2 value for linear regression between actual and detected ages is about 0.5, whilst for famous people it is about 0.75. In both cases it tends to under-estimate age, and it has never given an age above 55 despite being fed several more mature celebrities and grandparents. So on age it definitely tells you something, and under certain circumstances it can be quite accurate. Don’t forget the images we’re looking at are completely unconstrained – they’re not passport photos.

Finally, I applied face recognition to Twitter followers for the ScraperWiki account, and my personal account. The Summarise This Data tool on the ScraperWiki Platform provides a quick overview of the data added by face recognition.

face_recognition_data

It turns out that a little over 50% of the followers of both accounts have a picture of a human face as their profile picture. It’s clear the algorithm makes the odd error, mis-identifying things that are not human faces as faces (including the back of a London taxi cab). There’s also the odd sketch or cartoon of a face rather than a photo, and some accounts have pictures of famous people rather than the account holder. Roughly a third of the followers of either account are identified as wearing glasses, and three quarters of them look happy. The average age in both cases was 30. The breakdown in terms of race is 70:13:11:7 White:Asian:Indian:Black. Finally, my followers are approximately 45% female, and those of ScraperWiki are about 30% female.

We’re now geared up to apply this to lists of Twitter followers – are you interested in learning more about your followers? Then send us an email and we’ll be in touch.

Book review: Hadoop in Action by Chuck Lam


This review was first published at ScraperWiki.

Hadoop in Action by Chuck Lam provides a brief, fairly technical introduction to the Hadoop Big Data ecosystem. Hadoop is an open source implementation of the MapReduce framework originally developed by Google to process huge quantities of web search data. The name MapReduce refers to dividing up jobs amongst multiple processors (“Mapping”) and then recombining the results to provide an answer to the problem (“Reducing”). Hadoop allows users to process large quantities of data by distributing it across a network of relatively cheap computers. Each computer in the network processes a portion of the data, and at the end the results are combined together to give the final result. Hadoop provides the infrastructure to enable this; in a sense it is a distributed operating system which provides fundamental services to applications such as Hive and Pig.
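The idea can be sketched in a few lines of plain Python with the canonical word count example – this is just the map/reduce pattern, not Hadoop itself, which adds the distribution, fault tolerance and storage:

```python
from collections import defaultdict

# "map" emits (word, 1) pairs from each chunk of input; "reduce" sums
# the counts for each word. Hadoop does the same thing, but spreads
# the chunks across a cluster of machines.

def map_phase(chunk):
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

chunks = ["big data is big", "data about data"]
pairs = [p for chunk in chunks for p in map_phase(chunk)]
print(reduce_phase(pairs))  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```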

At ScraperWiki we’ve had many philosophical discussions about the meaning of Big Data. What size is Big Data? Is it one million lines? Is it one billion lines? Should we express it in terms of gigabytes and terabytes rather than lines?

For many, Big Data is data that requires you use Hadoop or similar to process.

Our view is that Big Data is data big enough to break the tools or hardware you commonly use, so for many of our customers this is a software limit based on Microsoft Excel. Technically Excel can handle a million or so lines but practically speaking life gets uncomfortable as you go above a few tens of thousands of rows.

The largest dataset a customer has come to us with so far is the UK MOT test data – the results of the roadworthiness test for every vehicle on the road in the UK that is over three years old. This dataset is about 100 million lines and 4 gigabytes per year; it’s available back to 2005, giving a total of approximately 1 billion lines and 32GB. A single year can be readily analysed on an Intel i7 laptop with 8GB RAM, MySQL and Tableau – by “readily” I mean that some indexing jobs can take up to an hour, but once indexed most queries take 10–20 minutes maximum.

At ScraperWiki a number of us have backgrounds in the physical sciences, where we’ve been interested in computationally intensive simulations involving clusters of commodity hardware which pre-date Hadoop (Beowulf), or big data from the Large Hadron Collider. Physical scientists have long been interested in parallel computing, where the amount of data to move around is small but the amount of calculation to do is large. The point here is that parallel computing is possible for a subset of problems where a task can be divided up into smaller chunks of data and processing to be distributed amongst multiple processors or machines. In the case of photorealistic computer graphics rendering this might be frames of video, or a portion of a whole scene. Software like Matlab, Fortran and computer graphics renderers can parallelise certain operations with relative ease; the difficulty has always been turning your big computing problem into one of those “certain operations”. The Large Hadron Collider is an example more suited to the Hadoop-style approach: the data flows are enormous, but the calculations performed on that data are, comparatively, less troublesome.

Hadoop in Action spends a significant amount of time discussing the core Hadoop system and the MapReduce processing framework. I must admit to finding this part rather dull. I perked up when we reached Pig, described as a data processing language, and Hive, a SQL-like system. One gets the impression the Pig system was built around a naming convention pushed too far: the Pig command line is called Grunt and the language used by Pig is Pig Latin. Pig and Hive look like systems where I could sit down and run some queries with a language that looks like my old friend, SQL.

The book finishes with some case studies: an image conversion problem, machine learning and data processing at China Mobile, the StumbleUpon social bookmarking system, and providing search for IBM’s intranet. In the latter three cases the users were migrating from SQL-based systems running on monolithic hardware. To give an idea of scale: China Mobile collects terabytes of data per day across hundreds of millions of customers, the IBM intranet has something like 100 million pages and 16 million documents, and StumbleUpon has 25 million users clicking their Stumble buttons about 1 billion times in a month.

Overall, a handy introduction to Hadoop although perhaps oddly pitched – it’s probably too technical for most managers, not technical enough for system administrators and with insufficient applications for data scientists. If you want to get hands on experience of Hadoop, then the Hortonworks Sandbox provides a pre-packaged virtual machine with a web interface for you to try out the various technologies.

If you want us to help you get value out of your big data or even Big Data, please get in touch!