Adventures in Kaggle: Forest Cover Type Prediction


This post was first published at ScraperWiki.

Regular readers of this blog will know I’ve read quite a few machine learning books; now it’s time to put that learning into action. We’ve done some machine learning for clients, but I thought it would be good to do something I could share. The Forest Cover Type Prediction challenge on Kaggle seemed to fit the bill. Kaggle is the self-styled home of data science; it hosts a variety of machine learning oriented competitions, ranging from introductory, knowledge-building ones (such as this) to commercial ones with cash prizes for the winners.

In the Forest Cover Type Prediction challenge we are asked to predict the type of tree found on 30x30m squares of the Roosevelt National Forest in northern Colorado. The features we are given include the altitude at which the land is found, its aspect (direction it faces), various distances to features like roads, rivers and fire ignition points, soil types and so forth. We are provided with a training set of around 15,000 entries where the tree types are given (Aspen, Cottonwood, Douglas Fir and so forth) for each 30x30m square, and a test set for which we are to predict the tree type given the “features”. This test set runs to around 500,000 entries. This is a straightforward supervised machine learning “classification” problem.

The first step must be to poke about at the data; I did a lot of this in Tableau. The feature most obviously providing predictive power is the elevation, or altitude, of the area of interest. This is shown in the figure below for the training set: we see Ponderosa Pine and Cottonwood predominating at lower altitudes, transitioning to Aspen, Spruce/Fir and finally Krummholz at the highest altitudes. Reading Wikipedia we discover that Krummholz is not actually a species of tree, rather something that happens to trees of several species in the cold, windswept conditions found at high altitude.

[Figure 1: distribution of elevation by cover type in the training set]
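
For anyone who wants to reproduce this sort of view in code rather than Tableau, here is a rough sketch (not part of the original analysis) using pandas and matplotlib, assuming the Kaggle train.csv with its Elevation and Cover_Type columns:

```python
# A rough sketch of the elevation view: a histogram of elevation for each
# cover type in the training set. File path and column names are those of
# the Kaggle competition data.
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")
for cover_type, group in train.groupby("Cover_Type"):
    plt.hist(group["Elevation"], bins=50, histtype="step", label=str(cover_type))
plt.xlabel("Elevation (m)")
plt.ylabel("Number of 30x30m squares")
plt.legend(title="Cover_Type")
plt.show()
```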

Data inspection over, I used the scikit-learn library in Python to predict tree type from features. scikit-learn makes it ridiculously easy to jump between classifier types: the interface for each classifier is the same, so once you have one running, swapping in another classifier is a matter of a couple of lines of code. I tried out a couple of variants of Support Vector Machines, decision trees, k-nearest neighbours, AdaBoost and the Extremely Randomised Trees ensemble classifier (ExtraTrees). This last was best at classifying the training set.
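
To show how little changes between classifiers, here is a minimal sketch of the pattern, not the exact competition code; the cross-validation settings are illustrative:

```python
# Minimal sketch of the uniform scikit-learn interface: swapping classifiers
# is a one-line change because they all expose the same fit/predict methods.
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")                       # Kaggle training set
X = train.drop(columns=["Id", "Cover_Type"])
y = train["Cover_Type"]

# Swap in SVC(), KNeighborsClassifier(), DecisionTreeClassifier() or
# AdaBoostClassifier() here and nothing else needs to change.
clf = ExtraTreesClassifier(n_estimators=100, random_state=0)

print(cross_val_score(clf, X, y, cv=5).mean())
```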

The challenge is in mangling the data into the right shape and selecting the features to use; this is the sort of pragmatic knowledge learnt by experience rather than book-learning. As a long-time data analyst I took the opportunity to try something: essentially my analysis programs would only run when the code had been committed to git source control, and the SHA of the commit, its unique identifier, was stored with the analysis. This means that I can return to any analysis output and recreate it from scratch. Perhaps unexceptional for those with a strong software development background, but a small novelty for a scientist.
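
A minimal sketch of this idea, not the exact code in the repository mentioned below: refuse to run if there are uncommitted changes, then record the commit SHA alongside the analysis output.

```python
# Sketch: only run the analysis from a clean git working tree, and store the
# commit SHA with the output so any result can be recreated from that commit.
import subprocess

def current_commit_sha():
    """Return the HEAD commit SHA, refusing to run with uncommitted changes."""
    dirty = subprocess.run(["git", "status", "--porcelain"],
                           capture_output=True, text=True, check=True).stdout.strip()
    if dirty:
        raise RuntimeError("Commit your changes before running the analysis.")
    return subprocess.run(["git", "rev-parse", "HEAD"],
                          capture_output=True, text=True, check=True).stdout.strip()

if __name__ == "__main__":
    sha = current_commit_sha()
    with open("analysis_output.txt", "w") as out:
        out.write(f"git commit: {sha}\n")
        # ... analysis results written here alongside the SHA ...
```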

Using a portion of the training set to do an evaluation, it looked like I was going to do really well on the Kaggle leaderboard, but on first uploading my competition solution things looked terrible! It turns out this was a common experience, and is a result of the relative composition of the training and test sets. Put crudely, the test set is biased to higher altitudes than the training set, so using a classifier trained on the unmodified training set leads to poorer results than expected from measurements on a held-back part of the training set. You can see the distribution of elevation in the test set below, and compare it with the training set above.

[Figure 2: distribution of elevation in the test set]

We can fix this problem by biasing the training set to more closely resemble the test set; I did this on the basis of the elevation. This eventually got me to rank 430 on the leaderboard, shown in the figure below. We can see here that I’m somewhere up the long shallow plateau of performance. There is a breakaway group of about 30 participants doing much better, and at the bottom there are people who perhaps made large errors in analysis but got rescued by the robustness of machine learning algorithms (I speak from experience here!).

[Figure 3: final leaderboard position]
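
For the curious, here is a sketch of one way to bias the training set on elevation, re-weighting and resampling training rows so their elevation distribution matches the test set; this is an illustration of the approach rather than the exact code used:

```python
# Sketch: re-weight and resample the training set so its elevation
# distribution matches the test set. Bin elevation, weight each training row
# by the test/train density ratio of its bin, then resample with replacement.
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

edges = np.linspace(min(train.Elevation.min(), test.Elevation.min()),
                    max(train.Elevation.max(), test.Elevation.max()), 31)
train_hist, _ = np.histogram(train.Elevation, bins=edges, density=True)
test_hist, _ = np.histogram(test.Elevation, bins=edges, density=True)

bin_idx = np.clip(np.digitize(train.Elevation, edges) - 1, 0, len(train_hist) - 1)
weights = test_hist[bin_idx] / np.maximum(train_hist[bin_idx], 1e-9)

# A training set of the same size, but biased towards test-set elevations.
biased_train = train.sample(n=len(train), replace=True,
                            weights=weights, random_state=0)
```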

There is no doubt some mileage in tuning the parameters of the different classifiers and no doubt winning entries use more sophisticated approaches. scikit-learn does pretty well out of the box, and tuning it provides marginal improvement. We observed this in our earlier machine learning work too.
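
For completeness, the sort of tuning meant here is easy to try with scikit-learn’s GridSearchCV; a minimal sketch with an illustrative parameter grid:

```python
# Illustrative parameter grid only; the point is that tuning is cheap to try,
# even if the improvement over the defaults is marginal.
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

train = pd.read_csv("train.csv")
X, y = train.drop(columns=["Id", "Cover_Type"]), train["Cover_Type"]

param_grid = {"n_estimators": [100, 300], "max_features": ["sqrt", None]}
search = GridSearchCV(ExtraTreesClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```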

I have mixed feelings about the Kaggle competitions. The data is nicely laid out, the problems are interesting and it’s always fun to compete. They are a great way to dip your toes in semi-practical machine learning applications. The size of the awards means it doesn’t make much sense to take part on a commercial basis.

However, the data are presented in such a way as to exclude the use of domain knowledge; they are set up very much as machine learning challenges. Look down the list of competitions and see how many of them feature obfuscated data, likely for reasons of commercial confidentiality or to make a problem more “machine learning” and less amenable to domain knowledge. To a physicist this is just a bit offensive.

If you are interested in a slightly untidy blow-by-blow account of my coding then it is available here in a Bitbucket repository.

Book review: How Linux Works by Brian Ward

This review was first published at ScraperWiki.

There has been a break since my last book review because I’ve been coding, rather than reading, on the commute into the ScraperWiki offices in Liverpool. Next up is How Linux Works by Brian Ward. In some senses this book follows on from Data Science at the Command Line by Jeroen Janssens. Data Science at the Command Line was about doing analysis with command line incantations; How Linux Works tells us about the system in which that command line exists and makes the incantations less mysterious.

I’ve had long experience with doing analysis on Windows machines, typically using Matlab, but over many years I have also dabbled with Unix systems including Silicon Graphics workstations, DEC Alphas and, more recently, Linux. These days I use Ubuntu to ensure compatibility with my colleagues and the systems we deploy to the internet. Increasingly I need to know more about the underlying operating system.

I’m looking to monitor system resources, manage devices and configure my environment. I’m not looking for a list of recipes; I’m looking for a mindset. How Linux Works is pretty good in this respect. I had a fair understanding of pipes in *nix operating systems before reading the book; a fundamental I did learn from How Linux Works was that files are used to represent processes and memory. The book is also good on where these files live, although this varies a bit with distribution and over time. Files are used liberally to provide configuration.
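
As a small illustration of the “files represent processes and memory” idea (my own example, not the book’s), on Linux you can read system memory and process information from ordinary files under /proc:

```python
# Process and memory information are exposed as ordinary files under /proc,
# so they can be read like any other text file. Linux only.
import os

# Whole-system memory statistics live in /proc/meminfo.
with open("/proc/meminfo") as f:
    print(f.readline().strip())          # e.g. "MemTotal:   16314640 kB"

# Each process gets a directory named after its PID; "status" describes it.
with open(f"/proc/{os.getpid()}/status") as f:
    for line in f:
        if line.startswith(("Name", "VmRSS")):
            print(line.strip())
```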

The book has 17 chapters covering the basics of Linux and the directory hierarchy, devices and disks, booting the kernel and user space, logging and user management, monitoring resource usage, networking and aspects of shell scripting and developing on Linux systems. They vary considerably in length, with those on developing relatively short. There is an odd chapter on rsync.

I got a bit bogged down in the chapters on disks, how the kernel boots, how user space boots and networking. These chapters cover their topics in excruciating detail, much more than is required for day-to-day operations. The user startup chapter tells us about systemd, Upstart and System V init, three alternative mechanisms for booting user space. Systemd is the way of the future, in case you were worried. Similarly, the chapters on booting the kernel and managing disks at a very low level provide more detail than you are ever likely to need. The author does suggest the more casual reader skip through the more advanced areas, but frankly this is not a directive I can follow. I start at the beginning of a book and read through to the end; none of this “skipping bits” for me!

The user environments chapter has a nice section clearly explaining the sequence of files accessed for profile information when a terminal window is opened or some other login-like activity takes place. Similarly, the chapters on monitoring resources seem to be pitched at just the right level.
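
As a crude illustration (mine, not the book’s), this checks which of the usual bash profile candidates exist on a system; exactly which are read, and in what order, depends on whether the shell is a login shell:

```python
# Check which of the usual bash profile files are present. Which of these a
# shell actually reads, and in what order, depends on whether it is a login
# shell; see the book for the full sequence.
import os

candidates = ["/etc/profile", "~/.bash_profile", "~/.bash_login",
              "~/.profile", "~/.bashrc"]
for path in candidates:
    full = os.path.expanduser(path)
    print(f"{full}: {'present' if os.path.exists(full) else 'absent'}")
```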

Ward’s task is made difficult by the complexity of the underlying system. Linux has an air of “if it’s broke, fix it and if it ain’t broke, fix it anyway”. Ward mentions at one point that a service in Linux had not changed for a while and was therefore ripe for replacement! Each new distribution appears to have heard about standardisation (i.e. where to put config files) but has chosen to ignore it. And if there is consistency in the options to Linux commands it is purely coincidental. I think this is my biggest bugbear with Linux: I know which command to use, but the right option flags are more blindly remembered than understood.

The more Linux-oriented faction of ScraperWiki seemed impressed by the coverage of the book. The chapter on shell scripting is enlightening, providing the mindset rather than the detail, so that you can solve your own problems. It’s also pragmatic in highlighting where to stop with shell scripting and move to another language. I was disturbed to discover that the open square bracket character in shell script is actually a command. This “explain the big picture rather than trying to answer a load of little questions” approach is the mark of a good technical book. The detail you can find on Stack Overflow or by Googling.
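
You can convince yourself of this oddity without opening the book: on most Linux systems the open square bracket exists as a real executable, usually /usr/bin/[, which behaves like test. A small illustration (mine, not the book’s):

```python
# subprocess.run with a list does not involve the shell, so the "[" below is
# the external executable (usually /usr/bin/[), not a bash builtin. It behaves
# like test: the trailing "]" argument is required and the exit code carries
# the result.
import shutil
import subprocess

print(shutil.which("["))                              # e.g. /usr/bin/[
result = subprocess.run(["[", "1", "-lt", "2", "]"])
print(result.returncode)                              # 0 means "1 < 2" is true
```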

How Linux Works has a good bibliography; it could do with a glossary of commands and an appendix for the more in-depth material. That said, it’s exactly the book I was looking for, and the writing style is just right. For my next task I will be filleting it for useful commands, and if someone could see their way to giving me a Dell XPS Developer Edition for “review”, I’ll be made up.

Electoral Predictions

As something of a political anorak and a member of a political party, I thought I should make some political predictions for the 2015 General Election (polling on 7th May).

It’s difficult to call between Labour and the Tories as to who will have the most seats post-election. The polls show pretty much a dead heat, and I don’t see either party making great gains or losses over the campaign. Labour have had a built-in advantage which means that they get more seats for the same percentage of the vote as the Conservatives. It’s not clear if Labour’s “Scottish Problem” removes this advantage. My hunch is the Tories will do a little better than the polls are suggesting, but both will have something around the 270 seat mark with Labour in the lead.

As a Liberal Democrat, eternal optimism is something of a prerequisite. I joined the party when it had 22 seats and 22% of the national vote. I anticipate we’ll still get more than that after this election. I’m holding out for about 40, which is in excess of what pollsters predict. My reason for this optimism is that Liberal Democrat votes are local votes, and once in place Liberal Democrat MPs are difficult to dislodge, basically because they do a good job for their constituents (see here).

The most interesting thing will be the SNP. I’m finding it a bit difficult to believe the predictions of an SNP landslide in Scotland, perhaps erroneously. My thinking is that they lost the independence referendum by some margin outside the central belt. It’s possible that the rise of the SNP vote represents a wider disaffection with Labour, not seeing them as representing Scottish interests in Westminster. I’m really struggling to believe they will exceed 50 seats, but this is my biggest opportunity to be really wrong. A lower SNP seat count would benefit Labour.

UKIP will get nowhere; I actually have a bet with someone that they will have no further MPs beyond the two they currently have by defection. I anticipate they will poll relatively highly (i.e. above 10%), as will the Greens, who I don’t think will gain a further seat. I suspect many will be jolly glad that UKIP’s votes don’t convert into seats; perhaps fewer will be pleased when the same thing happens to the Greens for the same reason.

The likelihood is that no one party, or even pair of parties (aside from Labour and Tory), will have sufficient seats to form a stable coalition. The key interesting thing is whether the SNP or the Liberal Democrats will have sufficient seats to make a coalition government, or at least a confidence-and-supply arrangement, with either Labour or the Tories. It’s difficult to see the Liberal Democrats going for another round of coalition with the Tories with any enthusiasm; I see a fair chunk of the party being happy to try it with Labour. I’m not sure Labour will want coalition, and they appear to have ruled it out absolutely with the SNP.

A Labour-SNP coalition would very definitely not be what we voted for; no English or Welsh voter could have voted for the SNP, which is standing no candidates outside Scotland. The party leader, Nicola Sturgeon, is not standing for a Westminster seat. The votes SNP MPs cast in Westminster will often not affect their own constituents, only those of English and Welsh constituencies (see here). The SNP has shown little interest in matters south of the border other than to blame the English for all their woes, and to ensure that English students in Scottish universities pay English levels of fees whilst Scots, and indeed students from the rest of the EU, don’t. Not a happy guide as to how they might treat English interests in future. I can see a Labour-SNP coalition being quite damaging for Labour, since I can imagine both Tories and Liberal Democrats will make a great deal of this lack of mandate.

Left-leaning people, on the whole, appear happy with the electoral status quo. And so too do the Conservatives; the clue is in the name. But truly the Westminster system is broken. The Scots benefit from a fair degree of independence from Westminster, with elections held using proportional systems at both local and national level. The rest of the UK could do with the same.

…when the sun is eclipsed by the moon

Friday 20th March 2015 saw a solar eclipse visible over the British Isles, subject to the vagaries of the British weather. I have some form in taking pictures of the sun through my telescope. With solar eclipses taking place in the UK only once every 10 or so years (the last one was in 1999), I thought it worth the effort to take some pictures.

The key piece of equipment was the Baader AstroSolar filter mount I made a while back. It’s designed to fit on my telescope but works pretty well for naked-eye viewing and with my Canon 600D camera. I used a Canon 70-300mm lens, mainly at maximum zoom, with varying exposure parameters depending on cloud. I used autofocus in the main but manually set exposure time, aperture and ISO. Consumer cameras aren’t designed to give good automatic exposures for unusual activities such as eclipse observations.

Here’s a closeup of the filter:

[Photo: the Baader AstroSolar filter]

The uninitiated may not be impressed by the finish on this piece of equipment, but as a scientist of 20 years’ standing I’m happy to report that I’ve had plenty of stuff in my lab in a similar style; it’s good enough to do the job.

Solar eclipses last a surprisingly long time; this one was a little over two hours, with first contact of the moon on the sun’s disk at 8:26am in Chester. This photo was taken at 8:26am; you can just see the moon clipping the edge of the sun at the top right.

[Photo, 8:26am: first contact, the moon just clipping the top right of the sun]

By 9:01am things were well under way. The birds had started their evening song around this time and it was starting to feel unusually dark for the time of day.

[Photo, 9:01am: the eclipse well under way]

The maximum of the eclipse was at 9:30am; by this time clouds had appeared and I used them as an ad hoc solar filter.

[Photo, 9:30am: maximum eclipse, seen through cloud]

By 09:50am we were well past the maximum:

[Photo, 9:50am: well past maximum]

The last photo I managed was at 10:18 before the sun disappeared behind the clouds:

[Photo, 10:18am: last shot before the sun disappeared behind cloud]

Finally, this is a collage of the majority of pictures I took – some of them are pretty rough:

[Collage of eclipse photos, 20 March 2015]

Book review: Engineering Empires by Ben Marsden and Crosbie Smith

Commonly I read biographies of dead white men in the field of science and technology. My next book is related but a bit different: Engineering Empires: A Cultural History of Technology in Nineteenth-Century Britain by Ben Marsden and Crosbie Smith. This is a more academic tome, but rather than focussing on a particular dead white man it collects them together in a broader story. A large part of the book is about steam engines, with chapters on static steam engines, steamships and railways, but alongside these are chapters on telegraphy and on mapping and measurement.

The book starts with a chapter on mapping and measurement; there’s a lot of emphasis here on measuring the earth’s magnetic field. In the eighteenth and nineteenth centuries there was some hope that maps of magnetic field variation might provide help in determining longitude. The subject makes a reprise later on in the discussion of steamships. The problem isn’t so much the steam but that steamships were typically iron-hulled, which throws compass measurements awry unless careful precautions are taken. This was important as steamships were promoted for their claimed superior safety over sailing vessels, but risked running aground on the reef of dodgy compass behaviour in inshore waters. The social context for this chapter is the rise of learned societies to promote such work; the British Association for the Advancement of Science is central here, and is a theme through the book. In earlier centuries the Royal Society was more important.

The next three chapters cover steam power, first in the factory and the mine, then in boats and trains. Although James Watt plays a role in the development of steam power, the discussion here is broader, covering Ericsson’s caloric engine amongst many other things. Two themes of steam are the professionalisation of the steam engineer, and efficiency. “Professionalisation” in the sense that when businessmen made investments in these relatively capital-intensive devices they needed confidence in what they were buying into; a chap that appeared to have just knocked something up in his shed didn’t cut it. Students of physics will be painfully aware of thermodynamics and the theoretical efficiency of engines. The 19th century was when this field started, and it was of intense economic importance. For a static engine efficiency is important because it reduces running costs. For steamships efficiency is crucial: less coal for the same power means you don’t run out of steam mid-ocean!

Switching the emphasis of the book from people to broader themes casts the “heroes” in a new light. It becomes more obvious that Isambard Kingdom Brunel is a bit of an outlier, pushing technology to the limits and sometimes falling off the edge. The Great Eastern was a commercial disaster, only gaining a small redemption when it came to laying transatlantic telegraph cables. Success in this area came with the builders of more modest steamships dedicated to particular tasks such as the transatlantic mail and trips to China.

The book finishes with a chapter on telegraphy; my previous exposure to this was via Lord Kelvin, who had been involved in the first transatlantic electric telegraphs. The precursor to electric telegraphy was optical telegraphy, which had started to be used in France towards the end of the 18th century. Transmission speeds for optical telegraphy were surprisingly high: Paris to Toulon (on the Mediterranean coast), a distance of more than 800km, in 20 minutes. In Britain the electric telegraph took off when it was linked with the railways, which provided a secure, protected route along which to run the lines. Although the first inklings of electric telegraphy came in the mid-18th century, it didn’t get going until 1840 or so, but by 1880 it was a globe-spanning network crossing the Atlantic and reaching the Far East overland. It’s interesting to see the mention of Julius Reuter and the Associated Press back at the beginning of electric telegraphy; they are still important names now.

In both steamships and electric telegraphy Britain led the way because it had an Empire to run, and communication is important when you’re running an empire. Electric telegraphy was picked up quickly on the eastern seaboard of the US as well.

I must admit I was a bit put off by the introductory chapter of Engineering Empires, which seemed a bit heavy and spoke in historological jargon, but once underway I really enjoyed the book. I don’t know whether this was simply because I got used to the style or the style changed. As proper historians, Marsden and Smith do not refer to scientists in the earlier years of the 19th century as such; they are “gentlemen of science” and later “men of science”. They sound a bit contemptuous of the “gentlemen of science”. The book is a bit austere and worthy-looking. Overall I much prefer this manner of presentation of the wider context over a focus on a particular individual.