Review of the year: 2013

My blogging is much reduced this year, at least on my own blog. This is a result of my new job with ScraperWiki and of child care: Thomas is now nearly two years old.

I started the year with a couple of posts on my shiny new laptop; working for a startup I’ve escaped from the corporate Dell. One post was on the beast itself – a Sony VAIO, and Windows 8 – Microsoft’s somewhat confusing new operating system offering. The other post was on running Ubuntu on the VAIO. In the past this was a case of setting up dual boot but various innovations make this difficult and there is, in my view, a better solution: a virtual machine.

There wasn’t much ranting this year: I only managed one little one about higher education, and the reluctance amongst lecturers to take any teaching qualifications. The only other marginally opinionated piece was on electronic books, where I muttered about DRM limiting the functionality of ebooks.

I managed to read a few books which ended up on my own blog: The Eighth Day of Creation, about the unravelling of the genetic code, was a dense, heroic read. The Dinosaur Hunters was light and fluffy. Empire of the Clouds and The Backroom Boys were largely wistful rememberings of Britain’s former greatness in jet aeroplanes and in technology more generally. Chasing Venus and A History of the World in 12 Maps returned to the themes of geodesy and mapping which I’ve explored in the past. Finally, a bit of London history with The Subterranean Railway and Lucy Inglis’ Georgian London. I’ve been following Lucy on Twitter since Georgian London was a twinkle in her eye. It’s difficult to choose a favourite amongst these; it’s either A History of the World in 12 Maps or Georgian London.

Over on ScraperWiki’s blog I’ve been knocking out blog posts at a great rate; you can see them all here. I did a good deal of book reviewing over there too: my commute into work on the train means I get an hour or so of reading every day – which quickly adds up to a lot of reading! I read about machine learning, data visualisation (this and this), Tableau (this and this), natural language processing, R, Javascript and software engineering. I’m currently ploughing my way through Data Mining: Practical Machine Learning Tools and Techniques. I think my favourite of these was Natural Language Processing with Python. I’m beginning to see the value of the more expensive, better established publishing houses in terms of book quality.

Alongside this I did a few blog posts on new tools for my trade. I’ve long programmed to do scientific analysis but ScraperWiki is a company which sells software, and the discipline of writing software for others to use is different from writing software for yourself. Testing and source control are particularly important.

I spoke at a couple of events: Data Science London, and Strata London where I gave an Ignite talk. Ignite talks follow a special format: they are five minutes long and you get 20 slides which advance automatically at the rate of one every 15 seconds – a somewhat frantic experience. My talk is captured on video.

I also did some bits of data analysis; #InspiringWomen was a look at a response to the online bullying and abuse of women. A place in the country was about data on house prices which we had collected for a campaign by Shelter.

Back on my own blog I managed to do a couple of photographic posts, one on Liverpool. The rail loop under Liverpool was closed, which meant I had to walk across town to work, and I suddenly realised that Liverpool is rather spectacular architecturally. This led me on to getting the Pevsner Guide to Liverpool. The ScraperWiki office might be a bit unusual in that a quarter of the company owns this book! I also went on a business trip to Trento, which turns out to be a very attractive city. Unfortunately I only had my phone with me to take photos.

The last year has highlighted to me what a privilege it was to have so much time to spend on my blogging, photography and garden shed fiddling in the past. It’s what got me my new job but for many, equally able, people this investment of time simply isn’t possible with the other responsibilities they have. Something to consider the next time you’re recruiting, and so highly rating that extra-curricular activity.

Also I realise I have a great deal of theoretical knowledge about a whole pile of technologies but I have spent rather less time on actually doing anything with them, so maybe this coming year there’ll be less reading and more coding on the train.

Happy New Year to you all!

Book review: Tableau 8 – the official guide by George Peck

This review was first published at ScraperWiki.

A while back I reviewed Larry Keller’s book The Tableau 8.0 Training Manual; at the same time I ordered George Peck’s book Tableau 8: the official guide. It’s just arrived. The book comes with a DVD containing bonus videos featuring George Peck’s warm, friendly tones and example workbooks. I must admit to being mildly nonplussed at receiving optical media, my ultrabook lacking an appropriate drive, but I dug out the USB optical drive to load them up. Providing an online link would have allowed the inclusion of up to date material, perhaps covering the version 8.1 announcement.

Tableau is a data visualisation application, aimed at the business intelligence area and optimised to look at database shaped data. I’m using Tableau on a lot of the larger datasets we get at ScraperWiki for sense checking and analysis.

Colleagues have noted that analysis in Tableau looks like me randomly poking buttons in the interface. From Peck’s book I learn that the order in which I carry out random clicking is important since Tableau will make a decision on what you want to see based both on what you have clicked and also its current state.

To my mind the heavy reliance on the graphical interface is one of the drawbacks of Tableau, but clearly, to business intelligence users and journalists, it’s the program’s greatest benefit. It’s a drawback because capturing what you’ve done in a GUI is tricky. Some of the scripting/version control capability is retained since most Tableau files are in plain XML format with which a little fiddling is tacitly approved by Tableau – although you won’t find such info in The Official Guide. I’ve been experimenting with using git source control on workbook files, and it works.

If you’re interested in these more advanced techniques then the Tableau Knowledgebase is worth a look. See this article, for example, on making a custom colour palette. I also like the Information Lab blog, 5 things I wish I knew about Tableau when I started and UK Area Polygon Mapping in Tableau. The second post covers one of the bug-bears for non-US users of Tableau: the mapping functionality is quite US-centric.

Peck covers most of the functionality of Tableau, including data connections, making visualisations, a detailed look at mapping, dashboards and so forth. I was somewhat bemused to see the scatter plot described as “esoteric”. This highlights the background of those typically using Tableau: business people not physical scientists, and not necessarily business people who understand database query languages. Hence the heavy reliance on a graphical user interface.

I particularly liked the chapters on data connections which also described the various set, group and combine operations. Finally I understand the difference between data blending and data joining: joining is done at source between tables on the same database whilst blending is done on data from different sources by Tableau, after it has been loaded. The end result is not really different.

I now understand the point of table calculations – they’re for the times when you can’t work out your SQL query. Peck uses different language from Tableau in describing table calculations. He uses “direction” to refer to the order in which cells are processed and “scope” to refer to the groups over which cell calculations are performed. Tableau uses the terms “addressing” and “partitioning” for these two concepts, respectively.
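One way to picture the two ideas is a running total: the partition (Peck’s “scope”) decides where the total restarts, and the addressing (his “direction”) decides the order in which cells are accumulated. The following is a minimal sketch in plain Python, not Tableau’s implementation; the field names region, month and sales, and the numbers, are invented for illustration.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# Invented example rows; not taken from the book or from Tableau
rows = [
    {"region": "North", "month": 2, "sales": 30},
    {"region": "South", "month": 1, "sales": 5},
    {"region": "North", "month": 1, "sales": 10},
    {"region": "South", "month": 2, "sales": 15},
]


def running_total(rows, partition_by, address_by, value):
    """Running sum of `value`, restarted for each partition and
    accumulated in the order given by the addressing field."""
    ordered = sorted(rows, key=itemgetter(partition_by, address_by))
    result = []
    for _, group in groupby(ordered, key=itemgetter(partition_by)):
        group = list(group)
        totals = accumulate(row[value] for row in group)
        result.extend(
            {**row, "running_total": total} for row, total in zip(group, totals)
        )
    return result


table = running_total(rows, partition_by="region", address_by="month", value="sales")
for row in table:
    print(row["region"], row["month"], row["running_total"])
```

Swapping the two field names changes which cells are grouped together and the order in which they are summed, which is why the choice matters.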

Peck isn’t very explicit about the deep connections between SQL and Tableau but makes sufficient mention of the underlying processes to be useful.

It was nice to see a brief, clear description of the options for publishing Tableau workbooks. Public is handy and free if you want to publish to all. Tableau Online presents a useful halfway house for internal publication whilst Tableau Server gives full flexibility in scheduling updates to data and publishing to a range of audiences with different permission levels. This is something we’re interested in at ScraperWiki.

The book ends with an Appendix of functions available for field calculations.

In some ways Larry Keller’s and George Peck’s books complement each other: Larry’s book (which I reviewed here) contains the examples that George’s lacks, and George’s has some of the more in-depth discussion missing from Larry’s book.

Overall: a nicely produced book with high production values, good but not encyclopedic coverage.

Git!

This post was first published at ScraperWiki.

As a software company, use of some sort of software source control system is inevitable; indeed our CEO wrote TortoiseCVS – a file system overlay for the early CVS source control system. For those uninitiated in the joys of software engineering: source control is a system for recording the history of file revisions, allowing programmers to edit their code safe in the knowledge that they can always revert to a previous good state of code if it all goes horribly wrong. We use Git for source control, hosted either on Github or on Bitbucket. The differing needs of our platform and data services teams fit the payment plans of the two different sites.

Git is a distributed source control system created by Linus Torvalds, to support the development of Linux. Git is an incredibly flexible system which allows you to do pretty much anything. But what should you do? What should be your strategy for collective code development? It’s easy to look up a particular command to do a particular thing, but less is written on how you should string your git commands together. Here we hope to address this lack.

We use the “No Switch Yard” methodology. This involves creating branches from the master branch on which to develop new features, and regularly rebasing against the master branch so that when the time comes the feature branch can be merged into the master branch via a pull request with little fuss. We should not be producing a byzantine system by branching feature branches from other feature branches. The aim of “No Switch Yard” is to make the history as simple as possible and make merging branches back onto master as easy as possible.

How do I start?

Assuming that you already have some code in a repository, create a local clone of that repository:

git clone [email protected]:scraperwiki/myproject.git

Create a branch:

git checkout -b my-new-stuff

Start coding…adding files and committing changes as you go:

git add -u
git commit -m "everything is great"

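The -u switch to git add stages every tracked file that has been modified or deleted; new files still need to be added by name. Depending on your levels of paranoia you can push your branch back to the remote repository. The first push of a new branch needs to name the remote and set the upstream, assuming here that the remote has the default name origin; after that a plain git push is enough:

git push -u origin my-new-stuff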

How do I understand what’s going on?

For me the key revelation for workflow was to be able to find out my current state and feel pleasure when it was good! To do this, fetch any changes that may have been made on your repository:

git fetch

and then run:

git log --oneline --graph --decorate --all

This shows an ASCII art history diagram for your repository. What you are looking for here is a relatively simple branching structure without too many parallel tracks and with the tips of each branch lined up between your local and the remote copy.

You can make an alias to simplify this inspection:

git config --global alias.lg 'log --oneline --graph --decorate'

Then you can just do:

git lg --all

I know someone else has pushed to the master branch from which I branched – what should I do?

If stuff is going on on your master branch, perhaps because your changes are taking a while to complete, you should rebase. You should also do this just before submitting a pull request to merge your work with the master branch.

git rebase -i origin/master

This rebases your branch interactively against the remote master branch, assuming that the remote has the default name origin and that you have just run git fetch. Rebasing interactively means you can combine multiple commits into a single larger commit; you might want to do this if you made lots of little commits whilst achieving a single goal. Rebasing brings you up to date with another branch, without actually merging your changes into that branch.

I’m done, how do I give my colleagues the opportunity to work on my great new features?

You need to rebase against the remote branch onto which you wish to merge your code and then submit a pull request for your changes. You can submit a pull request from the web interface at Github or Bitbucket, or you can use a command line tool such as hub. The idea of using a pull request is that it makes your changes visible to your colleagues, and keeps a clear record of those changes. If you’ve been rebasing regularly you should be able to merge your code automatically.

An important principle here is “ownership”. In social terms you own your local branch on which you are developing a feature, so you can do what you like with it. The master branch from which you started work is in collective ownership, so you should only merge changes onto it with the permission of your colleagues, and ideally you want others to look at your changes and approve the pull themselves.

I started doing some fiddling around with my code and now I realise it’s serious and I want to put it on a branch, what do I do?

You need to stash your code, using:

git stash

Then create a branch, as described above, and then retrieve the contents of the stash:

git stash pop

That’s how we use git – what do you do?

A place in the country

This post was first published at ScraperWiki.

Recently Shelter came to us asking for data on house prices across the UK to help them with some research in support of a campaign on housing affordability.

This is a challenge we’re well suited to address; in fact a large fraction of the ScraperWiki team have scraped property price data for our own purposes. Usually though we just scrape a local area, using the Zoopla API, but Shelter wanted the whole country. It would be possible to do the whole country by this route but rate-limiting would mean it took a few days. So we spoke nicely to Zoopla, who generously lifted the rate-limiting for us; they were also very helpful in responding to our questions about their API.

The raw data for this job amounted to 2 gigabytes, 34 pieces of information for each of 500,000 properties for sale in the UK in August 2013. The data tell us about the location, the sale price, the property details, the estate agent details and the price history of each property.

As usual in these situations we fired up Tableau to get a look at the data. Tableau is well-suited to this type of database-table shaped data and is responsive for this number of lines of data.

What sort of properties are we looking at here?

We can find out this information from the “property type” field, shown in the chart below which counts the number of properties for sale in each property type category. The most common category is “Detached”, followed by “Flat”.

[Chart: number of properties for sale by property type]

We can also look at the number of bedrooms. Unsurprisingly the number of bedrooms peaks at about 3 but with significant numbers of properties with 4, 5 and 6 bedrooms. Beyond that there are various guest houses, investment properties and parcels of land for sale with nominal numbers of bedrooms, culminating in a 150 bedroom “property” which actually sounds like a village.

What about prices?

This is where things get really interesting. Below is a chart of the number of properties for sale in each £25k price “bin”. For example, the bin marked 475k contains all of the houses priced between £475k and £499,950; the next bin, labelled 500k, contains houses priced from £500k up to, but not including, £525k. We can see that the plot here is jagged: the number of properties for sale in each bin does not vary smoothly as the price increases, it jumps up and down. In fact this effect is quite regular: for houses priced over £500k there are fewest for sale at the round numbers £500k, £600k etc. and most for sale at £575k, £675k and so forth.

[Chart: number of properties for sale in each £25k price bin]
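The binning behind a chart like this can be sketched in a few lines of Python. This is an illustration only: we did the real analysis in Tableau, and the prices below are made up rather than taken from the Zoopla data. Each bin is labelled by its lower edge, so a house at £499,950 lands in the 475k bin.

```python
from collections import Counter

BIN_WIDTH = 25_000  # bin width in pounds


def bin_label(price, width=BIN_WIDTH):
    """Return the lower edge of the bin containing price, e.g. 499950 -> 475000."""
    return (price // width) * width


def price_histogram(prices, width=BIN_WIDTH):
    """Count how many prices fall into each bin, keyed by the bin's lower edge."""
    return Counter(bin_label(p, width) for p in prices)


# Made-up asking prices, purely for illustration (not the Zoopla data)
sample_prices = [475_000, 499_950, 500_000, 524_999, 575_000, 595_000, 599_950]
histogram = price_histogram(sample_prices)
for lower_edge in sorted(histogram):
    print(f"{lower_edge // 1000}k: {histogram[lower_edge]}")
```

Prices such as £595,000 and £599,950 both fall into the 575k bin, which is how asking prices set just below a round number pile up in the bin before it.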

But this doesn’t just affect the super-wealthy. If we zoom into the lower priced region, making our price bins only £1k wide, there is a similar effect, with prices ending in 4, 5 and 9, 0 more frequent than those ending in 1, 2, 3 or 6, 7, 8. This is all the psychology of pricing.
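A rough way to check for this sort of pattern is to count how often each digit turns up in the thousands place of the asking price. The sketch below assumes prices are whole numbers of pounds and uses invented prices, not the Zoopla data.

```python
from collections import Counter


def thousands_digit(price):
    """Return the digit in the thousands place, e.g. 124950 -> 4."""
    return (price // 1000) % 10


def digit_counts(prices):
    """Count how many prices have each thousands digit."""
    return Counter(thousands_digit(p) for p in prices)


# Invented asking prices, for illustration only
sample_prices = [124_950, 125_000, 129_950, 130_000, 127_500, 134_950]
counts = digit_counts(sample_prices)
for digit in range(10):
    print(digit, counts[digit])
```

On real data, a surplus of 4s, 5s, 9s and 0s over the other digits would reproduce the effect described above.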

Distribution of prices around the country?

We can get a biased view of the population distribution simply by plotting all the property for sale locations. ‘Biased’ because, at the very least, varying economic conditions around the country will bias the number of properties for sale.

[Map: locations of properties for sale across the UK]

This looks about right: there are voids in the areas of the country which are sparsely populated, such as Scotland, Wales, the Peak District and the Lake District.

Finally, we can look at how prices vary around the country – the map below shows the average house price in a region defined by the “outcode” – the first part of a UK postcode, the letters and digits before the space. The colour of the points indicates the average price – darkest blue for the lowest average price (£40k) and darkest red for the highest average price (£500k). The size of the dots shows how many properties are for sale in that area.

House Prices by UK Outcode
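The grouping behind a map like this can be sketched as follows. It relies on the fact that the second half of a UK postcode is always three characters long, so the outcode is whatever comes before it. The listings are invented, and the (postcode, price) layout is an assumption for illustration, not the structure of the Zoopla data.

```python
from collections import defaultdict
from statistics import mean


def outcode(postcode):
    """Return the outcode of a full UK postcode, e.g. 'L1 8JQ' -> 'L1'."""
    compact = "".join(postcode.split()).upper()
    return compact[:-3]  # drop the three-character second half


def summarise_by_outcode(listings):
    """Map each outcode to (average price, number of listings)."""
    prices = defaultdict(list)
    for postcode, price in listings:
        prices[outcode(postcode)].append(price)
    return {code: (mean(values), len(values)) for code, values in prices.items()}


# Invented listings as (postcode, asking price) pairs
sample_listings = [("L1 8JQ", 100_000), ("l1 4DQ", 140_000), ("CH1 2HU", 200_000)]
summary = summarise_by_outcode(sample_listings)
for code, (average, count) in sorted(summary.items()):
    print(code, average, count)
```

The average would drive the colour of each dot and the count its size.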

I’m grateful to be living in the relatively inexpensive North West of England!

There are plenty more things to look at in this data, for example the frequency of street names around the UK and the words used by estate agents to describe properties, but that is for another day.

That’s what we found – what would you do?

Property information powered by Zoopla

Trento

Normally travel for work is a less than enjoyable experience but this week I’ve been to Trento*, and it was very pleasant. Sadly I only had my mobile phone to take photos.

Trento is in the north of Italy, the bit that is positively Germanic. The second language appears to be German and the cuisine is more alpine than pizza. It’s a small town with a substantial university. It sits in a broad-bottomed, steep-sided valley an hour on the train from Verona on the line that heads up to Bolzano, the Brenner Pass and Austria. I’d not heard of Trento before, although I had heard of the Council of Trent (which refers to the Ecumenical Council of the Roman Catholic Church which was held in Trent between 1545 and 1563).

View on the walk down from Povo

A short walk from the railway station and you are in the heart of the old city, narrow streets with marble pavements faced with buildings which in large part seem to date from the 16th century. The majority of the shops, restaurants and bars embedded in the lower floors are rather swish and classy looking.

Chiesa di Santa Maria Maggiore

The heart of the old town is the Piazza del Duomo, featuring the city’s cathedral, the fountain of Neptune and other fine buildings.

Piazza del Duomo

The cathedral seems to date to some time around the start of the 13th century. It’s been well conserved and the square itself is largely in character. The most similar British cities I know in terms of old architecture are probably Wells and Canterbury; most other British cities either never had substantial buildings of such age, or they were replaced at some point since.

The Fontana del Nettuno is quite blingy:

Fontana del Nettuno

Next door to my hotel, Chiesa di Santa Maria Maggiore:

Chiesa di Santa Maria Maggiore

Alongside the ecclesiastical buildings are some fine townhouses. Romeo and Juliet was set in Verona, an hour down the railway line – I wonder if this is how the balcony Juliet stood on looked:

Palazzo Quetta Alberti-Colico

This is the Palazzo Quetta Alberti-Colico. There’s also the rather nice Palazzo Geremia (pdf).

Palazzo Geremia

On the edge of town, close to the bridge over the river, the Torre Vanga, erected originally in 1210, is a palimpsest of masonry and brick.

Torre Vanga

And as well as that there are some impressive entrances:

Duomo di Trento

Chiesa della Santissima Trinità

39 Via Rodolfo Belenzani

Definitely worth a day trip if you are in the area, and great if you have business at the university!

*Unsurprisingly the Italian Wikipedia entry is much more extensive.