Author's posts
Nov 01 2013
Book review: Tableau 8 – the official guide by George Peck
This review was first published at ScraperWiki.
A while back I reviewed Larry Keller’s book The Tableau 8.0 Training Manual; at the same time I ordered George Peck’s book Tableau 8: the official guide, and it has just arrived. The book comes with a DVD containing bonus videos, featuring George Peck’s warm, friendly tones, and example workbooks. I must admit to being mildly nonplussed at receiving optical media, my ultrabook lacking an appropriate drive, but I dug out the USB optical drive to load them up. Providing an online link would have allowed the inclusion of up-to-date material, perhaps covering the version 8.1 announcement.
Tableau is a data visualisation application, aimed at the business intelligence area and optimised to look at database-shaped data. I’m using Tableau on a lot of the larger datasets we get at ScraperWiki for sense checking and analysis.
Colleagues have noted that analysis in Tableau looks like me randomly poking buttons in the interface. From Peck’s book I learn that the order in which I carry out my random clicking is important, since Tableau decides what you want to see based both on what you have clicked and on its current state.
To my mind the heavy reliance on the graphical interface is one of the drawbacks of Tableau, but clearly, to business intelligence users and journalists, it’s the program’s greatest benefit. It’s a drawback because capturing what you’ve done in a GUI is tricky. Some scripting and version-control capability is retained, though, because most Tableau files are plain XML, and a little fiddling with that XML is tacitly approved by Tableau – although you won’t find such information in The Official Guide. I’ve been experimenting with using git source control on workbook files, and it works.
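As a minimal sketch of what that looks like (the workbook name and commit message here are just examples), putting a workbook under version control is the usual routine:
git init
git add sales-analysis.twb
git commit -m "first cut of the sales dashboard"
Because .twb workbooks are plain XML, git diff then shows meaningful line-by-line changes; the packaged .twbx format is a zip archive and much less amenable to this.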
If you’re interested in these more advanced techniques then the Tableau Knowledgebase is worth a look. See this article, for example, on making a custom colour palette. I also like the Information Lab blog, in particular 5 things I wish I knew about Tableau when I started and UK Area Polygon Mapping in Tableau. The second post covers one of the bugbears for non-US users of Tableau: the mapping functionality is quite US-centric.
Peck covers most of the functionality of Tableau, including data connections, making visualisations, a detailed look at mapping, dashboards and so forth. I was somewhat bemused to see the scatter plot described as “esoteric”. This highlights the background of those typically using Tableau: business people, not physical scientists, and not necessarily business people who understand database query languages. Hence the heavy reliance on a graphical user interface.
I particularly liked the chapters on data connections, which also describe the various set, group and combine operations. Finally I understand the difference between data blending and data joining: joining is done at source, between tables in the same database, whilst blending is done by Tableau on data from different sources, after it has been loaded. The end result is not really different.
I now understand the point of table calculations – they’re for the times when you can’t work out your SQL query. Peck uses different language from Tableau in describing table calculations. He uses “direction” to refer to the order in which cells are processed and “scope” to refer to the groups over which cell calculations are performed. Tableau uses the terms “addressing” and “partitioning” for these two concepts, respectively.
Peck isn’t very explicit about the deep connections between SQL and Tableau but makes sufficient mention of the underlying processes to be useful.
It was nice to see a brief, clear description of the options for publishing Tableau workbooks. Tableau Public is handy and free if you want to publish to all. Tableau Online presents a useful halfway house for internal publication, whilst Tableau Server gives full flexibility in scheduling updates to data and publishing to a range of audiences with different permission levels. This is something we’re interested in at ScraperWiki.
The book ends with an Appendix of functions available for field calculations.
In some ways Larry Keller’s and George Peck’s books complement each other: Larry’s book (which I reviewed here) contains the examples that George’s lacks, and George’s contains some of the more in-depth discussion missing from Larry’s book.
Overall: a nicely produced book with high production values, good but not encyclopedic coverage.
Oct 29 2013
Git!
This post was first published at ScraperWiki.
As a software company, use of some sort of source control system is inevitable; indeed our CEO wrote TortoiseCVS – a file system overlay for the early CVS source control system. For those uninitiated in the joys of software engineering: source control is a system for recording the history of file revisions, allowing programmers to edit their code safe in the knowledge that they can always revert to a previous good state if it all goes horribly wrong. We use Git for source control, hosted either on Github or on Bitbucket. The differing needs of our platform and data services teams fit the payment plans of the two different sites.
Git is a distributed source control system created by Linus Torvalds, to support the development of Linux. Git is an incredibly flexible system which allows you to do pretty much anything. But what should you do? What should be your strategy for collective code development? It’s easy to look up a particular command to do a particular thing, but less is written on how you should string your git commands together. Here we hope to address this lack.
We use the “No Switch Yard” methodology: this involves creating branches from the master branch on which to develop new features, and regularly rebasing against the master branch so that, when the time comes, the feature branch can be merged into the master branch via a pull request with little fuss. We should not be producing a byzantine system by branching feature branches from other feature branches. The aim of “No Switch Yard” is to keep the history as simple as possible and to make merging branches back onto master as easy as possible.
How do I start?
Assuming that you already have some code in a repository, create a local clone of that repository:
git clone git@github.com:scraperwiki/myproject.git
Create a branch:
git checkout -b my-new-stuff
Start coding…adding files and committing changes as you go:
git add -u
git commit -m "everything is great"
The -u switch to git add stages all of the tracked files that have changed; it won’t pick up brand-new, untracked files. Depending on your levels of paranoia you can push your branch back to the remote repository, setting the upstream as you go:
git push -u origin my-new-stuff
How do I understand what’s going on?
For me the key revelation for workflow was to be able to find out my current state and feel pleasure when it was good! To do this, fetch any changes that may have been made on your repository:
git fetch
and then run:
git log --oneline --graph --decorate --all
to see an ASCII-art history diagram for your repository. What you are looking for here is a relatively simple branching structure, without too many parallel tracks and with the tips of each branch lined up between your local and remote copies.
You can make an alias to simplify this inspection:
git config --global alias.lg 'log --oneline --graph --decorate'
Then you can just do:
git lg --all
I know someone else has pushed to the master branch from which I branched – what should I do?
If stuff is going on on your master branch, perhaps because your changes are taking a while to complete, you should rebase. You should also do this just before submitting a pull request to merge your work with the master branch.
git rebase -i
allows you to rebase interactively; this means you can combine multiple commits into a single larger commit. You might want to do this if you made lots of little commits whilst achieving a single goal. Rebasing brings you up to date with another branch without actually merging your changes into that branch.
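As a minimal sketch, assuming your feature branch came off master and the remote is called origin, the everyday rebase is:
git fetch
git rebase origin/master
and the interactive, commit-squashing variant is:
git rebase -i origin/master
If the rebase stops on a conflict, fix the files it complains about, git add them and run git rebase --continue (or git rebase --abort to back out).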
I’m done, how do I give my colleagues the opportunity to work on my great new features?
You need to rebase against the remote branch onto which you wish to merge your code and then submit a pull request for your changes. You can submit a pull request from the web interface at Github or Bitbucket. Or you can use a command line tool such as hub. The idea of using a pull request is that it makes your changes visible to your colleagues, and keeps a clear record of those changes. If you’ve been rebasing regularly you should be able to merge your code automatically.
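For example, assuming your branch is called my-new-stuff and you have already pushed it once, the rebase will have rewritten its history, so the next push has to be forced; hub can then open the pull request from the command line:
git push -f origin my-new-stuff
hub pull-request
Under the ownership principle below the feature branch is yours alone, so the forced push is safe enough.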
An important principle here is “ownership”: in social terms you own your local branch on which you are developing a feature, so you can do what you like with it. The master branch from which you started work is in collective ownership, so you should only merge changes onto it with the permission of your colleagues, and ideally you want others to look at your changes and approve the pull request themselves.
I started doing some fiddling around with my code and now I realise it’s serious and I want to put it on a branch – what do I do?
You need to stash your code, using:
git stash
Then create a branch, as described above, and retrieve the contents of the stash:
git stash pop
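Putting those three steps together, with a hypothetical branch name:
git stash
git checkout -b my-serious-feature
git stash pop
Your working directory ends up exactly as it was, but now on a branch you can commit to without troubling master.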
That’s how we use git – what do you do?
Oct 23 2013
A place in the country
This post was first published at ScraperWiki.
Recently Shelter came to us asking for data on house prices across the UK to help them with some research in support of a campaign on housing affordability.
This is a challenge we’re well suited to address; in fact a large fraction of the ScraperWiki team have scraped property price data for our own purposes. Usually, though, we just scrape a local area using the Zoopla API, but Shelter wanted the whole country. It would be possible to do the whole country by this route, but rate-limiting would mean it took a few days. So we spoke nicely to Zoopla, who generously lifted the rate-limiting for us; they were also very helpful in responding to our questions about their API.
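For a single area this is only a handful of HTTP requests. As a rough sketch, with the caveat that the endpoint and parameter names here are written from memory and should be checked against the Zoopla developer documentation (YOUR_API_KEY is a placeholder):
curl "http://api.zoopla.co.uk/api/v1/property_listings.json?area=Liverpool&api_key=YOUR_API_KEY&page_size=100&page_number=1"
Repeating this for every area in the country is what runs into the rate limits mentioned above.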
The raw data for this job amounted to 2 gigabytes, 34 pieces of information for each of 500,000 properties for sale in the UK in August 2013. The data tell us about the location, the sale price, the property details, the estate agent details and the price history of each property.
As usual in these situations we fired up Tableau to get a look at the data; Tableau is well suited to this type of database-table-shaped data and is responsive for this number of lines of data.
What sort of properties are we looking at here?
We can find out this information from the “property type” field, shown in the chart below which counts the number of properties for sale in each property type category. The most common category is “Detached”, followed by “Flat”.
We can also look at the number of bedrooms. Unsurprisingly the number of bedrooms peaks at about 3, but with significant numbers of properties having 4, 5 and 6 bedrooms. Beyond that there are various guest houses, investment properties and parcels of land for sale with nominal numbers of bedrooms, culminating in a 150-bedroom “property” which actually sounds like a village.
What about prices?
This is where things get really interesting. Below is a chart of the number of properties for sale in each £25k price “bin”; for example, the bin marked 475k contains all of the houses priced between £475k and £499,950, with the next bin, labelled 500k, containing houses priced from £500k to £525k. We can see that the plot here is jagged: the number of properties for sale in each bin does not vary smoothly as the price increases, it jumps up and down. In fact this effect is quite regular: for houses priced over £500k there are fewest for sale at the round numbers (£500k, £600k and so on) and most for sale at £575k, £675k and so forth.
But this doesn’t just affect the super-wealthy – if we zoom into the lower-priced region, making our price bins only £1k wide, there is a similar effect, with prices ending in 4, 5 and 9, 0 more frequent than those ending in 1, 2, 3 or 6, 7, 8. This is all the psychology of pricing.
How are prices distributed around the country?
We can get a biased view of the population distribution simply by plotting all the property for sale locations. ‘Biased’ because, at the very least, varying economic conditions around the country will bias the number of properties for sale.
This looks about right: there are voids in the sparsely populated areas of the country, such as Scotland, Wales, the Peak District and the Lake District.
Finally, we can look at how prices vary around the country – the map below shows the average house price in each region defined by the “outcode”, the first part of a UK postcode. The colour of the points indicates the average price – darkest blue for the lowest average price (£40k) and darkest red for the highest average price (£500k). The size of the dots shows how many properties are for sale in that area.
I’m grateful to be living in the relatively inexpensive North West of England!
There are plenty more things to look at in this data – for example, the frequency of street names around the UK and the words used by estate agents to describe properties – but that is for another day.
That’s what we found – what would you do?
Oct 06 2013
Trento
Normally travel for work is a less than enjoyable experience but this week I’ve been to Trento*, and it was very pleasant. Sadly I only had my mobile phone to take photos.
Trento is in the north of Italy, the bit that is positively Germanic. The second language appears to be German and the cuisine is more alpine than pizza. It’s a small town with a substantial university. It sits in a broad-bottomed, steep-sided valley an hour on the train from Verona, on the line that heads up to Bolzano, the Brenner Pass and Austria. I’d not heard of Trento before, though I’d heard of the Council of Trent (the Ecumenical Council of the Roman Catholic Church held in Trent between 1545 and 1563).
A short walk from the railway station and you are in the heart of the old city, narrow streets with marble pavements faced with buildings which in large part seem to date from the 16th century. The majority of the shops, restaurants and bars embedded in the lower floors are rather swish and classy looking.
The heart of the old town is the Piazza del Duomo, featuring the city’s cathedral, the fountain of Neptune and other fine buildings.
The cathedral seems to date to some time around the start of the 13th century. It’s been well conserved and the square itself is largely in character. The most similar British cities I know in terms of old architecture are probably Wells and Canterbury; most other British cities either never had substantial buildings of such age, or they were replaced at some point since.
The Fontana del Nettuno is quite blingy:
Next door to my hotel, Chiesa di Santa Maria Maggiore:
Alongside the ecclesiastical buildings are some fine townhouses. Romeo and Juliet was set in Verona, an hour down the railway line – I wonder if this is how the balcony Juliet stood on looked:
This is the Palazzo Quetta Alberti-Colico. There’s also the rather nice Palazzo Geremia (pdf)
On the edge of town, close to the bridge over the river, the Torre Vanga, erected originally in 1210, is a palimpsest of masonry and brick.
And as well as that there are some impressive entrances:
Definitely worth a day trip if you are in the area, and great if you have business at the university!
*Unsurprisingly the Italian wikipedia entry is much more extensive.
Oct 05 2013
Book Review: Backroom Boys by Francis Spufford
Electronic books bring many advantages but for a lengthy journey to Trento a paper book seemed more convenient. So I returned to my shelves to pick up Backroom Boys: The Secret Return of the British Boffin by Francis Spufford.
I first read this book quite some time ago; it tells six short stories of British technical innovation. It is in the character of Empire of the Clouds and A Computer Called LEO: perhaps a little nationalistic, and regretful of opportunities lost.
The first of the stories is of the British space programme after the war; it starts with the disturbing picture of members of the British Interplanetary Society celebrating the fall of a V2 rocket in London. This leads on to a brief discussion of Blue Streak – Britain’s ICBM, scrapped in favour of the American Polaris missile system. As part of the Blue Streak programme a rocket named Black Knight was developed to test re-entry technology; from this grew the Black Arrow – a rocket to put satellites into space.
In some ways Black Arrow was a small, white elephant from the start. The US had offered the British free satellite launches. Black Arrow was run on a shoestring budget, kept strictly as an extension of the Black Knight rocket and hence rather small. The motivation for this was nominally that it could be used to gain experience for the UK satellite industry and provide an independent launch system for the UK government, perhaps for things they wished to keep quiet. Ultimately it launched a single test satellite into space, still orbiting the earth now. However, it was too small to launch the useful satellites of the day and growing it would require complete redevelopment. The programme was cancelled in 1971.
Next up is Concorde, which could probably be better described as a large, white elephant. Developed in a joint Anglo-French programme into which the participants were mutually locked, it burned money for nearly two decades before the British part was taken on by British Airways, who used it to enhance the prestige of their brand. As a workhorse commercial jet it was a poor choice: too small, too thirsty, and too loud.
But now for something more successful! Long ago there existed a home computer market in the UK, populated by many and various computers. First amongst these early machines was the BBC Micro, for which the first blockbuster game, Elite, was written by two Cambridge undergraduates (David Braben and Ian Bell). I played Elite in one of its later incarnations – on an Amstrad CPC464. Elite was a space trading and fighting game with revolutionary 3D wireframe graphics and complex gameplay. And it all fitted into 22KB – the absolute maximum memory available on the BBC Micro. The cunning required to build multiple universes in such a small space, and the battles to gain a byte here and a byte there to add another feature, are alien to the modern programmer’s eyes. At the time Acornsoft were publishing quite a few games but Elite was something different: they’d paid for the development, which took an unimaginable 18 months or so, and when it was released there was a launch event at Alton Towers and the game came out in a large box stuffed with supporting material. All of this was a substantial break with the past. Ultimately the number of copies of Elite sold for the BBC Micro approximately matched the number of BBC Micros sold – an apparent market saturation.
Success continues with the story of Vodafone – one of the first two players in the UK mobile phone market. The science here is in radio planning – choosing where to place your masts for optimal coverage; Vodafone bought handsets from Panasonic and base stations from Ericsson. Interestingly, Europe and the UK had a lead over the US in digital mobile networks – they agreed the GSM standard, which gave instant access to a huge market, whilst in the US 722 franchises were awarded with no common digital standard.
Moving out of the backroom a little is the story of the Human Genome Project, principally the period after Craig Venter announced he was going to sequence the human genome faster than the public effort and then sell it! This effort was stymied by the Wellcome Trust, who put a great deal of further money into the public effort. Genetic research has a long history in the UK, but the story here is one of industrial-scale sequencing, quite different from conventional lab research, and of the power of the world’s second largest private research funder (the largest is currently the Bill & Melinda Gates Foundation).
The final chapter of the book is on the Beagle 2 Mars lander, built quickly, cheaply and with the huge enthusiasm and (unlikely) fund-raising abilities of Colin Pillinger. Sadly, as the Epilogue records, the lander became a high velocity impactor – nothing was heard from it after it left the Mars orbiter which had brought it from the Earth.
The theme of the book is the innate cunning of the British, but if there’s a lesson to be learnt it seems to be that thinking big is a benefit. Elite, the mobile phone network and the Human Genome Project were the successes in this book. Concorde was a technical wonder but an economic disaster. Black Arrow and Beagle 2 suffered from being done on a shoestring budget.
Overall I enjoyed Backroom Boys; it reminded me of my childhood, with Elite and the coming of mobile phones. It’s more a celebration than a dispassionate view, but there’s no harm in that.