Author's posts
Aug 23 2013
Making a ScraperWiki view with R
This post was first published at ScraperWiki.
In a recent post I showed how to use the ScraperWiki Twitter Search Tool to capture tweets for analysis. I demonstrated this with a search on the #InspiringWomen hashtag, using Tableau to generate a visualisation.
Here I’m going to show a tool, made using the R statistical programming language, which can be used to view any Twitter Search dataset. R is very widely used in both academia and industry to carry out statistical analysis. It is open source and has a large community of users who are actively developing new libraries with new functionality.
Although this viewer is a trivial example, it can be used as a template for any other R-based viewer. To break the suspense this is what the output of the tool looks like:
The tool updates when the underlying data is updated; the Twitter Search tool checks for new tweets on an hourly basis. The tool shows the number of tweets found and a histogram of the times at which they were tweeted. To limit the time taken to generate a view, the number of tweets is capped at 40,000. The histogram uses bins of one minute, so the vertical axis shows tweets per minute.
The code can all be found in this BitBucket repository.
The viewer is based on the knitr package for R, which generates reports in specified formats (HTML, PDF etc.) from a source template file containing R commands that are executed to generate content. In this case we use Rhtml, rather than the alternative Markdown, because it enables us to specify custom CSS and JavaScript to integrate with the ScraperWiki platform.
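To make that concrete, here is an illustrative Rhtml fragment – it is not a copy of the view.Rhtml in the repository, and the page title is made up. knitr looks for R code inside specially formatted HTML comments and replaces each chunk with its output when the template is knitted. The functions called here are defined in the view-source.R file described later in this post.
<html>
<head><title>Twitter Search dashboard</title></head>
<body>
<!--begin.rcode setup, echo=FALSE, message=FALSE
source('view-source.R')
end.rcode-->
<h2>Number of tweets found</h2>
<!--begin.rcode count, echo=FALSE
NumberOfTweets()
end.rcode-->
<h2>Tweets per minute</h2>
<!--begin.rcode histogram, echo=FALSE, fig.width=8
TweetsHistogram()
end.rcode-->
</body>
</html>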
ScraperWiki tools live in their own UNIX accounts called “boxes”: the code for the tool lives in a subdirectory, ~/tool, and web content in the ~/http directory is displayed. In this project the http directory contains a short JavaScript file, code.js, which, by the magic of jQuery and some messy bash shell commands, puts the URL of the SQL endpoint into a file in the box. It also runs a package installation script once, after the tool is first installed; the only package not already installed is ggplot2.
// Save the URL of the dataset's SQL endpoint into a file in the box,
// so that the R code can pick it up later.
function save_api_stub(){
    scraperwiki.exec('echo "' + scraperwiki.readSettings().target.url + '" > ~/tool/dataset_url.txt; ')
}

// Run the package installation script once, in the background,
// logging its output to tool/log.txt.
function run_once_install_packages(){
    scraperwiki.exec('run-one tool/runonce.R &> tool/log.txt &')
}

$(function(){
    save_api_stub();
    run_once_install_packages();
});
The ScraperWiki platform has an update hook: simply an executable file called update in the ~/tool/hooks/ directory, which is executed when the underlying dataset changes.
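For this tool the hook only needs to re-run the knitting step described next. I haven’t reproduced the repository’s actual hook here; the simplest thing that could work would be something like the sketch below, where the path to knitrview.R is an assumption.
#!/usr/bin/Rscript
# Hypothetical ~/tool/hooks/update script (not the repository's actual file):
# re-run the knitting script so index.html is regenerated from the fresh data.
source('~/tool/knitrview.R')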
This brings us to the meat of the viewer: the knitrview.R file calls the knitr package to take the view.Rhtml file and convert it into an index.html file in the http directory. The view.Rhtml file contains calls to some functions in R which are used to create the dynamic content.
#!/usr/bin/Rscript
# Script to knit a file 2013-08-08
# Ian Hopkinson

library(knitr)
# Use the R libraries installed locally in the box
.libPaths('/home/tool/R/libraries')
# Set knitr's output hooks for HTML, then knit the Rhtml template into the
# index.html page served from the http directory
render_html()
knit("/home/tool/view.Rhtml", output = "/home/tool/http/index.html")
Code for interacting with the ScraperWiki platform is in the scraperwiki_utils.R file, which contains the following (a sketch of how these pieces might fit together follows the list):
- a function to read the SQL endpoint URL which is dumped into the box by some JavaScript used in the Rhtml template.
- a function to read the JSON output from the SQL endpoint – this is a little convoluted since R cannot natively use https, and solutions to read https are different on Windows and Linux platforms.
- a function to convert the imported JSON into a clean dataframe. The data structure returned by the rjson package is composed of lists of lists and requires reprocessing into the preferred vector-based dataframe format.
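Putting those pieces together, a minimal sketch of such a helper might look like the code below. This is a simplified illustration rather than the contents of the repository’s scraperwiki_utils.R: it assumes the RCurl and rjson packages, glosses over the platform-specific https handling mentioned above, and the ScraperWikiAPIURL helper, the file path and the /sql?q= endpoint format are all assumptions.
#!/usr/bin/Rscript
# Simplified sketch of the ScraperWiki helper functions, for illustration only.
library(RCurl)
library(rjson)

ScraperWikiAPIURL <- function(){
  # code.js writes the URL of the SQL endpoint into this file in the box
  readLines('~/tool/dataset_url.txt', n = 1)
}

ScraperWikiSQL <- function(query){
  # Fetch the query results from the SQL endpoint as JSON
  url <- paste0(ScraperWikiAPIURL(), '/sql?q=', URLencode(query, reserved = TRUE))
  json <- getURL(url, ssl.verifypeer = FALSE)
  # Convert the rjson lists-of-lists into a conventional vector-based data frame
  records <- fromJSON(json)
  do.call(rbind, lapply(records, function(r) as.data.frame(r, stringsAsFactors = FALSE)))
}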
Functions for generating the view elements are in view-source.R, which means that the R code embedded in the Rhtml template consists of simple function calls. The main plot is generated using the ggplot2 library.
#!/usr/bin/Rscript
# Script to create r-view 2013-08-14
# Ian Hopkinson

source('scraperwiki_utils.R')

NumberOfTweets <- function(){
  query = 'select count(*) from tweets'
  number = ScraperWikiSQL(query)
  return(number)
}

TweetsHistogram <- function(){
  library("ggplot2")
  library("scales")
  #threshold = 20
  bin = 60 # Size of the time bins in seconds
  query = 'select created_at from tweets order by created_at limit 40000'
  dates_raw = ScraperWikiSQL(query)
  posix = strptime(dates_raw$created_at, "%Y-%m-%d %H:%M:%S+00:00")
  num = as.POSIXct(posix)
  Dates = data.frame(num)
  p = qplot(num, data = Dates, binwidth = bin)
  # This gets us out the histogram count values
  counts = ggplot_build(p)$data[[1]]$count
  timeticks = ggplot_build(p)$data[[1]]$x
  # Calculate limits, method 1 – simple min and max of range
  start = min(num)
  finish = max(num)
  minor = waiver() # Default breaks
  major = waiver()
  p = p + scale_x_datetime(limits = c(start, finish),
                           breaks = major, minor_breaks = minor)
  p = p + theme_bw() + xlab(NULL) + theme(axis.text.x = element_text(angle = 45,
                                                                     hjust = 1,
                                                                     vjust = 1))
  p = p + xlab('Date') + ylab('Tweets per minute') + ggtitle('Tweets per minute (Limited to 40000 tweets in total)')
  return(p)
}
So there you go – not the world’s most exciting tool, but it shows the way to make live reports on the ScraperWiki platform using R. An extension to this would be to allow some user interaction, for example letting users adjust the axis limits. This could be done either using JavaScript and vanilla R, or using Shiny.
What would you do with R in ScraperWiki? Let me know in the comments below or by email: [email protected]
Aug 02 2013
Photographing Liverpool
I’ve been working in Liverpool for a few months now. I take the Merseyrail train into Liverpool Central and then walk up the hill to ScraperWiki’s offices, which are next door to the Liverpool Metropolitan Cathedral, aka “Paddy’s Wigwam”.
The cathedral was built in the 1960s, and I rather like it. It looks to me like part of a set from a futuristic sci-fi film, maybe Gattaca or Equilibrium. Or some power collection or communication device, which in a way I suppose it is.
To be honest the rest of my usual walk up Brownlow Hill is coloured by a large, fugly car park and a rather dodgy-looking pub. However, these last few weeks Merseyrail’s Liverpool Loop has been closed to re-lay track, so I’ve walked across town from James Street station, giving me the opportunity to admire some of Liverpool’s other architecture.
As an aside, it turns out that Merseyrail is the second oldest underground urban railway in the world, opening in 1886 and also originally running on steam power, according to Wikipedia, which seems to contradict Christian Wolmar in his book on the London Underground, which I recently reviewed. (Wolmar states that the London Underground is the only one to have run on steam power.)
Returning to architecture, I leave James Street station via the pedestrian exit on Water Street; there is a lift up onto James Street but I prefer the walk. As I come out there is a glimpse of the Royal Liver Building on the waterfront.
Just along the road is Liverpool Town Hall; for some reason it’s offset slightly from the centre of Castle Street, which spoils the vista a little.
Down at the other end of Castle Street we find the Queen Victoria Monument; she stands in Derby Square in front of the rather unattractive Queen Elizabeth II Law Courts.
On the way I pass the former Adelphi Bank building, now a Cafe Nero. I like the exuberant decoration and the colour of the stone and domes.
Raise your eyes from ground level and you see more decoration at the roof line of the buildings on Castle Street:
Once I’ve passed the Victoria Monument it’s a long straight walk down Lord Street and then Church Street, which have a mixture of buildings, many quite modern but some a bit older, often spoiled by the anachronistic shop fronts at street level.
I quite like this one at 81-89 Lord Street, but it’s seen better days; it used to look like this. It looks like it used to have a spectacular interior.
Further along, on Church Street, there is a large M&S in a fine building.
By now I’ve almost reached my normal route into work from Liverpool Central station. Just around the corner on Renshaw Street is Grand Central Hall, which started life as a Methodist church.
It’s a crazy-looking building; the buddleia growing out of the roof makes me think of one of J.G. Ballard’s novels.
We’re on the final straight now, heading up Mount Pleasant towards the Metropolitan Cathedral. Looking back we can see the Radio City Tower; actually, we can see the Radio City Tower from pretty much anywhere in Liverpool.
A little before we reach the Metropolitan Cathedral there is the YMCA on Mount Pleasant, another strange Victorian Gothic building.
I struggled to get a reasonable photograph of this one; I was using my 28-135mm lens on a Canon 600D for this set of photos. This is a good walking-around lens, but for photos of buildings in dense city environments the 10-22mm lens is better for its ridiculously wide-angle view – handy for taking pictures of all of a big building when you are standing next to it!
So maybe next week I’ll head out with the wide-angle lens and apply some of the rectilinear correction I used on my Chester photographs.
Jul 29 2013
pdftables – a Python library for getting tables out of PDF files
This post was first published at ScraperWiki.
One of the top searches bringing people to the ScraperWiki blog is “how do I scrape PDFs?” The answer is typically “with difficulty”, but things are getting better all the time.
PDF is a page description format; it has no knowledge of the logical structure of a document, such as where the titles or paragraphs are, or whether it’s in a two-column or one-column format. It just knows where characters are on the page. The plot below shows how characters are laid out for a large table in a PDF file.
This makes extracting structured data from PDF a little challenging.
Don’t get me wrong, PDF is a useful format in the right place. If someone sends me a CV, I expect to get it in PDF because it’s a read-only format. Send it in Microsoft Word format and the implication is that I can edit it – which makes no sense.
I’ve been parsing PDF files for a few years now: to start with using simple online PDF-to-text converters, then with pdftohtml, which gave me better location data for text, and now using the Python pdfminer library, which extracts non-text elements as well as bonding words into sentences and coherent blocks. This classification is shown in the plot below: the blue boxes show where pdfminer has joined characters together to make text boxes (which may be words or sentences), and the red boxes show lines and rectangles (i.e. non-text elements).
More widely at ScraperWiki we’ve been processing PDF since our inception, with the tools I’ve described above and also the commercial Abbyy software.
As well as processing text documents such as parliamentary proceedings, we’re also interested in tables of numbers. This is where the pdftables library comes in: we’re working towards making scrapers which are indifferent to the format in which a table is stored, receiving them via the OKFN messytables library, which takes adapters for different file types. We’ve already added support to messytables for HTML; now it’s time for PDF support using our new, version-much-less-than-one pdftables library.
Amongst the alternatives to our own efforts are Mozilla’s Tabula, written in Ruby and requiring the user to draw around the target table, and Abbyy’s software, which is commercial rather than open source.
pdftables can take a file handle and tell you which pages have tables on them; it can extract the contents of a specified page as a single table, and by extension it can return all of the tables in a document (at the rate of one per page). For simple tables it’s possible to do this with no parameters, but for more difficult layouts it currently takes hints in the form of words found on the top and bottom rows of the table you are looking for. The tables are returned as a list of lists of lists of strings, along with a diagnostic object which you can use to make plots. If you’re using the messytables library you just get back a tableset object.
It turns out the defining characteristic of a data scientist is that I plot things at the drop of a hat: I want to see the data I’m handling. And so it is with the development of the pdftables algorithms. The method used is inspired by image analysis algorithms, similar to the Hough transforms used in Tabula. A Hough transform will find arbitrarily oriented lines in an image, but our problem is a little simpler – we’re only interested in horizontal and vertical lines.
To find these rows and columns we project the bounding boxes of the text on a page onto the horizontal axis (to find the columns) and onto the vertical axis (to find the rows). By projection we mean counting up the number of text elements along a given horizontal or vertical line. The row and column boundaries are marked by low values, gullies, in the plot of the projection, while the rows and columns of the table form high mountains; you can see this clearly in the plot below. Here we are looking at the PDF page at the level of individual characters, and the plots at the top and left show the projections. The black dots show where pdftables has placed the row and column boundaries.
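The projection itself is straightforward to compute. Below is a sketch of the idea in R, purely for illustration – pdftables itself is written in Python, and the function names, column names and threshold here are my own assumptions rather than anything from the library. Given a data frame of character bounding boxes with xmin and xmax columns, we count how many boxes cover each horizontal position and then treat low-count positions as candidate column boundaries; doing the same with the vertical coordinates gives the row boundaries.
# Illustrative sketch (not pdftables code): find candidate column boundaries
# by projecting character bounding boxes onto the horizontal axis.
ColumnProjection <- function(boxes, step = 1){
  xs <- seq(floor(min(boxes$xmin)), ceiling(max(boxes$xmax)), by = step)
  counts <- sapply(xs, function(x) sum(boxes$xmin <= x & boxes$xmax >= x))
  data.frame(x = xs, count = counts)
}

FindGullies <- function(projection, threshold = 0){
  # Gullies are positions covered by few or no bounding boxes –
  # these mark the gaps between the columns of the table.
  projection$x[projection$count <= threshold]
}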
pdftables is currently useful for supervised use but not so good if you want to just throw PDF files at it. You can find pdftables on Github and you can see the functionality we are still working on in the issue tracker. Top priorities are finding more than one table on a page and identifying multi-column text layouts to help with this process.
You’re invited to have a play and tell us what you think – [email protected]
Jul 25 2013
Book review: The Subterranean Railway by Christian Wolmar
To me the London underground is an almost magical teleportation system which brings order to the chaos of London. This is because I rarely visit London and know it only via Harry Beck’s circuit diagram map of the underground. To find out more about the teleporter, I have read The Subterranean Railway by Christian Wolmar.
London’s underground system was the first in the world; it predated any others by nearly 40 years. This had some drawbacks: for the first 30 years of its existence it ran exclusively using steam engines, which are not good in an enclosed, underground environment. In fact travel in the early years of the Underground sounds really rather grim, despite its success.
The context for the foundation of the Underground was the burgeoning British rail network: it had started with one line between Manchester and Liverpool in 1830, and by 1850 a system spanned the country. The network did not penetrate to the heart of London; it had been stopped by a combination of landowner interests and expense. This exclusion was enshrined in the report of the 1846 Royal Commission on Metropolis Railway Termini. This left London with an ever-growing transport problem, now increased by the railway’s ability to get people to the perimeter of the city but no further.
The railways were the largest human endeavours since Roman times: as well as the engineering challenges, there were significant financial challenges in raising capital and political challenges in getting approval. This was despite the fact that the railway projectors were exempted from the restrictions on raising capital from groups of more than five people introduced after the South Sea Bubble.
The first underground line, the Metropolitan, opened in 1863 and ran from Paddington to Farringdon – it had been 20 years in the making, although construction only took 3 years. The tunnels were made by the cut-and-cover method, which works as described: a large trench is dug, the railway built in the bottom and then covered over. This meant the tunnels were relatively shallow, mainly followed the line of existing roads and involved immense disruption on the surface.
In 1868 the first section of the District line opened; this was always to be the Metropolitan’s poorer relative but would form part of the Circle line, finally completed in 1884 despite the animosity between James Staats Forbes and Edward Watkin – the heads of the respective companies at the time. It’s worth noting that it wasn’t until 1908 that the first London Underground maps were published; in its early days the underground “system” was the work of disparate private companies who were frequently at loggerheads and certainly not focussed on cooperating to the benefit of their passengers.
The underground railways rarely provided the returns their investors were looking for, but they had an enormous social impact: for the first time poorer workers in the city could live out of town in relatively cheap areas and commute in, and the railway companies positively encouraged this. The Metropolitan also invested in property in what are now the suburbs of London; areas such as Golders Green were open fields before the underground came. This also reflects the expansion of the underground into the surrounding country.
The first deep line, the City and South London, opened in 1890; it was also the first electric underground line. The deep lines were tunnelled beneath the city using the tunnelling shield developed by Marc Brunel earlier in the 19th century. Following this first electrification the District and Metropolitan eventually electrified their lines too, although it took some time (and a lot of money). The finance for the District line came via the American Charles Tyson Yerkes, who could generously be described as a colourful character, engaging in the sort of financial engineering we might imagine is a recent invention.
Following the First World War the underground was tending towards a private monopoly, and government was looking to invest to create work; ultimately the underground was nationalised, at arm’s length, to form London Transport in 1933, led by the same men (Lord Ashfield and Frank Pick) who had run the private monopoly.
The London underground reached its zenith in the years leading up to the Second World War, gaining its identity (roundel, font and iconic map) and forming a coherent, widespread network. After the war it was starved of funds and declined, overtaken by the private car. Further lines were added, such as the Victoria and Jubilee lines, but activity was much reduced from the early years.
More recently it has seen something of a revival with the ill-fated Public-Private Partnership running into the ground, but not before huge amounts of money had been spent, substantially on improvements. As I write, the tunnelling machines are building Crossrail.
I felt the book could have done with a construction timeline, something like this one on Wikipedia (link). Early on there seems to be a barrage of new line openings, sometimes not in strictly chronological order, and to someone like me, unfamiliar with London, it is all a bit puzzling. Despite this, The Subterranean Railway is an enjoyable read.
Jul 22 2013
Book Review: Clean Code by Robert C. Martin
This review was first published at ScraperWiki.
Following my revelations regarding sharing code with other people I thought I’d read more about the craft of writing code in the form of Clean Code: A Handbook of Agile Software Craftsmanship by Robert C. Martin.
Despite the appearance of the word Agile in the title this isn’t a book explicitly about a particular methodology or technology. It is about the craft of programming, perhaps encapsulated best by the aphorism that a scout always leaves a campsite tidier than he found it. A good programmer should leave any code they touch in a better state than they found it. Martin has firm ideas on what “better” means.
After a somewhat sergeant-majorly introduction in which Martin tells us how hard this is all going to be, he heads off into his theme.
Martin doesn’t like comments, he doesn’t like switch statements, he doesn’t like flag arguments, he doesn’t like multiple arguments to functions, he doesn’t like long functions, he doesn’t like long classes, he doesn’t like Hungarian* notation, he doesn’t like output arguments…
This list of dislikes generally isn’t unreasonable; for example comments in code are in some ways an anachronism from when we didn’t use source control and were perhaps limited in the length of our function names. The compiler doesn’t care about the comments and does nothing to police them so comments can be actively misleading (Guilty, m’lud). Martin prefers the use of descriptive function and variable names with a clear hierarchical structure to the use of comments.
The Agile origins of the book are seen with the strong emphasis on testing, and Test Driven Development. As a new convert to testing I learnt a couple of things here: that clearly written tests are as important as clearly written code, and the importance of test coverage (how much of your code is exercised by tests).
I liked the idea of structuring functions in a code file hierarchically and trying to ensure that each function operates at a single layer of abstraction. I’m fairly sold on the idea that a function should do one thing, and one thing only, although to my mind the difficulty is in the definition of “thing”.
It seems odd to use Java as the central, indeed only, programming language in this book. I find it endlessly cluttered by keywords used in the specification of functions and variables, so that any clarity in the structure and naming that the programmer introduces is hidden in the fog. The book also goes into excruciating detail on specific aspects of Java in a couple of chapters. As a testament to the force of the PEP8 coding standard, used for Python, I now find Java’s prevailing use of CamelCase visually disturbing!
There are a number of lengthy examples in the book, demonstrating code before and after cleaning with a detailed description of the rationale for each small change. I must admit I felt a little sleight of hand was involved here: Martin takes chunks of what he considers messy code, typically involving longish functions, and breaks them down into smaller functions; we are then typically presented with the highest-level function with its neat list of function calls. The tripling of the size of the code in function declaration boilerplate is then elided.
The book finishes with a chapter on “[Code] Smells and Heuristics” which summarises the various “code smells” (as introduced by Martin Fowler in his book Refactoring: Improving the Design of Existing Code) and other indicators that your code needs a cleaning. This is the handy quick reference to the lessons to be learned from the book.
Despite some qualms about the style, and the fanaticism of it all, I did find this an enjoyable read and felt I’d learnt something. Fundamentally I like the idea of craftsmanship in coding, and it fits with code sharing.
*Hungarian notation is the habit of prefixing variable names with a letter or letters to indicate their type, for example strName for a string.