Tag: scraperwiki

A place in the country

This post was first published at ScraperWiki.

Recently Shelter came to us asking for data on house prices across the UK to help them with some research in support of campaign on housing affordability.

This is a challenge we’re well suited to address, in fact a large fraction of the ScraperWiki team have scraped property price data for our own purposes. Usually though we just scrape a local area, using the Zoopla API, but Shelter wanted the whole country. It would be possible to do the whole country by this route but rate-limiting would mean it took a few days. So we spoke nicely to Zoopla who generously lifted the rate-limiting for us, they were also very helpfully in responding to our questions about their API.

The raw data for this job amounted to 2 gigabytes, 34 pieces of information for each of 500,000 properties for sale in the UK in August 2013. The data tell us about the location, the sale price, the property details, the estate agent details and the price history of each property.

As usual in these situations we fired up Tableau to get a look at the data, Tableau is well-suited to this type of database-table shaped data and is responsive for this number of lines of data.

What sort of properties are we looking at here?

We can find out this information from the “property type” field, shown in the chart below which counts the number of properties for sale in each property type category. The most common category is “Detached”, followed by “Flat”.

Property_Type

We can also look at the number of bedrooms. Unsurprisingly the number of bedrooms peaks at about 3 but with significant numbers of properties with 4, 5 and 6 bedrooms. Beyond that there are various guest houses, investment properties, parcels of land for sale with nominal numbers of bedrooms culminating in a 150 bedroom “property” which actually sounds like a village.

What about prices?

This is where things get really interesting. Below is a chart of the number of properties for sale in each price £25k price “bin”, for example the bin marked 475k contains all of the houses priced between £475k and £499,950 – the next bin being labelled 500k containing houses priced from £500k to £525k. We can see that the plot here is jagged, the numbers of properties for sale in each bin does not vary smoothly as the price increases, it jumps up and down. In fact this effect is quite regular, for houses priced over £500k there are fewest for sale at the round numbers £500k, £600k etc most for sale at £575k, £675k and so forth.

PriceHistogram_25k

But this doesn’t just effect the super-wealthy – if we zoom into the lower priced region, making our price bins only £1k there is a similar effect with prices ending 4,5 and 9,0 more frequent than those ending 1, 2, 3 or 6, 7, 8. This is all the psychology of pricing.

Distribution of prices around the country?

We can get a biased view of the population distribution simply by plotting all the property for sale locations. ‘Biased’ because, at the very least, varying economic conditions around the country will bias the number of properties for sale.

Density_map_crop

This looks about right, there are voids in the areas of the country which are sparsely populated such as Scotland, Wales, the Peak District and the Lake District.

Finally, we can look at how prices vary around the country – the map below shows the average house price in a region defined by the “outcode” – the first group of letters in a UK postcode. The colour of the points indicates the average price – darkest blue for the lowest average price (£40k) and darkest red for the highest average price (£500k). The size of the dots shows how many properties are for sale in that area.

House Prices by UK Outcode

I’m grateful to be living in the relatively inexpensive North West of England!

There’s plenty more things to look at in this data, for example – the frequency of street names around the UK and the words used by estate agents to describe properties but that is for another day.

That’s what we found – what would you do?

Property information powered by Zoopla

Book review: The Tableau 8.0 Training Manual – From clutter to clarity by Larry Keller

Tableau 8.0 Training Manual

This review was first published at ScraperWiki.

My unstoppable reading continues, this time I’ve polished off The Tableau 8.0 Training Manual: From Clutter to Clarity by Larry Keller. This post is part review of the book, and part review of Tableau.

Tableau is a data visualisation application which grew out of academic research on visualising databases. I’ve used Tableau Public a little bit in the past. Tableau Public is a free version of Tableau which only supports public data i.e. great for playing around with but not so good for commercial work. Tableau is an important tool in the business intelligence area, useful for getting a quick view on data in databases and something our customers use, so we are interested in providing Tableau integration with the ScraperWiki platform.

The user interface for Tableau is moderately complex, hence my desire for a little directed learning. Tableau has a pretty good set of training videos and help pages online but this is no good to me since I do a lot of my reading on my commute where internet connectivity is poor.

Tableau is rather different to the plotting packages I’m used to using for data analysis. This comes back to the types of data I’m familiar with. As someone with a background in physical sciences I’m used to dealing with data which comprises a couple of vectors of continuous variables. So for example, if I’m doing spectroscopy then I’d expect to get a pair of vectors: the wavelength of light and the measured intensity of light at those wavelengths. Things do get more complicated than this, if I were doing a scattering experiment then I’d get an intensity and a direction (or possibly two directions). However, fundamentally the data is relatively straightforward.

Tableau is crafted to look at mixtures of continuous and categorical data, stored in a database table. Tableau comes with some sample datasets, one of which is sales data from superstores across the US which illustrates this well. This dataset has line entries of individual items sold with sale location data, product and customer (categorical) data alongside cost and profit (continuous) data. It is possible to plot continuous data but it isn’t Tableau’s forte.

Tableau expects data to be delivered in “clean” form, where “clean” means that spreadsheets and separated value files must be presented with a single header line with columns which contain data all of the same type. Tableau will also connect directly to a variety of databases. Tableau uses the Microsoft JET database engine to store it’s data, I know this because for some data unsightly wrangling is required to load data in the correct format. Once data is loaded Tableau’s performance is pretty good, I’ve been playing with the MOT data which is 50,000,000 or so lines, which for the range of operations I tried turned out to be fairly painless.

Turning to Larry Keller’s book, The Tableau 8.0 Training Manual: From Clutter to Clarity, this is one of few books currently available relating to the 8.0 release of Tableau. As described in the title it is a training manual, based on the courses that Larry delivers. The presentation is straightforward and unrelenting; during the course of the book you build 8 Tableau workbooks, in small, explicitly described steps. I worked through these in about 12 hours of screen time, and at the end of it I feel rather more comfortable using Tableau, if not expert. The coverage of Tableau’s functionality seems to be good, if not deep – that’s to say that as I look around the Tableau interface now I can at least say “I remember being here before”.

Some of the Tableau functionality I find a bit odd, for example I’m used to seeing box plots generated using R, or similar statistical package. From Clutter to Clarity shows how to make “box plots” but they look completely different. Similarly, I have a view as to what a heat map looks like and the Tableau implementation is not what I was expecting.

Personally I would have preferred a bit more explanation as to what I was doing. In common with Andy Kirk’s book on data visualisation I can see this book supplementing the presented course nicely, with the trainer providing some of the “why”. The book comes with some sample workbooks, available on request – apparently directly from the author whose email response time is uncannily quick.

Making a ScraperWiki view with R

 

This post was first published at ScraperWiki.

In a recent post I showed how to use the ScraperWiki Twitter Search Tool to capture tweets for analysis. I demonstrated this using a search on the #InspiringWomen hashtag, using Tableau to generate a visualisation.

Here I’m going to show a tool made using the R statistical programming language which can be used to view any Twitter Search dataset. R is very widely used in both academia and industry to carry out statistical analysis. It is open source and has a large community of users who are actively developing new libraries with new functionality.

Although this viewer is a trivial example, it can be used as a template for any other R-based viewer. To break the suspense this is what the output of the tool looks like:

R-view

The tool updates when the underlying data is updated, the Twitter Search tool checks for new tweets on an hourly basis. The tool shows the number of tweets found and a histogram of the times at which they were tweeted. To limit the time taken to generate a view the number of tweets is limited to 40,000. The histogram uses bins of one minute, so the vertical axis shows tweets per minute.

The code can all be found in this BitBucket repository.

The viewer is based on the knitr package for R, which generates reports in specified formats (HTML, PDF etc) from a source template file which contains R commands which are executed to generate content. In this case we use Rhtml, rather than the alternative Markdown, which enables us to specify custom CSS and JavaScript to integrate with the ScraperWiki platform.

ScraperWiki tools live in their own UNIX accounts called “boxes”, the code for the tool lives in a subdirectory, ~/tool, and web content in the ~/http directory is displayed. In this project the http directory contains a short JavaScript file, code.js, which by the magic of jQuery and some messy bash shell commands, puts the URL of the SQL endpoint into a file in the box. It also runs a package installation script once after the tool is first installed, the only package not already installed is the ggplot2 package.


function save_api_stub(){
scraperwiki.exec('echo "' + scraperwiki.readSettings().target.url + '" > ~/tool/dataset_url.txt; ')
}
function run_once_install_packages(){
scraperwiki.exec('run-one tool/runonce.R &> tool/log.txt &')
}
$(function(){
save_api_stub();
run_once_install_packages();
});

view raw

code.js

hosted with ❤ by GitHub

The ScraperWiki platform has an update hook, simply an executable file called update in the ~/tool/hooks/ directory which is executed when the underlying dataset changes.

This brings us to the meat of the viewer: the knitrview.R file calls the knitr package to take the view.Rhtml file and convert it into an index.html file in the http directory. The view.Rhtml file contains calls to some functions in R which are used to create the dynamic content.


#!/usr/bin/Rscript
# Script to knit a file 2013-08-08
# Ian Hopkinson
library(knitr)
.libPaths('/home/tool/R/libraries')
render_html()
knit("/home/tool/view.Rhtml",output="/home/tool/http/index.html")

view raw

knitrview.R

hosted with ❤ by GitHub

Code for interacting with the ScraperWiki platform is in the scraperwiki_utils.R file, this contains:

  • a function to read the SQL endpoint URL which is dumped into the box by some JavaScript used in the Rhtml template.
  • a function to read the JSON output from the SQL endpoint – this is a little convoluted since R cannot natively use https, and solutions to read https are different on Windows and Linux platforms.
  • a function to convert imported JSON dataframes to a clean dataframe. The data structure returned by the rjson package is comprised of lists of lists and requires reprocessing to the preferred vector based dataframe format.

Functions for generating the view elements are in view-source.R, this means that the R code embedded in the Rhtml template are simple function calls. The main plot is generated using the ggplot2 library. 


#!/usr/bin/Rscript
# Script to create r-view 2013-08-14
# Ian Hopkinson
source('scraperwiki_utils.R')
NumberOfTweets<-function(){
query = 'select count(*) from tweets'
number = ScraperWikiSQL(query)
return(number)
}
TweetsHistogram<-function(){
library("ggplot2")
library("scales")
#threshold = 20
bin = 60 # Size of the time bins in seconds
query = 'select created_at from tweets order by created_at limit 40000'
dates_raw = ScraperWikiSQL(query)
posix = strptime(dates_raw$created_at, "%Y-%m-%d %H:%M:%S+00:00")
num = as.POSIXct(posix)
Dates = data.frame(num)
p = qplot(num, data = Dates, binwidth = bin)
# This gets us out the histogram count values
counts = ggplot_build(p)$data[[1]]$count
timeticks = ggplot_build(p)$data[[1]]$x
# Calculate limits, method 1 – simple min and max of range
start = min(num)
finish = max(num)
minor = waiver() # Default breaks
major = waiver()
p = p+scale_x_datetime(limits = c(start, finish ),
breaks = major, minor_breaks = minor)
p = p + theme_bw() + xlab(NULL) + theme(axis.text.x = element_text(angle=45,
hjust = 1,
vjust = 1))
p = p + xlab('Date') + ylab('Tweets per minute') + ggtitle('Tweets per minute (Limited to 40000 tweets in total)')
return(p)
}

view raw

view-source.R

hosted with ❤ by GitHub

So there you go – not the world’s most exciting tool but it shows the way to make live reports on the ScraperWiki platform using R. Extensions to this would be to allow some user interaction, for example by allowing them to adjust the axis limits. This could be done either using JavaScript and vanilla R or using Shiny.

What would you do with R in ScraperWiki? Let me know in the comments below or by email: [email protected]

pdftables – a Python library for getting tables out of PDF files

This post was first published at ScraperWiki.

One of the top searches bringing people to the ScraperWiki blog is “how do I scrape PDFs?” The answer typically being “with difficulty”, but things are getting better all the time.

PDF is a page description format, it has no knowledge of the logical structure of a document such as where titles are, or paragraphs, or whether it’s two column format or one column. It just knows where characters are on the page. The plot below shows how characters are laid out for a large table in a PDF file.

AlmondBoard7_LTChar

This makes extracting structured data from PDF a little challenging.

Don’t get me wrong, PDF is a useful format in the right place, if someone sends me a CV – I expect to get it in PDF because it’s a read only format. Send it in Microsoft Word format and the implication is that I can edit it – which makes no sense.

I’ve been parsing PDF files for a few years now, to start with using simple online PDF to text converters, then with pdftohtml which gave me better location data for text and now using the Python pdfminer library which extracts non-text elements and as well as bonding words into sentences and coherent blocks. This classification is shown in the plot below, the blue boxes show where pdfminer has joined characters together to make text boxes (which may be words or sentences). The red boxes show lines and rectangles (i.e. non-text elements).

AlmondBoard7

More widely at ScraperWiki we’ve been processing PDF since our inception with the tools I’ve described above and also the commercial, Abbyy software.

As well as processing text documents such as parliamentary proceedings, we’re also interested in tables of numbers. This is where the pdftables library comes in, we’re working towards making scrapers which are indifferent to the format in which a table is stored, receiving them via the OKFN messytables library which takes adapters to different file types. We’ve already added support to messytables for HTML, now its time for PDF support using our new, version-much-less-than-one pdftables library.

Amongst the alternatives to our own efforts are Mozilla’s Tabula, written in Ruby and requiring the user to draw around the target table, and Abbyy’s software which is commercial rather than open source.

pdftables can take a file handle and tell you which pages have tables on them, it can extract the contents of a specified page as a single table and by extension it can return all of the tables of a document (at the rate of one per page). It’s possible, for simple tables to do this with no parameters but for more difficult layouts it currently takes hints in the form of words found on the top and bottom rows of the table you are looking for. The tables are returned as a list of list of lists of strings, along with a diagnostic object which you can use to make plots. If you’re using the messytables library you just get back a tableset object.

It turns out the defining characteristic of a data scientist is that I plot things at the drop of a hat, I want to see the data I’m handling. And so it is with the development of the pdftables algorithms. The method used is inspired by image analysis algorithms, similar to the Hough transforms used in Tabula. A Hough transform will find arbitrarily oriented lines in an image but our problem is a little simpler – we’re interested in vertical and horizontal rows.

To find these vertical rows and columns we project the bounding boxes of the text on a page onto the horizontal axis ( to find the columns) and the vertical axis to find the rows. By projection we mean counting up the number of text elements along a given horizontal or vertical line. The row and column boundaries are marked by low values, gullies, in the plot of the projection. The rows and columns of the table form high mountains, you can see this clearly in the plot below. Here we are looking at the PDF page at the level of individual characters, the plots at the top and left show the projections. The black dots show where pdftables has placed the row and column boundaries.

AlmondBoard8_projection

pdftables is currently useful for supervised use but not so good if you want to just throw PDF files at it. You can find pdftables on Github and you can see the functionality we are still working on in the issue tracker. Top priorities are finding more than one table on a page and identifying multi-column text layouts to help with this process.

You’re invited to have a play and tell us what you think – [email protected]

Book Review: Clean Code by Robert C. Martin

Clean Code Bookcover

This review was first published at ScraperWiki.

Following my revelations regarding sharing code with other people I thought I’d read more about the craft of writing code in the form of Clean Code: A Handbook of Agile Software Craftmanship by Robert C. Martin.

Despite the appearance of the word Agile in the title this isn’t a book explicitly about a particular methodology or technology. It is about the craft of programming, perhaps encapsulated best by the aphorism that a scout always leaves a campsite tidier than he found it. A good programmer should leave any code they touch in a better state than they found it. Martin has firm ideas on what “better” means.

After a somewhat sergeant-majorly introduction in which Martin tells us how hard this is all going to be, he heads off into his theme.

Martin doesn’t like comments, he doesn’t like switch statements, he doesn’t like flag arguments, he doesn’t like multiple arguments to functions, he doesn’t like long functions, he doesn’t like long classes, he doesn’t like Hungarian* notation, he doesn’t like output arguments…

This list of dislikes generally isn’t unreasonable; for example comments in code are in some ways an anachronism from when we didn’t use source control and were perhaps limited in the length of our function names. The compiler doesn’t care about the comments and does nothing to police them so comments can be actively misleading (Guilty, m’lud). Martin prefers the use of descriptive function and variable names with a clear hierarchical structure to the use of comments.

The Agile origins of the book are seen with the strong emphasis on testing, and Test Driven Development. As a new convert to testing I learnt a couple of things here: clearly written tests being as important as clearly written code, the importance of test coverage (how much of you code is exercised by tests).

I liked the idea of structuring functions in a code file hierarchically and trying to ensure that each function operates at a single layer of abstraction, I’m fairly sold on the idea that a function should do one thing, and one thing only. Although to my mind the difficulty is in the definition of “thing”.

It seems odd to use Java as the central, indeed only, programming language in this book. I find it endlessly cluttered by keywords used in the specification of functions and variables, so that any clarity in the structure and naming that the programmer introduces is hidden in the fog. The book also goes into excruciating detail on specific aspects of Java in a couple of chapters. As a testament to the force of the PEP8 coding standard, used for Python, I now find Java’s prevailing use of CamelCase visually disturbing!

There are a number of lengthy examples in the book, demonstrating code before and after cleaning with a detailed description of the rationale for each small change. I must admit I felt a little sleight of hand was involved here, Martin takes chunks of what he considers messy code typically involving longish functions and breaks them down into smaller functions, we are then typically presented with the highest level function with its neat list of function calls. The tripling of the size of the code in function declaration boilerplate is then elided.

The book finishes with a chapter on “[Code] Smells and Heuristics” which summarises the various “code smells” (as introduced by Martin Fowler in his book Refactoring: Improving the Design of Existing Code) and other indicators that your code needs a cleaning. This is the handy quick reference to the lessons to be learned from the book. 

Despite some qualms about the style, and the fanaticism of it all I did find this an enjoyable read and felt I’d learnt something. Fundamentally I like the idea of craftsmanship in coding, and it fits with code sharing.

*Hungarian notation is the habit of appending letter or letters to variables to indicate their type.