Book Review: Visualize This by Nathan Yau

This review is of Nathan Yau’s “Visualize This: The FlowingData Guide to Design, Visualization and Statistics”. It grows out of Yau’s blog, flowingdata.com, which I recommend, and also out of his experience in preparing graphics for The New York Times, amongst others.

The book is a run-through of pragmatic methods in visualisation, focusing on practical means of achieving ends rather than on more abstract design principles for data visualisation; if you want the latter then I recommend Tufte’s “The Visual Display of Quantitative Information”.

The book covers a bit of data scraping, that is, extracting useful numerical data from disparate sources; as Yau comments, this is the thing that takes the time in this type of activity. It also details methods for visualising time series data, proportions, geographic data and so forth.

The key tools involved are the R and Python programming languages. I already have these installed in the form of RStudio and Python(x,y), distributions which provide an environment that looks like the Matlab one with which I have long been familiar but which, sadly, is somewhat expensive for a hobby programmer. Alongside these are the freely available Processing language and the Protovis JavaScript library, which are good for interactive, online visualisations, and the commercial packages Adobe Illustrator, for vector graphic editing, and Adobe Flash Builder, for interactive web graphics. Again, these are tools I find out of my range financially for personal use, although Inkscape seems to be a good substitute for Illustrator.

With no prior knowledge of Flash and no Flash Builder, I found the sections on Flash a bit bewildering. They also highlight how this may be a book very distinctively of its time: with Apple not supporting Flash on the iPhone, it’s quite possible that the language will die out. I also notice on visiting the Protovis website that it is no longer under development, the authors having moved on to D3.js; OpenZoom, which is also mentioned, is no longer supported. Python has been around for some time now and is the lightweight language of choice for many scientists; similarly, R has been around for a while and is increasing in popularity.

You won’t learn to program from this book: if you can already program you’ll see that R is a nice language in which to quickly make a wide range of plots. If you can’t program then you may be surprised how few commands R requires to produce impressive results. As a beginner in R, I found the examples a nice tour of what is possible, along with some little tricks, such as the fact that many plot functions don’t take data frames as arguments: you need to extract the columns you want as arrays.

As well as programming, the book includes references to a range of data sources and online tools, for example colorbrewer2.org, a tool for selecting colour schemes, and links to the various mapping APIs.

Readers of this blog will know that I am an avid data scraper and visualiser myself, and in a sense this book is an overview of that way of working – in fact I see I referenced flowingdata in my attempts to colour in maps (here).

The big thing I learned from the book in terms of workflow is the application of a vector graphics package, such as Adobe Illustrator or Inkscape, to tidy up basic graphics produced in R. This strikes me as a very good idea: I’ve spent many a frustrating hour trying to get charts looking just right in the programming or plotting language of my choice, and now I discover that the professionals use a shortcut! A quick check shows that R exports to PDF, which Inkscape can read.

Stylistically the book is exceedingly chatty, including even the odd um and huh, which helps make it a quick and easy read, although it is a little grating. Many of the examples are also available over on flowingdata.com, although I notice that some are only accessible with paid membership. You might want to see the book as a way of showing your appreciation for the blog in physical and monetary form.

Look out for better looking visualisations from me in the future!

Book Review: The Visual Display of Quantitative Information by Edward R. Tufte

“The Visual Display of Quantitative Information” by Edward R. Tufte is a classic in the field of data graphics which I’ve been meaning to read for a while, largely because the useful presentation of data in graphic form is a core requirement for a scientist who works with experimental data. This is both for one’s own edification, helping to explore data, and also to communicate with an audience.

There’s been something of a resurgence in quantitative data graphics recently with the Gapminder project led by Hans Rosling, the work of David McCandless, and that of Nathan Yau at FlowingData.

The book itself is quite short but beautifully produced. It starts with a little history of the “data graphic”. By “data graphic” Tufte specifically means a drawing that is intended to transmit quantitative information, in contrast to a diagram which might be used to illustrate a method or facilitate a calculation. On this definition data graphics developed surprisingly late, during the 18th century. Tufte cites in particular the work of William Playfair, an engineer and political economist who is credited with the invention of the line chart, bar chart and pie chart, which he used to illustrate economic data. There appears to have been a fitful appearance of what might have been a data graphic in the 10th century, but to be honest it has more the air of a schematic diagram.

Also referenced are the data maps of Charles Joseph Minard. The example below shows the losses suffered by Napoleon’s army in its 1812 Russian campaign. The tan line shows the army’s advance on Moscow, its width proportional to the number of men remaining. The black line shows their retreat from Moscow. Along the bottom is a graph showing the temperature of the cold Russian winter at dates along their return.

[Figure: Minard’s map of Napoleon’s 1812 Russian campaign]

Interestingly, adding data to maps happened before the advent of the more conventional x-y plot, for example in Edmund Halley’s map of 1686 showing trade winds and monsoons.

Next up is “graphic integrity”: how graphics can be deceptive. This effect is measured using a Lie Factor: the size of the effect shown in the graphic divided by the size of the effect in the data. Particularly heroic diagrams achieve Lie Factors as large as 59.4. Tufte attributes much of this not to malice but to the division of labour in a news office, where graphic designers, rather than the owners and explainers of the data, are responsible for the design of graphics and tend to go for aesthetically pleasing designs rather than quantitatively accurate ones.
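The Lie Factor is simple enough to compute by hand, but a minimal sketch in Python makes the definition concrete. It assumes that the “size of effect” is measured as the relative change between two values, and the function names are my own invention. The numbers are those usually quoted for Tufte’s fuel-economy example, in which a line grows from 0.6 to 5.3 inches to represent a rise from 18 to 27.5 miles per gallon.

```python
def relative_change(first, second):
    """Size of an effect, measured as the change relative to the first value."""
    return (second - first) / first


def lie_factor(graphic_first, graphic_second, data_first, data_second):
    """Effect shown in the graphic divided by the effect present in the data."""
    shown = relative_change(graphic_first, graphic_second)
    actual = relative_change(data_first, data_second)
    return shown / actual


# Line lengths in inches, then the fuel-economy figures in miles per gallon.
print(round(lie_factor(0.6, 5.3, 18.0, 27.5), 1))  # prints 14.8
```

A Lie Factor of 1 means the graphic shows the effect at its true size; anything much above or below that is a distortion.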

Tufte then introduces his core rules, based around the idea of data-ink – that proportion of the ink on a page which is concerned directly with showing quantitative data:

  • Above all else show the data
  • Maximize the data-ink ratio
  • Erase non-data-ink
  • Erase redundant data-ink
  • Revise and edit.

A result of this is that some of the elements of a graph which you might consider essential, such as the plot axes, are cast aside and replaced by alternatives. For example, the dash-dot plot, where instead of solid axes dashes are used which show a 1-D projection of the data:

[Figure: dash-dot plot]

Or the range-frame plot, where the axes are truncated at the limits of the data. Actually, to be fully Tufte the axis labels would be placed at the ends of the data range, not at some rounded figure:

[Figure: range-frame plot]

Both of these examples are from Adam Hupp’s etframe library for Python. Another route to making Tufte-approved data graphics is by using the Protovis library, which was designed very specifically with Tufte’s ideas in mind.
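To show what a range-frame involves, here is a rough sketch that writes one as SVG using only the Python standard library. This is not how etframe or Protovis do it; the function name, sizes and styling are my own assumptions. The axis lines run only from the minimum to the maximum of the data, and the labels sit at those ends rather than at rounded values.

```python
def range_frame_svg(xs, ys, width=320, height=240, margin=40):
    """Return an SVG scatter plot whose axes span only the data range.

    Assumes xs and ys have equal length and each contains at least two
    distinct values (otherwise the scaling below would divide by zero).
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)

    def sx(x):  # data x -> pixel x
        return margin + (x - x_min) / (x_max - x_min) * (width - 2 * margin)

    def sy(y):  # data y -> pixel y (SVG y grows downwards)
        return height - margin - (y - y_min) / (y_max - y_min) * (height - 2 * margin)

    offset = 10  # gap between the data points and the axis lines
    x_axis_y = height - margin + offset
    y_axis_x = margin - offset

    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    # Axis lines stop at the data minimum and maximum, no further.
    parts.append(f'<line x1="{sx(x_min):.1f}" y1="{x_axis_y}" '
                 f'x2="{sx(x_max):.1f}" y2="{x_axis_y}" stroke="black"/>')
    parts.append(f'<line x1="{y_axis_x}" y1="{sy(y_min):.1f}" '
                 f'x2="{y_axis_x}" y2="{sy(y_max):.1f}" stroke="black"/>')
    # Labels go at the ends of the data range, not at rounded figures.
    for x in (x_min, x_max):
        parts.append(f'<text x="{sx(x):.1f}" y="{x_axis_y + 15}" font-size="10" '
                     f'text-anchor="middle">{x:g}</text>')
    for y in (y_min, y_max):
        parts.append(f'<text x="{y_axis_x - 5}" y="{sy(y):.1f}" font-size="10" '
                     f'text-anchor="end">{y:g}</text>')
    # The data themselves.
    for x, y in zip(xs, ys):
        parts.append(f'<circle cx="{sx(x):.1f}" cy="{sy(y):.1f}" r="2.5"/>')
    parts.append('</svg>')
    return "\n".join(parts)


svg = range_frame_svg([1.3, 2.7, 4.1], [10.2, 14.8, 12.5])
```

Writing the returned string to a file with a .svg extension gives something a browser or Inkscape can open.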

Tufte describes non-data-ink as “chartjunk”. Several things attract his ire, in particular the moiré effect achieved by patterns of closely spaced lines used for filling areas; neither is he fond of gridlines, except of the lightest sort. He doesn’t hold with colour or patterning in graphics, preferring shades of grey throughout. His argument against colour is that there is no “natural” sequence of colours which links to quantitative values.

What’s striking is that the styles recommended by Tufte are difficult to achieve with standard Office software, and even for the more advanced graphing software I use the results he seeks are not the out-of-the-box defaults and take a fair bit of arcane fiddling to reach. Not only this, some of his advice contradicts the instructions of learned journals on the production of graphics.

Two further introductions I liked were Chernoff faces, which use the human ability to discriminate faces to load a graph with meaning, and sparklines: tiny inline graphics showing how a variable varies in time without any of the usual graphing accoutrements: [Figure: sparkline], in this case one I borrowed from Joe Gregorio’s BitWorking.
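A sparkline can be roughed out in plain text using the Unicode block characters, which gives a feel for the idea without any graphics at all. This is a minimal sketch of my own, not the method used by BitWorking or described by Tufte; the function name and the choice of eight bar heights are assumptions.

```python
BARS = "▁▂▃▄▅▆▇█"  # eight block characters, from lowest to highest


def sparkline(values):
    """Map each value onto one of eight bar heights, scaled to the data range."""
    values = list(values)
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A flat series: draw every point at the lowest bar.
        return BARS[0] * len(values)
    top = len(BARS) - 1
    return "".join(BARS[round((v - lo) / span * top)] for v in values)


print(sparkline([3, 5, 9, 4, 2, 6, 8, 7]))
```

Because the bars are scaled to the minimum and maximum of the series, the result shows shape rather than absolute level, which is very much in the spirit of the thing.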

In the end Tufte has given me some interesting ideas on how to present data, although in practice I fear his style is a little too austere for my taste. There’s a quote attributed to Blaise Pascal:

I would have written a shorter letter, but I did not have the time.

I suspect the same is true of data graphics.

Footnote

Mrs SomeBeans has been referring to Tufte as Tufty, whom UK readers of a certain age will remember well.