Dr Administrator

Author's posts

Book review: Data Visualization: a successful design process by Andy Kirk

datavisualization_andykirk

This post was first published at ScraperWiki.

My next review is of Andy Kirk’s book Data Visualization: a successful design process. Those of you on Twitter might know him as @visualisingdata, where you can follow his progress around the world as he delivers training. He also blogs at Visualising Data.

Previously in this area, I’ve read Tufte’s book The Visual Display of Quantitative Information and Nathan Yau’s Visualize This. Tufte’s book is based around a theory of effective visualisation whilst Visualize This is a more practical guide featuring detailed code examples. Kirk’s book fits between the two: it contains some material on the more theoretical aspects of effective visualisation as well as an annotated list of software tools, but the majority of the book covers the end-to-end design process.

Data Visualization introduced me to Anscombe’s Quartet. The Quartet is four small datasets, each of eleven (x,y) coordinate pairs. The datasets are chosen so that the common statistical properties (e.g. mean values of x and y, standard deviations for same, linear regression coefficients) for each set are identical, but when plotted they look very different. The numbers are shown in the table below.

anscombesdata

Plotted they look like this:

anscombequartet

Aside from set 4, the numbers look unexceptional. However, the plots look strikingly different. We can easily classify their differences visually, despite the sets having the same gross statistical properties. This highlights the power of visualisation. As a scientist, I am constantly plotting the data I’m working on to see what is going on and as a sense check: eyeballing columns of numbers simply doesn’t work. Kirk notes that the design criteria for such exploratory visualisations are quite different from those highlighting particular aspects of a dataset, more abstract “data art” presentations, or interactive visualisations prepared for others to use.
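The claim that the four sets share their summary statistics is easy to check. The sketch below is my own illustration rather than anything from Kirk’s book: it uses the values from Anscombe’s 1973 paper, and the names `quartet`, `slope_intercept` and `summary` are made up for the example.

```python
from statistics import mean, stdev

# Anscombe's Quartet: sets 1-3 share the same x values, set 4 has its own.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def slope_intercept(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


# For each set: mean x, mean y, sd x, sd y, regression slope and intercept.
summary = {}
for label, (xs, ys) in quartet.items():
    slope, intercept = slope_intercept(xs, ys)
    summary[label] = (mean(xs), mean(ys), stdev(xs), stdev(ys), slope, intercept)
    print(label, [round(value, 2) for value in summary[label]])
```

Rounded to two decimal places the four rows should agree, which is exactly why the numbers alone give no hint of how different the plots are.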

In contrast to the books by Tufte and Yau, this book is much more about how to do data visualisation as a job. It talks pragmatically about getting briefs from the client and their demands. I suspect much of this would apply to any design work.

I liked Kirk’s “Eight Hats of data visualisation design” metaphor, which names the skills a visualiser requires: Initiator, Data Scientist, Journalist, Computer Scientist, Designer, Cognitive Scientist, Communicator and Project Manager. In part, this covers what you will require to do data visualisation, but it also gives you an idea of whom you might turn to for help – someone with the right hat.

The book is scattered with examples of interesting visualisations, alongside a comprehensive taxonomy of chart types. Unsurprisingly, the chart types are classified in much the same way as statistical methods: in terms of the variable categories to be displayed (i.e. continuous, categorical and subdivisions thereof). There is a temptation here though: I now want to make a Sankey diagram… even if my data doesn’t require it!

In terms of visualisation creation tools, there are no real surprises. Kirk cites Excel first, but this is reasonable: it’s powerful, ubiquitous, easy to use and produces decent results as long as you don’t blindly accept defaults or get tempted into using 3D pie charts. He also mentions the use of Adobe Illustrator or Inkscape to tidy up charts generated in more analysis-oriented packages such as R. With a programming background, the temptation is to fix problems with layout and design programmatically, which can be immensely difficult. Listed under programming environments is the D3 JavaScript library; this is a system I’m interested in using, having had some fun with Protovis, a D3 predecessor.

Data Visualization works very well as an ebook. The figures are in colour (unlike the printed book) and references are hyperlinked from the text. It’s quite a slim volume which I suspect complements Andy Kirk’s “in-person” courses well.

Book review: The Dinosaur Hunters by Deborah Cadbury

DinosaurHunters

A rapid change of gear for my book reviewing: having spent several months reading “The Eighth Day of Creation” I have completed “The Dinosaur Hunters” by Deborah Cadbury in only a couple of weeks. Is this a bad thing? Yes, and no – it’s been nice to read a book that rattles along at a good pace, is gripping and doesn’t have me leaping to make notes at every page – the downside is that I feel I have consumed a literary snack rather than a meal.

The Dinosaur Hunters covers the initial elucidation of the nature of large animal fossils, principally of dinosaurs, from around the beginning of the 19th century to just after the publication of Darwin’s “Origin of Species” in 1859. The book is centred around Gideon Mantell (1790-1852), who first described the Iguanodon and was an expert in the geology of the Weald, at the same time running a thriving medical practice in his home town of Lewes. Playing the part of Mantell’s nemesis is Richard Owen (1804-1892), who formally described the group of species, the Dinosauria, and was to be the driving force in the founding of the Natural History Museum in the later years of the 19th century. Smaller parts are played by Mary Anning (1799-1847), fossil collector based in Lyme Regis; William Buckland (1784-1856), who described Megalosaurus, the first of the dinosaurs, and spent much of his life trying to reconcile his Christian faith with new geological findings; Georges Cuvier (1769-1832), the noted French anatomist who related fossil anatomy to modern animal anatomy and identified the existence of extinctions (although he was a catastrophist who saw this as evidence of different epochs of extinction rather than a side effect of evolution); Charles Lyell (1797-1875), a champion of uniformitarianism (the idea that the modern geology is the result of processes visible today continuing over great amounts of time); Charles Darwin (1809-1882), who really needs no introduction; and Thomas Huxley (1825-1895), a muscular proponent of Darwin’s evolutionary theory.

For me a recurring theme was that of privilege and power in science. Often this is portrayed as something which disadvantaged women, but in this case Mantell is something of a victim too, as was William Smith as described in “The Map that Changed the World”. Mantell was desperate for recognition but held back by his full-time profession as a doctor in a minor town and his faith that his ability would lead automatically to recognition. Owen, on the other hand, with a similar background (and prodigious ability), went first to St Bartholomew’s hospital and then the Royal College of Surgeons, where he appears to have received better patronage but was also brutal and calculating in his ambition. Ultimately Owen over-reached himself in his scheming, and although he satisfied his desire to create a Natural History Museum, in death he left little personal legacy – his ability trumped by his dishonesty in trying to obliterate his opponents.

From a scientific point of view the thread of the book runs from the growing understanding of stratigraphy (i.e. the consistent sequence of rock deposits through Great Britain and into Europe), through the discovery of large fossil animals which had no modern equivalent and of an increasing range of these prehistoric remnants, each with their place in the stratigraphy, to the synthesis of these discoveries in Darwin’s theory of evolution. Progress in the intermediate discovery of fossils was slow because, in contrast to the early fossils of marine species such as ichthyosaurus and plesiosaurus, which were discovered substantially intact, later fossils of large land animals were found fragmented in Southern England. This made it difficult to identify the overall size of such species, and even the number of species present in a given pile of fossils.

These scientific discoveries collided with a social thread which saw the clergy, deeply involved in scientific discovery at the beginning, becoming increasingly discomforted by the account of the genesis of life in Scripture being incompatible with the findings in the stone. This ties in with a scientific community trying to make their discoveries compatible with Scripture and what they perceived to be the will of God, with the schism between the two eventually coming to a head with the publication of Darwin’s Origin of Species.

Occasionally the author drops into a bit of first person narration, which I must admit to finding a bit grating, perhaps because for people long dead it is largely inference. I’d have been very happy to have chosen this book for a long journey or a holiday; I liked the wider focus on a story rather than an individual.

Book review: The Eighth Day of Creation by Horace Freeland Judson

EighthDay

My reading moves seamlessly from the origins of cosmology (in Koestler’s Sleepwalkers) to the origins of molecular biology in “The Eighth Day of Creation” by Horace Freeland Judson. The book covers the revolution in biology starting with the elucidation of the structure of DNA through to how this leads to the synthesis, by organisms, of proteins. This covers a period from just before the Second World War to the early 1960s, although in the Epilogue and Afterwords Judson comments on the period up to the mid-nineties. Although the book does provide basic information on the core concepts (What is DNA? What is a protein?), I suspect it requires a degree of familiarity with these ideas to make much sense on a casual reading – the same applies to this blog post.

The first third or so of the book covers the elucidation of the structure of DNA. Three groups were working on this problem – that of Linus Pauling in the US, Franklin and Wilkins at King’s College in London and Crick and Watson in Cambridge. Key to the success of Crick and Watson was their collaboration: a willingness to talk to people who knew stuff they needed to know, and piecing the bits together. The structural features of their model were the helix form (this wasn’t news), specific and strong hydrogen bonding between bases, and the presence of two DNA chains (running in opposite directions). On the whole this wasn’t a new story to me, although I wasn’t familiar with the surrounding work which established DNA as the genetic material. Judson returns to the part of Rosalind Franklin in the discovery in one of the Afterwords. It has been said that Franklin was greatly wronged over the discovery of the structure of DNA, but Judson does not hold this view and I tend to agree with him. The core of the problem is that the Nobel Prize is not awarded posthumously, and with her death at 37 from cancer, Franklin therefore missed out. Watson’s book The Double Helix was a rather personalised view of the characters involved, most of whom were alive to carry out damage limitation, whilst Franklin was not – so here she was poorly treated, but by Watson rather than a whole community of scientists. Perhaps the thing that said the most to me about the situation is that after she was diagnosed with cancer she stayed with the Cricks at their home.

In parallel with the elucidation of the structure of DNA, work had been ongoing on understanding protein synthesis and genetics in viruses and bacteria. This included how information was coded into DNA, with much effort expended in trying to establish overlapping codes. There are 20 amino acids and four bases in DNA, so three base pairs are required to specify an amino acid if the amino acid sequence is to be unconstrained. It was conceivable that two consecutive amino acids are coded by fewer than 6 base pairs, but in this case there is a restriction on the possible amino acid sequences. This area was initiated by the physicist George Gamow. I struggle a bit to see how it gained so much traction: this type of model was quickly ruled out by consideration of the amino acid sequences that were being established for proteins at the time. It turns out that amino acids are coded by three consecutive base pairs with redundancy (so several different base pair triplets code for the same amino acid). Also covered was the mechanism by which data passed from DNA to the ribosomes where protein synthesis takes place; important here are adaptor molecules which carry the appropriate amino acid to the site of synthesis.
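The counting argument behind the triplet code can be spelled out in a few lines. This is my own sketch, not something from Judson’s book: the function names are invented, and the “fully overlapping” scheme here is a generic one in which consecutive triplets share two bases, not a reconstruction of Gamow’s actual proposal.

```python
from itertools import product

BASES = "ACGT"        # the four DNA bases
N_AMINO_ACIDS = 20    # amino acids that need a code word


def min_codon_length(n_symbols, n_targets):
    """Smallest word length giving at least n_targets distinct words."""
    length = 1
    while n_symbols ** length < n_targets:
        length += 1
    return length


codon_length = min_codon_length(len(BASES), N_AMINO_ACIDS)
codons = ["".join(word) for word in product(BASES, repeat=codon_length)]
print("codon length:", codon_length, "distinct codons:", len(codons))


def overlapping_successors(codon):
    """Codons that can follow `codon` if neighbouring codons share two bases."""
    return [codon[1:] + base for base in BASES]


# In the overlapping scheme two amino acids are read from only 4 bases,
# so fewer distinct pairs can be written than there are amino acid pairs.
encodable_pairs = sum(len(overlapping_successors(c)) for c in codons)
all_pairs = N_AMINO_ACIDS ** 2
print("encodable pairs:", encodable_pairs, "amino acid pairs:", all_pairs)
```

Two bases give only 16 words, which is too few, while three give 64, which leaves room for the redundancy mentioned above. The overlapping scheme falls short of covering every ordered pair of amino acids, which is the restriction that the measured protein sequences ruled out.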

Compared to the structure of DNA this work was a long difficult slog, involving intricate experiments with bacteria, bacteriophage viruses, bacterial sex, ultracentrifugation, chromatography and radiolabelling.

The final part of the book is on the elucidation of the structure of proteins. This was done using x-ray crystallography, with the very first clear scattering patterns measured in the 1930s and the first full elucidation made in the late fifties. X-ray crystallography of proteins, which contain many thousands of atoms, is challenging. Fundamentally there is an issue, the “phase problem”, which means you don’t have quite enough information to determine the structure from the scattering pattern. This issue was resolved by heavy atom labelling: you chemically attach a heavy atom such as mercury to your protein, then compare the scattering pattern of this modified protein with that of the unmodified protein, which resolves the phase problem. Nowadays measuring the thousands of spots in an x-ray scattering pattern and carrying out the thousands and thousands of calculations required to resolve the structure is relatively straightforward, but in the early days it was a massive manual labour.
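A toy version of the phase problem fits in a few lines. The sketch below is my own illustration and makes some loud simplifications: the “structure” is a made-up one-dimensional density, the scattering pattern is taken to be the squared magnitude of its discrete Fourier transform, and nothing here models a real crystal or heavy atom labelling.

```python
import cmath


def dft(signal):
    """Plain discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [
        sum(value * cmath.exp(-2j * cmath.pi * k * j / n)
            for j, value in enumerate(signal))
        for k in range(n)
    ]


def intensities(signal):
    """What a detector records: squared magnitudes, with the phases lost."""
    return [abs(amplitude) ** 2 for amplitude in dft(signal)]


# Two different hypothetical one-dimensional "electron densities".
density_a = [1.0, 2.0, 0.0, 0.0, 3.0, 0.0]
density_b = list(reversed(density_a))

same_pattern = all(
    abs(p - q) < 1e-9
    for p, q in zip(intensities(density_a), intensities(density_b))
)
print("densities differ:", density_a != density_b)
print("scattering patterns match:", same_pattern)
```

Both lines should report True: two distinct densities give the same measured pattern, so the pattern alone cannot be inverted without extra information about the phases.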

As well as resolving structure, a key discovery was made regarding the mode of action of proteins: essentially they work as adaptors between chemically distinct systems – when a molecule binds to one site on a protein it affects the ability of another type of molecule to bind to another site on the protein, through changes in the protein structure induced by the first molecule’s binding. This feature opens up huge possibilities for cell biology – in the absence of this feature interactions between chemical systems can only occur if the participants in those systems interact with each other chemically.

It isn’t something I’d really appreciated properly, but molecular biologists are quite organised in the organisms that they generally agree to work on. The truth is that there are uncountably many organisms and viruses, so to aid the progress of science one needs to select which ones to study: E. coli, the T series bacteriophages, C. elegans, D. melanogaster and more recently the zebrafish. They almost play the part of an extra author.

Molecular biology was apparently dominated by physicists. I must admit I found this confusing in the past, but Judson highlights the field as defined by its practitioners: biochemistry is about energy and matter (and typically small molecules), molecular biology is about information (and typically macromolecules) – a more natural home for physicists.

I found the first and third parts an enjoyable read; my scientific background is in scattering so the technical material was at least familiar. The central section on genetics I found fascinating but a bit of a slog. I’m somewhat in awe of the complexity of the experiments (and their apparent difficulty).

Looking back on my earlier book reviews, I read my comment on R.J. Evans’s book on historiography that history is a literary exercise as well as anything else. As a trained scientist this was something of an alien concept, but in common with Koestler’s book the style of this book shines through.

 

Enterprise data analysis and visualization

This post was first published at ScraperWiki.

The topic for today is a paper[1] by members of the Stanford Visualization Group on interviews with data analysts, entitled “Enterprise Data Analysis and Visualization: An Interview Study”. This is clearly relevant to us here at ScraperWiki, and thankfully their analysis fits in with the things we are trying to achieve.

The study is compiled from interviews with 35 data analysts across a range of business sectors including finance, health care, social networking, marketing and retail. The respondents were recruited via personal contacts and predominantly from Northern California; as such it is not a random sample, so we should consider results to be qualitatively indicative rather than quantitatively accurate.

The study identifies three classes of analyst whom they refer to as Hackers, Scripters and Application Users. The Hacker role was defined as those chaining together different analysis tools to reach a final data analysis. Scripters, on the other hand, conducted most of their analysis in one package such as R or Matlab and were less likely to scrape raw data sources. Scripters tended to carry out more sophisticated analysis than Hackers, with analysis and visualisation all in the single software package. Finally, Application Users worked largely in Excel with data supplied to them by IT departments. I suspect a wider survey would show a predominance of Application Users and a relatively smaller population of Hackers.

The authors divide the process of data analysis into 5 broad phases: Discovery – Wrangle – Profile – Model – Report. These phases are generally self-explanatory; wrangling is the process of parsing data into a format suitable for further analysis and profiling is the process of checking the data quality and establishing fully the nature of the data.
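To make the Wrangle and Profile phases concrete, here is a minimal sketch of my own; it is not taken from the paper. The log format, the sample lines and the `wrangle` and `profile` helpers are all hypothetical, chosen only to show the shape of the work.

```python
import re
from collections import Counter

# Hypothetical semi-structured log; the format is an assumption for illustration.
RAW_LOG = """\
2013-01-05 12:00:01 INFO user=alice action=login
2013-01-05 12:03:17 WARN user=bob action=upload size=1024
malformed line without a timestamp
2013-01-05 12:07:44 INFO user=alice action=logout
"""

LINE_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<rest>.*)$"
)


def wrangle(text):
    """Parse raw lines into dict records; keep unparseable lines separately."""
    records, rejected = [], []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if match is None:
            rejected.append(line)
            continue
        record = match.groupdict()
        rest = record.pop("rest")
        # Turn the trailing key=value pairs into extra fields.
        record.update(pair.split("=", 1) for pair in rest.split() if "=" in pair)
        records.append(record)
    return records, rejected


def profile(records):
    """Count how many records carry each field, as a crude quality check."""
    return Counter(key for record in records for key in record)


records, rejected = wrangle(RAW_LOG)
field_counts = profile(records)
print(len(records), "parsed,", len(rejected), "rejected")
print(dict(field_counts))
```

Even in a toy like this the profile is informative: one line fails to parse and the `size` field turns up in only one record, which is the sort of thing you want to know before modelling.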

This is all summarised in the figure below. Each column represents an individual, so we can see in this sample that Hackers predominate.

table

At the bottom of the table are identified the tools used, divided into database, scripting and modeling types. Looking across the tools in use, SQL is key in databases, Java and Python in scripting, and R and Excel in modeling. It’s interesting to note here that even the Hackers make quite heavy use of Excel.

The paper goes on to discuss the organizational and collaborative structures in which data analysts work; frequently an IT department is responsible for internal data sources and the productionising of analysis workflows.

It’s interesting to highlight the pain points identified by interviewees and interviewers:

  • scripts and intermediate data not shared;
  • discovery and wrangling are time consuming and tedious processes;
  • workflows not reusable;
  • ingesting semi-structured data such as log files is challenging.

Why does this happen? Typically the wrangling/scraping phase of the operation is ad hoc and the scripts used are short; practitioners don’t see this as their core expertise, and they’ll typically draw from a limited number of data sources, meaning there is little scope to build generic tools. Revision control tends not to be used, even for the scripting tools where it is relatively straightforward, perhaps because practitioners have not been introduced to revision control or simply see the code they write as too insignificant to bother with.

ScraperWiki has its roots in data journalism, open source software and community action but the tools we build are broadly applicable, as one of the respondents to the survey said:

“An analyst at a large hedge fund noted their organization’s ability to make use of publicly available but poorly-structured data was their primary advantage over competitors.”

References

[1] S. Kandel, A. Paepcke, J. M. Hellerstein, and J. Heer, “Enterprise Data Analysis and Visualization: An Interview Study,” IEEE Trans. Vis. Comput. Graph., vol. 18, no. 12, pp. 2917–2926, 2012.

More Shiny – Sony Vaio T13 laptop with Windows 8

SonyVaioT13

I thought I’d mix together a review of my shiny new laptop (a Sony Vaio T13) with one of Windows 8 which came pre-installed on the laptop.

The laptop

Six years after buying my last laptop I have replaced it with another Sony Vaio. At the time I bought the first one I didn’t think I would do this. My old Sony Vaio (VGN-SZ2M) is a nice machine but it was infested with Sony cruftware which added little functionality, and what it did try to add didn’t seem to work; the couriers Sony selected also left it with a neighbour without asking whether this was appropriate. It had a weird black plastic finish which was probably described as "carbon fibre". It’s worked fine, although I found the 80GB hard disk a little cramped and as the years went by it felt slower and slower when compared to the other machines I use.

After poking around extensively I finally decided on another Sony Vaio. Other contenders were the Lenovo Yoga 13 (limited availability and would that hinge really hold out?), the Acer Aspire S7 (more pricey for a poorer config and apparently no option for a big conventional drive) and offerings from Samsung, Toshiba and Dell; the bar for being a contender in this limited set was the touchscreen. I did look at non-touchscreen variants too and particularly liked the look of the Lenovo IdeaPad U410.

Having decided, I bought direct from Sony to get a bit more configuration flexibility, adding 8GB RAM, an i7 processor and going for the 32GB SSD/500GB conventional hard drive combination. This is an ultrabook class laptop with a 13.3" touchscreen, no optical drive, and Windows 8. I liked the idea of getting a pure SSD system but the price Sony charges for the upgrade is about double the price of the highly regarded Samsung 840 Pro series SSDs, so maybe I’ll be opening the thing up soon. It weighs 1.5kg, which is light but not the lightest in this class. I decided on a touchscreen since it didn’t seem to add hugely to the cost and it isn’t something you can retrofit should the desire arise.

It is a very beautiful thing: brushed metal with chromed highlights, and in its pristine state it comes out of hibernate very quickly.

Compared to my old laptop it has the same footprint, unsurprising since the screen is the same size. The keyboard is narrower though, losing a column of keys, but the device is about half the thickness – having lost the optical drive.

I worried a little about the monolithic touchpad with no separate left and right mouse buttons but it has a positive click in these two locations so I’ve not noticed the lack of separate buttons.

The screen resolution may be a little deficient (1366×768) but it is comparable with most of the laptops in its class and I intend using it on an external monitor anyway.

There is a small infestation of cruftware, featuring an update centre which seems to struggle to provide the necessary bandwidth and an update-able electronic manual which I can’t seem to get hold of because the instructions for downloading it take you around in a loop.

As if in pique my old desktop PC failed shortly after I got the new Vaio, so I’m using the Vaio as my sole computer for now. This works fine except it is a pain to install CD based software for various bits of hardware (quite why my video camera shipped with 4 CDs of software I don’t understand).

So overall – the Sony Vaio gets an A, a tick or some number of stars between 5 and 10.

Windows 8

I have a bit of a habit of getting computers with brand new Microsoft operating systems, although fortunately I skipped Windows Vista. Windows 8 takes a bit of getting used to; the best way of thinking about it is as Windows 7 with a mobile phone interface dropped on top of it. This is both good and bad. Personally I rather like Windows 7, and I’m also rather pleased with the Android-based touchscreen interface on my HTC Desire phone, but the combination of the two is a bit disturbing.

Actually "a bit disturbing" is wrong; "crap" would be better. The new style apps follow very different UI rules from conventional Windows apps and major in form over content – for example the pre-installed Twitter app, although pretty and swooshy with the touchscreen, is utterly useless as a Twitter client. Not only does it have limited functionality but in order to view anything but the briefest of timelines you need to flap your arm about like a deranged semaphorist. The Twitter app from Twitter is marginally more functional but looks like a portrait aspect ratio phone screen placed in the middle of a wide laptop screen. Comparing my Android phone and tablet, it strikes me few people have cracked scaling apps from phone to tablet size screens, let alone all the way to laptop screen sizes.

Live tiles offer interesting possibilities but they are constrained to one of two sizes, and I’ve yet to find one which does anything particularly interesting.

Microsoft is very keen for developers to write the mobile phone style apps; at one point the (free) Express version of Visual Studio was only going to allow developers to target the mobile phone style apps.

The only real redeeming feature of the new Windows 8 additions is that, once you’ve accepted the concept, the Start screen is better than the old Start button.

Not so long ago I would have "struck down upon thee with great vengeance and furious anger those who" touched the screen of any device I owned; these days I’m a little bit more relaxed. I find the touchscreen a nice adjunct to more conventional input, but I have a smeary screen now.

It seems to me there are a limited number of things you need to "get" about an operating system in order to use it with a peaceful mind. For Windows 7 a big one was that you didn’t need to go stumbling through a cascade of entries in the Start menu – you just started typing the name of your desired application into the search box and it was revealed fairly promptly. Start typing when you are on the Windows 8 Start screen and you launch just such a search – how the hell you’re supposed to know this is a mystery to me. And this seems like one of the core problems with Windows 8 – there are some nice little interface features but there’s no way you would guess they were there or find them by accident.

Windows 8 is keen for you to login using a Microsoft account. It is possible to just use a local account but I thought “in for a penny, in for a pound” and went ahead and set one up. Interestingly you can see the benefit of this approach when using Google Chrome: when I installed Chrome it automatically installed the plugins I have on other PCs, my autocorrect settings and so forth – instantly I was at home. I guess this is the longer term plan for Windows 8. It also wants me to have an Xbox account to buy music and video.

Some hints for new users of Windows 8:

  • To shift tiles around on the Start page, hold them and drag them up or down initially (not left-right); to zoom out, drag them towards the bottom of the screen;
  • If you use Google Chrome as your default browser the title bar icons (minimise, maximise and close) disappear; to fix this don’t use it as your default browser;
  • There exist both new style and old style applications, some things are available in both formats, for example Dropbox. The new-style apps resemble phone apps but offer limited functionality;
  • New-style apps don’t have an "exit" button, simply navigate away from them as you would a phone app;
  • The Start screen replaces the Start menu on the old Windows 7 desktop, to search for anything just start typing!
  • Windows 8 style apps cannot play MPEG2 files; this is only available for Windows 8 Pro with added Windows Media Centre. Windows Media Player will play them (with suitable codecs installed – I used Shark007) and VLC player works fine.

On the last item: this seems a bit bonkers. The video app on the mobile-style interface can see your video library, perhaps containing an unrelenting series of videos of your growing child which will almost inevitably be in MPEG2 format by default, so crippling this functionality seems a bit stupid.

Bottom line: Windows 8 is very pretty and the Start screen is, in my view, better than the old Windows 7 Start menu once you’ve got your head around it. The idea of putting a mobile phone interface, with mobile phone style apps, on top of a desktop interface is stupid – my opinion on this may change if I see some apps that are optimised for laptops. Mobile interfaces such as iOS and Android are optimised for consumption, which is fine, but many people will still be getting PC class devices to do “work” and in the main the new mobile interface in Windows 8 gets in the way of that.

And now to install Ubuntu on it… a process so exciting I have made it the subject of a second blog post.