Author's posts
May 19 2013
Three years of electronic books
It is customary to write reviews of things when they are fresh and new. This blog post is a little different in the sense that it is a review of 3 years of electronic book usage.
My entry to e-books was with the Kindle: a beautiful, crisp display, fantastic battery life but with a user interface which lagged behind smartphones of the time. More recently I have bought a Nexus 7 tablet on which I use the Kindle app, and very occasionally use my phone to read.
Primarily my reading on the Kindle has been fiction with a little modern politics, and the odd book on technology. I have tried non-fiction a couple of times but have been disappointed (the illustrations come out poorly). Fiction works well because there are just words: you start reading at the beginning of the book and carry on to the end in a linear fashion. The only real issue I’ve had is that sometimes, with multiple devices and careless clicking, it’s possible to lose your place; I found this more of a problem than with a physical book. My physical books I bookmark with rail tickets. Very occasionally they fall out, but then I have a rough memory of where they were in the book via the depth axis, and flicking rapidly through a book is easy (at a rate of pages per second) – a glimpse of a chapter start or the layout of paragraphs is enough to let you know where you are.
There are other times when the lack of a physical presence is galling: my house is full of books, many of which have migrated to the loft on the arrival of Thomas, my now-toddling son. But many still remain, visible to visitors. Slightly shamefaced, I admit to a certain pretension in my retention policy: Ulysses found shelf space for many years whilst science fiction and fantasy made a rapid exit. Non-fiction is generally kept. Books tell you of a person’s interests, and form an ad hoc lending library. In the same way as the beaver’s dam is part of its extended phenotype, my books are part of mine. With ebooks we largely lose this display function: I can publish my reading on services like Shelfari but this is not the same as books on shelves. The same applies to train reading: with physical books, readers can see what each other is reading.
Another missing aspect of physicality: I’ve read Reamde by Neal Stephenson, a book of a thousand pages, and JavaScript: The Good Parts by Douglas Crockford, only a hundred and fifty or so. The Kindle was the same size for both books! Really it needs some sort of inflatable bladder which inflates to match the number of pages in the book, perhaps deflating as you make your way through it.
Regular readers of this blog will know I blog what I read, at least for non-fiction. My scheme for this is to read, taking notes in Evernote. This doesn’t work so well on either the Kindle or the Kindle app: there is too much switching between apps. But the Kindle has notes and highlighting! I hear you say. Yes, it does, but it would appear digital rights management (DRM) has reduced its functionality – I can’t share my notes easily and, if your book is stored as a personal document because it didn’t come from the Kindle store, then you can’t even share notes across devices. I call this a DRM issue because I suspect the functionality is limited on purpose: without limits you could simply highlight a whole book, or perhaps copy and paste it. And obviously I can’t lend my ebook in the same way as I lend my physical books, or even donate it to charity when I’m finished with it.
This isn’t to say ebooks aren’t really useful – I can take plenty of books on holiday to read without filling my luggage, and I can get them at the last minute. I have a morbid fear of Running Out of Things To Read, which is assuaged by my ebook. In my experience, technology books at the cheaper / lower volume end of the market are also better electronically (and actually the ones I’ve read are relatively unencumbered by DRM), i.e. they come in colour whilst their physical counterparts do not.
Overall verdict: you can pack a lot of fiction onto an ebook but I’ve been using physical books for 40 years and humans have been using them for thousands of years and it shows!
May 09 2013
Book review: Interactive Data Visualization for the web by Scott Murray
This post was first published at ScraperWiki.
Next in my book reading, I turn to Interactive Data Visualization for the web by Scott Murray (@alignedleft on Twitter). This book covers the d3 JavaScript library for data visualisation, written by Mike Bostock who was also responsible for the Protovis library. If you’d like a taster of the book’s content, a number of the examples can also be found on the author’s website.
The book is largely aimed at web designers who are looking to include interactive data visualisations in their work. It includes some introductory material on JavaScript, HTML, and CSS, so has some value for programmers moving into web visualisation. I quite liked the repetition of this relatively basic material, and the conceptual introduction to the d3 library.
I found the book rather slow: on page 197 – approaching the final fifth of the book – we were still making a bar chart. A smaller effort was expended in that period on scatter graphs. As a data scientist, I expect to have several dozen plot types in that number of pages! This is something of which Scott warns us, though. d3 is a visualisation framework built for explanatory presentation (i.e. you know the story you want to tell) rather than being an exploratory tool (i.e. you want to find out about your data). To be clear: this “slowness” is not a fault of the book, rather a disjunction between the book and my expectations.
From a technical point of view, d3 works by binding data to elements in the DOM for a webpage. It’s possible to do this for any element type, but practically speaking only Scalable Vector Graphics (SVG) elements make real sense. This restriction means that d3 will only work in more recent browsers, which may be a problem for those trapped in some corporate environments. The library contains a lot of helper functions for generating scales, loading up data, selecting and modifying elements, animation and so forth. d3 is a low-level library; there is no PlotBarChart function.
Achieving the static effects demonstrated in this book using other tools such as R, Matlab, or Python would be a relatively straightforward task. The animations, transitions and interactivity would be more difficult to do. More widely, the d3 library supports the creation of hierarchical visualisations which I would struggle to create using other tools.
This book is quite a basic introduction; you can get a much better overview of what is possible with d3 by looking at the API documentation and the Gallery. Scott lists quite a few other resources including a wide range for the d3 library itself, systems built on d3, and alternatives to d3 if it is not the library you are looking for.
I can see myself using d3 in the future, perhaps not for building generic tools but for custom visualisations where the data is known and the aim is to best explain that data. Scott quotes Ben Shneiderman on this regarding the structure of such visualisations:
overview first, zoom and filter, then details on demand
Apr 23 2013
Book review: JavaScript: The Good Parts by Douglas Crockford
This post was first published at ScraperWiki.
This week I’ve been programming in JavaScript, something of a novelty for me. Jealous of the Dear Leader’s automatically summarize tool, I wanted to make something myself; hopefully a future post will describe my timeline visualising tool. Further motivations are that web scraping requires some knowledge of JavaScript since it is a key browser technology and, in its prototypical state, the ScraperWiki platform sometimes requires you to launch a console and type in JavaScript to do stuff.
I have two books on JavaScript. The one I review here is JavaScript: The Good Parts by Douglas Crockford – a slim volume which tersely describes what the author feels are the best bits of JavaScript, incidentally highlighting the bad bits. The second book is the JavaScript Bible by Danny Goodman, Michael Morrison, Paul Novitski and Tia Gustaff Rayl, which I bought some time ago, impressed by its sheer bulk, but which I am unlikely ever to read let alone review!
Learning new programming languages is easy in some senses: it’s generally straightforward to get something to happen simply because core syntax is common across many languages. The only seriously different language I’ve used is Haskell. The difficulty with programming languages is idiom, the parallel is with human languages: the barrier to making yourself understood in a language is low, but to speak fluently and elegantly needs a higher level of understanding which isn’t simply captured in grammar. Programming languages are by their nature flexible so it’s quite possible to write one in the style of another – whether you should do this is another question.
My first programming language was BASIC; I suspect I speak all other computer languages with a distinct BASIC accent. As an aside, Edsger Dijkstra has said:
[…] the teaching of BASIC should be rated as a criminal offence: it mutilates the mind beyond recovery.
– so perhaps there is no hope for me.
JavaScript has always felt to me a toy language: it originates in a web browser and relies on HTML to import libraries but nowadays it is available on servers in the form of node.js, has a wide range of mature libraries and is very widely used. So perhaps my prejudices are wrong.
The central idea of JavaScript: The Good Parts is to present an ideal subset of the language, the Good Parts, and ignore the less good parts. The particular bad parts of which I was glad to be warned:
- JavaScript arrays aren’t proper arrays with array-like performance, they are weird dictionaries;
- variables have function scope, not block scope;
- unless declared inside a function, variables have global scope;
- there is a difference between the equality operators == and === (and similarly the inequality operators). The short one coerces and then compares; the longer one does not coerce, and is thus preferred.
I liked the railroad presentation of syntax and the section on regular expressions is good too.
Elsewhere Crockford has spoken approvingly of CoffeeScript, which compiles to JavaScript but is arguably syntactically nicer; it appears to hide some of the bad parts of JavaScript which Crockford identifies.
If you are new to JavaScript but not to programming then this is a good book which will give you a fine start and warn you of some pitfalls. You should be aware that you are reading about Crockford’s ideal not the code you will find in the wild.
Apr 17 2013
Book review: R in Action by Robert I. Kabacoff
This post was first published at ScraperWiki.
This is a review of Robert I. Kabacoff’s book R in Action which is a guided tour around the statistical computing package, R.
My reasons for reading this book were two-fold: firstly, I’m interested in using R for statistical analysis and visualisation. Previously I’ve used Matlab for this type of work, but R is growing in importance in the data science and statistics communities; and it is a better fit for the ScraperWiki platform. Secondly, I feel the need to learn more statistics. As a physicist my exposure to statistics is relatively slight – I’ve wondered why this is the case and I’m eager to learn more.
In both cases I see this book as an atlas for the area rather than an A-Z streetmap. I’m looking to find out what is possible and where to learn more rather than necessarily finding the detail in this book.
R in Action follows a logical sequence of steps for importing, managing, analysing, and visualising data for some example cases. It introduces the fundamental mindset of R, in terms of syntax and concepts. Central among these is the data frame – a concept carried over from other statistical analysis packages. A data frame is a collection of variables which may have different types (continuous, categorical, character). The variables form the columns in a structure which looks like a matrix – the rows are known as observations. A simple data frame would contain the height, weight, name and gender of a set of people. R has extensive facilities for manipulating and reorganising data frames (I particularly like the sound of melt in the reshape library).
R also has some syntactic quirks. For example, the dot (.) character, often used as a structure accessor in other languages, is just another character as far as R is concerned. The $ character fulfills a structure accessor-like role. Kabacoff sticks with the R user’s affection for using <- as the assignment operator instead of =, which is what everyone else uses and which appears to work perfectly well in R.
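To make the data frame idea concrete, here is a minimal sketch in plain Python rather than R. It is not taken from the book and is not how R's reshape library is implemented; the `people` records and the `melt` helper are made up for illustration. It only shows the shape of the operation: each wide row (one observation with several measured variables) becomes several long rows, one per variable.

```python
# A tiny stand-in for a data frame: each dict is one observation (row),
# each key is a variable (column). Names and values are invented.
people = [
    {"name": "Alice", "gender": "F", "height_cm": 165.0, "weight_kg": 60.0},
    {"name": "Bob", "gender": "M", "height_cm": 180.0, "weight_kg": 82.5},
]


def melt(rows, id_vars):
    """Reshape wide rows into long rows: one row per (observation, measured variable)."""
    long_rows = []
    for row in rows:
        # Identifier columns are copied unchanged onto every long row.
        ids = {key: row[key] for key in id_vars}
        for key, value in row.items():
            if key not in id_vars:
                long_rows.append({**ids, "variable": key, "value": value})
    return long_rows


melted = melt(people, id_vars=["name", "gender"])
for row in melted:
    print(row)
```

Two people with two measurements each give four long rows, which is the form many plotting and summarising functions prefer.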
R offers a huge range of plot types out-of-the-box, with many more a package-install away (and installing packages is a trivial affair). Plots in the base package are workman-like but not the most beautiful. I liked the kernel density plots which give smoothed approximations to histogram plots and the rug plots which put little ticks on the axes to show where the data in the body of that plot fall. These are all shown in the plot below, plotted from example data included in R.
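As an aside, the idea behind a kernel density plot can be sketched in a few lines of Python (not R, and not code from the book): every data point contributes a small Gaussian bump, and the estimate is the average of those bumps. The sample values and the hand-picked `bandwidth` are assumptions for illustration only.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density as an average of Gaussian bumps."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian kernel per sample, centred on that sample.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


samples = [1.0, 1.2, 1.9, 2.1, 2.3, 4.8, 5.0]  # made-up data
density = gaussian_kde(samples, bandwidth=0.5)
for x in (1.0, 2.0, 3.5, 5.0):
    print(f"{x:4.1f}  {density(x):.3f}")
```

A larger bandwidth gives a smoother curve and a smaller one follows the individual points more closely, much as bin width does for a histogram.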
The ggplot2 package provides rather more beautiful plots and seems to be the choice for more serious users of R.
The statistical parts of the book cover regression, power analysis, methods for handling missing data, group comparison methods (t-tests and ANOVA), principal component and factor analysis, and permutation and bootstrap methods. I found it a really useful survey – enough to get the gist and understand the principles with pointers to more in-depth information.
One theme running through the book is that there are multiple ways of doing almost anything in R, as a result of its rich package ecosystem. This comes to something of a head with graphics in the final section: there are 4 different graphics systems with overlapping functionality but different syntax. This collides a little with the Matlab way of doing things where there is the one true path provided by Matlab alongside a fairly good, but less integrated, ecosystem of user-provided functionality.
R is really nice for this example-based approach because the base distribution includes many sample data sets with which to play. In addition, add-on packages often include sample data sets on which to experiment with the tools they provide. The code used in the book is all relatively short; the emphasis is on the data and analysis of the data rather than trying to build larger software objects. You can do an awful lot in a few lines of R.
As an answer to my statistical questions: it turns out that physics tends to focus on Gaussian-distributed, continuous variables, while statistics does not share this focus. Statistics is more generally interested in both categorical and continuous variables, and distributions cannot be assumed. For a physicist, experiments are designed where most variables are fixed, and the response of the system is measured as just one or two variables. Furthermore, there is typically a physical theory with which the data are fitted, rather than a need to derive an empirical model. These features mean that a physicist’s exposure to statistical methods is quite narrow.
Ultimately I don’t learn how to code by reading a book, I learn by solving a problem using the new tool – this is work in progress for me and R, so watch this space! As a taster, just half a dozen lines of code produced the appealing visualisation of twitter profiles shown below:
(Here’s the code: https://gist.github.com/IanHopkinson/5318354)
Apr 17 2013
Book review: Machine Learning in Action by Peter Harrington
This post was first published at ScraperWiki.
Machine learning is about prediction, and prediction is a valuable commodity. This sounds pretty cool and definitely the sort of thing a data scientist should be into, so I picked up Machine Learning in Action by Peter Harrington to get an overview of the area.
Amongst the examples covered in this book are:
- Given that a customer bought these items, what other items are they likely to want?
- Is my horse likely to die from colic given these symptoms?
- Is this email spam?
- Given that these representatives have voted this way in the past, how will they vote in future?
In order to make a prediction, machine learning algorithms take a set of features and a target for a training set of examples. Once the algorithm has been trained, it can take new feature sets and make predictions based on them. Let’s take a concrete example: if we were classifying birds, the birds’ features would include the weight, size, colour and so forth and the target would be the species. We would train the algorithm on an initial set of birds where we knew the species, then we would measure the features of unknown birds and submit these to the algorithm for classification.
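The bird example can be sketched with k-Nearest Neighbours, one of the classifiers the book covers. This is a minimal illustration in plain Python, not the book's listing: the species, the two features (weight and wingspan) and all the numbers are invented, and the `classify` helper is hypothetical.

```python
import math
from collections import Counter

# Hypothetical training set: (weight in g, wingspan in cm) -> species.
training = [
    ((20.0, 21.0), "robin"),
    ((18.0, 20.0), "robin"),
    ((22.0, 22.5), "robin"),
    ((480.0, 67.0), "pigeon"),
    ((510.0, 70.0), "pigeon"),
    ((450.0, 65.0), "pigeon"),
]


def classify(features, examples, k=3):
    """Predict a label by majority vote among the k nearest training examples."""
    by_distance = sorted(examples, key=lambda ex: math.dist(features, ex[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


unknown_bird = (470.0, 66.0)
print(classify(unknown_bird, training))
```

Here "training" is nothing more than storing the labelled examples. In a real problem the features would usually be rescaled first, since otherwise the one with the largest numeric range (weight, in this made-up data) dominates the distance.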
In this case, because we know the target – a species of bird – the algorithms we use would be referred to as “supervised learning.” This contrasts with “unsupervised learning,” where the target is unknown and the algorithm is seeking to make its own classification. This would be equivalent to the algorithm creating species of birds by clustering those with similar features. Classification is the prediction of categories (e.g. eye colour, like/dislike); alternatively, regression is used to predict the value of continuous variables (e.g. height, weight).
Machine Learning in Action is divided into four sections: classification, regression, unsupervised learning and “additional tools,” which include algorithms for dimension reduction and MapReduce – a framework for parallelisation. Dimension reduction is the process of identifying which features (or combination of features) are essential to a problem.
Each section includes Python code that implements the algorithms under discussion and these are applied to some toy problems. This gives the book the air of Numerical Recipes in FORTRAN, which is where I cut my teeth on numerical analysis. The mixture of code and prose is excellent for understanding exactly how an algorithm works, but it’s better to use a library implementation in real life.
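The unsupervised case can be sketched with k-means clustering, which the book also covers. Again this is an illustrative Python sketch rather than the book's code: the measurements are invented, no species labels are supplied, and the starting centroids are simply picked by hand (real implementations usually choose them randomly or with a smarter scheme).

```python
import math


def mean_point(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move the centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # An empty cluster keeps its old centroid.
        centroids = [mean_point(c) if c else old for c, old in zip(clusters, centroids)]
    return centroids, clusters


# Unlabelled birds: (weight in g, wingspan in cm). Values are made up.
birds = [(20.0, 21.0), (18.0, 20.0), (22.0, 22.5), (480.0, 67.0), (510.0, 70.0), (450.0, 65.0)]
centroids, clusters = k_means(birds, centroids=[birds[0], birds[3]])
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

The algorithm is never told what a "species" is; it just finds two groups of similar birds, and it is up to us to decide whether those groups mean anything.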
The algorithms covered are:
- Classification – k-Nearest Neighbours, decision trees, naive Bayes, logistic regression, support vector machines, and AdaBoost;
- Regression – linear regression, locally weighted linear regression, ridge regression, tree-based regression;
- Unsupervised learning – k-means clustering, apriori algorithm, FP-growth;
- Additional tools – principal component analysis and singular value decomposition.
Prerequisites for this book are relatively high: it assumes fair Python knowledge, some calculus, probability theory and matrix algebra.
I’ve seen a lot of mention of MapReduce without being clear what it was. Now I am more clear: it is a simple framework for carrying out parallel computation. Parallel computing has been around quite some time; the problem has always been designing algorithms that accommodate parallelisation (i.e. allow problems to be broken up into pieces which can be solved separately and then recombined). MapReduce doesn’t solve this problem but gives a recipe for what is required to run on a commodity compute cluster.
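The shape of that recipe can be shown on a single machine with the classic word-count example. This is a sketch of the pattern only, not of any real MapReduce framework: the documents and the `map_phase`, `shuffle` and `reduce_phase` helpers are invented for illustration, and on a cluster each map call and each reduce call could run on a different machine.

```python
from collections import defaultdict
from functools import reduce

documents = [
    "the quick brown fox",
    "the lazy dog",
    "the fox jumps over the dog",
]


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all the values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


# Each document is mapped independently, so this step parallelises trivially.
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["the"], counts["fox"], counts["dog"])
```

The hard part, as noted above, is recasting your algorithm as independent map and reduce steps in the first place; the framework only handles distributing them.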
As Harrington says: do you need to run MapReduce on a cluster to solve your data problem? Unless you are an operation on the scale of Google or Facebook then probably not. Current, commodity desktop hardware is surprisingly powerful particularly when coupled with subtle algorithms.
This book works better as an eBook than on paper, partly because the paper version is black and white and some figures require colour; however, the programming listings are often images and so their text remains small.