This review was first published at ScraperWiki.
In the mixed environment of ScraperWiki we make use of a broad variety of tools for data analysis. Data Science at the Command Line by Jeroen Janssens covers tools available at the Linux command line for doing data analysis tasks. The book is divided thematically into chapters on Obtaining, Scrubbing, Exploring, Modeling, and Interpreting Data, with "intermezzo" chapters on parameterising shell scripts, using the Drake workflow tool, and parallelisation using GNU Parallel.
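To give a flavour of the intermezzo material, GNU Parallel fans a list of jobs out over multiple cores with a single line of shell. A minimal sketch of my own (the URL list file is purely illustrative):

```bash
# Fetch every URL listed in urls.txt, four downloads at a time;
# {} is replaced by each input line in turn
parallel -j4 'curl -sO {}' < urls.txt
```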
The original motivation for the book was a desire to move away from purely GUI-based approaches to data analysis (I think he means Excel and the Windows ecosystem). This is a common desire among data analysts: GUIs are very good for a quick look-see, but once you start wanting to repeat an analysis, or even a visualisation, they become more troublesome. And launching Excel just to remove a column of data seems a bit laborious. Windows does have its own command line in PowerShell, but it's little used by data scientists. This book is about the Linux command line, and all of the examples are available on a virtual machine populated with the tools discussed in the book.
The command line is at its strongest in the early steps of the data analysis process: getting data from places, carrying out relatively minor acts of tidying, and answering the question "does my data look remotely how I expect it to look?". Janssens introduces the battle-tested tools sed, awk, and cut, which we use around the office at ScraperWiki. He also introduces jq, the JSON processor; it's a more recent arrival but it's great for poking around in JSON files as commonly delivered by web APIs. An addition I hadn't seen before was csvkit, which provides a suite of tools for processing CSV at the command line; I particularly like the look of csvstat. csvkit is a Python tool, and I can imagine using it directly in Python as a library.
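To give a sense of these tools in action, here is a sketch of the sort of poking around the book encourages; the file name, API URL and field names are placeholders of my own rather than examples from the book:

```bash
# Does my data look remotely how I expect it to look?
head -n 5 data.csv

# Keep only the second and third columns
cut -d, -f2,3 data.csv

# Pull one field out of each record returned by a (hypothetical) JSON API
curl -s 'https://api.example.com/items' | jq '.[].name'

# Summary statistics for every column in one go, courtesy of csvkit
csvstat data.csv
```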
The style of the book is to provide a stream of practical examples for different command-line tools and to illustrate their application when strung together. I must admit to finding shell commands deeply cryptic in their presentation, with chunks of options effectively looking like someone typing a strong password. Data Science at the Command Line is not an attempt to clear up the mystery of these options, more an indication that you can work great wonders on finding the right incantation.
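For a taste of what I mean, here is a one-liner of my own devising (the file name is illustrative); the clusters of options are exactly the sort of thing that reads like a typed password:

```bash
# Extract a leading numeric ID from each row, then list the ten largest
sed -nre 's/^([0-9]+),.*/\1/p' data.csv | sort -rnu | head -n 10
```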
Next up is the Rio tool for using R at the command line, principally to generate plots. I suspect this is about where I part company with Janssens on his quest to use the command line for all the things. Systems like R, IPython and the IPython Notebook all offer a decent REPL (read-eval-print loop) which converts seamlessly into an actual program. I find I use these REPLs for experimentation whilst I build up a library of analysis functions for the job at hand. You can write an entire analysis program using the shell, but it doesn't mean you should!
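For reference, Rio works roughly like this, if I've remembered its conventions correctly: it pipes a CSV from stdin into an R data frame called df, evaluates the expression given to -e, and with -g loads ggplot2 so a plot can be streamed straight to a file. The tips.csv data and column names follow the book's running example:

```bash
# Quick summary statistics via R, without leaving the shell
< tips.csv Rio -e 'summary(df)'

# Generate a scatter plot as a PNG; g is a ready-made ggplot object
< tips.csv Rio -ge 'g + geom_point(aes(bill, tip))' > tips.png
```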
Weka provides a nice example of smoothing the command-line interface to an established package. Weka is a machine learning library written in Java; it is the code behind Data Mining: Practical Machine Learning Tools and Techniques. The edge to be smoothed is that the bare command line for Weka is somewhat involved, since it requires a whole pile of boilerplate. Janssens demonstrates nicely how to smooth it by automatically generating autocompletion hints for the parts of Weka which are accessible from the command line.
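To illustrate the boilerplate problem: a bare Weka invocation means supplying a Java classpath and a fully qualified class name every time. Below is the raw form alongside a minimal wrapper in the spirit of the one the book develops (the wrapper is my own sketch, not Janssens' code):

```bash
# The raw incantation: train a J48 decision tree on an ARFF file
java -cp /path/to/weka.jar weka.classifiers.trees.J48 -t iris.arff

# A minimal wrapper: prepend the classpath and the weka. package prefix
weka() {
  java -cp /path/to/weka.jar "weka.$1" "${@:2}"
}

# The same command then becomes:
weka classifiers.trees.J48 -t iris.arff
```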
The book starts by pitching the command line as a substitute for GUI-driven applications, which is something I can agree with to at least some degree. It finishes by proposing the command line as a replacement for a conventional programming language, with which I can't agree. My tendency would be to move from the command line to Python fairly rapidly, perhaps using IPython or the IPython Notebook as a stepping stone.
Data Science at the Command Line is definitely worth reading, if not following religiously. It's a showcase for what is possible, rather than a reference book on exactly how to do it.