Tag: source control

Book review: Effective Computation in Physics by Anthony Scopatz & Kathryn D. Huff

ecipThis next review, of “Effective Computation in Physics” by Anthony Scopatz & Kathryn D. Huff, arose after a brief discussion on twitter with Mike Croucher after my review of “High Performance Python” by Ian Micha Gorelick and Ian Ozsvald. This in the context of introducing students, primarily in the sciences, to programming and software development.

I use the term “software development” deliberately. Scientists have been taught programming (badly, in my view*) for many years. Typically they are given a short course in the first year of their undergraduate training, where they are taught the crude mechanics of a programming language (typically FORTRAN, C, Matlab or Python). They are then left to it, perhaps taking up projects requiring significant coding as final year projects or in PhDs. The thing they have lacked is the wider skillset around programming – what you might call “software development”. The value of this is two-fold – firstly, it is a good training for a scientist to have for careers in science. Secondly, the wider software industry is full of scientists, providing students with a good grounding in this field is no bad thing for their future employability.

The book covers in at least outline all the things a scientist or engineer needs to know about software development. It is inspired by the Software Carpentry and The Hacker Within programmes.

The restriction to physics in the title seems needless to me. The material presented is mostly applicable to any science, and those working in the digital humanities, undertaking programming work. The examples have a physics basis but not to any great depth, and the decorative historical anecdotes are all physics based. Perhaps the only exception to this is the chapter on HDF5 which is a specialised data storage system, some coverage of SQL databases would make a reasonable substitute for a more general course. The chapter on parallel computing could likewise be dropped for a wider audience.

The book is divided into four broad sections. Including in these are chapters on:

  • Command line operations;
  • Programming in Python;
  • Build systems, version control, debugging and testing;
  • Documentation, publication, collaboration and licensing;

Command line operations are covered in two chunks, firstly in the basic navigation of the file system and files followed by a second chapter on “Regular Expressions” which covers find, grep, sed and awk – at a very basic level.

The introduction to Python is similarly staged with initial chapters covering the fundamentals of the core language, with sufficient detail and explanation that I learnt some new things**. Further chapters introduce core Python libraries for data analysis including NumPy, Pandas and matplotlib.

Beyond these core chapters on Python those on version control, debugging and testing are a welcome addition. Our dearest wish at ScraperWiki, a small software company where I worked until recently, was that new recruits and interns would come with at least some knowledge and habit for using source control (preferably Git). It is also nice to see some wider discussion of GitHub and the culture of Pull Requests and issue tracking. Systematic testing is also a useful skill to have, in fact my experience has been that formal testing is most useful for those most physics-like functions.

The final section covers documentation, publication and licensing. I found the short chapter on licensing rather useful, I’ve been working on some code to analyse LIDAR data and have made it public on GitHub, which helpfully asks which license I would like to use. As it turns out I chose the MIT license and this seems to be the correct one for the application. On publication the authors are Latex evangelists but students can chose to ignore their monomania on this point. Latex has a cult-like following in physics which I’ve never understood. I have written papers in Latex but much prefer Microsoft Word for creating documents, although Google Docs is nice for collaborative work. The view that a source control repository issue tracker might work for collaboration beyond coding is optimistic unless academics have changed radically in the last few years.

I’d say the only thing lacking was any mention of pair programming, although to be fair that is more a teaching method than course material. I found I learnt most when I had a goal of my own to work towards, and I had the opportunity to pair with people with more knowledge than I had. Actually, pairing with someone equally clueless in a particular technology can work pretty well.

There is a degree to which the book, particularly in this section strays into a fantasy of how the authors wish computational physics was undertaken, rather than describing how it is actually undertaken.

To me this is the ideal “Software development for scientists” undergraduate text, it is opinionated in places and I occasionally I found the style grating but nevertheless it covers the right bases.

*I’m happy to say this since I taught programming badly to physics undergraduates some years ago!

**People who know my Python skills will realise this is not an earthshattering claim.