
Book Review: Scala for the Impatient by Cay S. Horstmann

I thought I should learn a new language, and Scala seemed like a good choice, so I got Scala for the Impatient by Cay S. Horstmann.

Scala is a functional programming language which supports object orientation too. I’m attracted to it for a number of reasons. Firstly, I’m using, or considering using, a number of technologies which are based on Java – such as Elasticsearch, Neo4j and Spark. Although there are bindings to my favoured language, Python, for Spark in particular I feel like a second-class citizen. Scala, running as it does on the Java Virtual Machine, allows you to import Java functions easily and so gives better access to these systems.

I’m also attracted to Scala because it is rather less verbose than Java. It feels like some of the core aspects of the language ecosystem (like the dependency manager and testing frameworks) have matured rapidly, although the range of available libraries is smaller than that of older languages.

Scala for the Impatient gets on with providing details of the language without much preamble. Its working assumption is that you’re somewhat familiar with Java, and so concepts are explained relative to Java. I felt it also assumed you knew the broad features of the language, since it makes some use of forward referencing – features are used in an example before being explained later in the book.

I must admit programming in Scala is a bit of a culture shock after Python. Partly this is because it’s compiled rather than interpreted, although the environment does what it can to elide this difference – Scala has a REPL (read-evaluate-print loop) which works in the background by doing a quick compile. This allows you to play around with the language very easily. The second difference is static typing – Scala is a friendly statically typed language in the sense that if you initialise something with a string value it doesn’t force you to tell it you want this to be a string. But everything does have a very definite type. It follows the modern hipster style of putting the type after the symbol name (i.e. var somevariablename: Int = 5), as in Go, rather than before, as in earlier languages (i.e. int somevariablename = 5).
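A minimal sketch of this at the REPL (the variable names are my own):

    var count = 5          // type inferred as Int from the initialiser
    var total: Int = 5     // or the type can be given explicitly, after the name
    // count = "five"      // does not compile: count definitely has type Int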

You have to declare new variables as either var or val. Variables (var) are mutable and values (val) are immutable. It strikes me that static typing and this feature should fix half of my programming errors, which in a dynamically typed language are usually mis-spelled variable names, things changed as a side effect, and the wrong type of thing put into a variable – usually during I/O.
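A minimal sketch of the distinction (again with my own names):

    val limit = 10     // a value: immutable
    // limit = 11      // does not compile: reassignment to val
    var counter = 0    // a variable: mutable
    counter += 1       // fine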

The book starts with chapters on basic control structures and data types, moving on to classes and objects and the collection data types. There are odd chapters on file handling and regular expressions, and also on XML processing, which is built into the language, although it does not implement the popular XPath query language for XML. There is also a chapter on the parsing of formal grammars.

I found the chapter on futures and promises fascinating. These are relatively new ways of handling concurrency and parallelism which I hadn’t been exposed to before; I notice they have recently been introduced to Python.
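A minimal sketch of the idea, using Scala’s standard scala.concurrent library (the computation is my own toy example):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    // A Future runs asynchronously on a thread pool...
    val answer: Future[Int] = Future { (1 to 1000).sum }
    // ...and combinators like map run when it completes.
    val doubled = answer.map(_ * 2)
    println(Await.result(doubled, 1.second)) // blocking, for demonstration only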

Chapters on type parameters, advanced types and implicits had me mostly confused, although the early parts were straightforward enough. I’d heard of templated classes and data structures but, as someone programming mainly in dynamically typed languages, I hadn’t had any call for them. It turns out templating is a whole lot more complicated than I realised!
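A minimal sketch of a type parameter, before things get complicated (the class is my own toy example):

    // Pair is parameterised over the types of its two members.
    class Pair[A, B](val first: A, val second: B) {
      def swap: Pair[B, A] = new Pair(second, first)
    }

    val p = new Pair(1, "one") // inferred as Pair[Int, String]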

My favourite chapter was the one on collections – perhaps because I’m a data scientist, and collections are where I put my data. Scala has a rich collection of collections, and of methods operating on collections. It avoids the abomination of the Python “dictionary”, whose members are not ordered as you might expect. Scala calls such a data structure a HashMap.
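A minimal sketch of the style (the data are my own):

    val scores = Map("alice" -> 3, "bob" -> 5, "carol" -> 4) // immutable by default
    val passed = scores.filter { case (_, s) => s >= 4 }     // Map(bob -> 5, carol -> 4)
    val names  = scores.keys.toList.sorted                   // List(alice, bob, carol)
    val total  = scores.values.sum                           // 12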

It remains to be seen whether reading, once again, chapters on object-oriented programming will result in me writing object-oriented programs. It hasn’t done in the past.

Scala for the Impatient doesn’t really cover the mechanics of installing Scala on your system or the development environment you might use but then such information tends to go stale fast and depends on platform. I will likely write a post on this, since installing Scala and its build tool, sbt, behind a corporate proxy was a small adventure.

Googling for further help I found myself at the Scala Cookbook by Alvin Alexander quite frequently. The definitive reference book for Scala is Programming in Scala by Martin Odersky, Lex Spoon and Bill Venners. Resorting to my now familiar technique of searching the acknowledgements for women working in the area, I found Susan Potter whose website is here.

Scala for the Impatient is well-named: it whistles through the language at a brisk pace, assuming you know how to program. It highlights the differences from Java, and provides you with the vocabulary to find out more.

Book review: BDD in Action by John Ferguson Smart

Back to technical reading with this book, BDD in Action by John Ferguson Smart. BDD stands for Behaviour Driven Development, a relatively new technique for specifying software requirements.

Behaviour Driven Development is an evolution of the Agile software development methodology, which has project managers writing “stories” to describe features and developers writing automated tests to guide the writing of code – this latter part is called “test-driven development”. In Behaviour Driven Development the project manager, along with colleagues who may be business analysts, testers and developers, writes structured but still “natural language” acceptance criteria, which are translated into tests that are executed automatically.

Behaviour Driven Development was invented by Dan North whilst at Thoughtworks in London, where he wrote the first BDD test framework, JBehave, and defined the Given-When-Then language of the tests, now known as Gherkin. Gherkin looks like this:

Scenario: Register for online banking
  Given that Bill wants to register for online banking
  When he submits his application online
  Then his application should be created in a pending state
  And he should be sent a PDF contract to sign by email

The scenario describes the feature that we are trying to implement, and the Given-When-Then steps describe the test: Given is the setup, When is an action, and Then is the expected outcome. The developer writes so-called “step definitions” which map to these steps, and the BDD test framework arranges the running of the tests and the collection of results. There is a bit more to Gherkin than the snippet above encompasses; it can provide named variables and values, and even tables of values and outputs to be fed to the tests.
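To give a feel for the mapping from Gherkin text to code, here is a schematic sketch in plain Scala – not the API of any real BDD framework, just an illustration of the idea:

    // Each step definition pairs a Gherkin step with the code that executes it.
    val steps: Map[String, () => Unit] = Map(
      "he submits his application online" ->
        (() => println("submitting test application")),
      "his application should be created in a pending state" ->
        (() => println("asserting application state is pending"))
    )

    // The framework matches each step of a scenario to a definition and runs it.
    def run(step: String): Unit =
      steps.getOrElse(step, () => sys.error(s"undefined step: $step"))()

    run("he submits his application online")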

Subsequently BDD frameworks have been written for other languages, such as Lettuce for Python, SpecFlow for .NET and Cucumber for Ruby. There are higher-level tools such as Thucydides and Cucumber Reports. These tools can be used to generate so-called “living documentation”, where the documentation is guaranteed to describe the developed application because it describes the tests around which the application was built. Of course it is possible to write poorly considered tests, and thus poor living documentation, but the alternative is writing documentation completely divorced from the code.

Reading the paragraph above I can see that for non-developers the choice of names may seem a bit whacky but that’s a foible of developers. I still have no idea how to pronounce Thucydides and my spelling of it is erratic.

BDD in Action describes all of this process including the non-technical parts of writing the test scenarios, and the execution of those scenarios using appropriate tools. It takes care to present examples across the range of languages and BDD frameworks. This is quite useful since it exposes some of how the different languages work and also shows the various dialects of Gherkin. BDD in Action also covers processes such as continuous integration and integration testing using Selenium.

As someone currently more on the developer side of the fence than the (non-coding) project manager side, BDD seems to add additional layers of complexity: now I need a library to link my BDD-style tests to actual code, and whilst I’m at it I may also include a test-runner library and a library for writing unit tests in BDD style (such as Spock).

I’ve had some experience of managing Agile development, and with that hat on BDD feels more promising: in principle I can now capture capabilities and feature requirements with my stakeholders in a language that my developers can run as code. Ideally BDD makes the project manager and stakeholders discuss the requirements in the form of explicit examples which the developers will code against.

BDD in Action has reminded me why I haven’t spent much time using Java: everything is buried deep in directories, there are curly brackets everywhere and lots of boilerplate!

I suspect I won’t be using BDD in my current work, but I’ll keep it in the back of my mind for when the need arises. Even without the tooling it is a different way of talking to stakeholders about requirements. From a technical point of view I’m thinking of switching my test naming conventions to methods like test_that_this_function_does_something, arranged in classes named like WhenIWantToDoThisThing, as proposed in the text.
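A minimal sketch of that convention, here with JUnit-style annotations (the class, method and computation are my own illustrations):

    import org.junit.Test

    // The class names the scenario; the method names the expected behaviour.
    class WhenIWantToDoThisThing {
      @Test
      def test_that_this_function_does_something(): Unit = {
        val result = List(1, 2, 3).sum // stand-in for the code under test
        assert(result == 6)
      }
    }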

In keeping with my newfound sensitivity to the lack of women in technical writing, I scanned the acknowledgements for women and found Liz Keogh – who is also mentioned a number of times in the text as an experienced practitioner of BDD. You can find Liz Keogh here. I did look for books on BDD written by women but I could find none.

If you want to know what Behaviour Driven Development is about, and you want to get a feel for how it looks technically in practice (without a firm commitment to any development language or libraries), then BDD in Action is a good place to start.

Book review: Working effectively with legacy code by Michael C. Feathers

Working effectively with legacy code by Michael C. Feathers is one of the programmer’s classic texts. I’d seen it lying around the office at ScraperWiki but hadn’t picked it up, since I didn’t think I was working with legacy code. I returned to read it having found it at the top of the list of recommended programming books from Stack Overflow at dev-books. Reading the description I learnt that it’s more a book about testing than about legacy code: Feathers defines legacy code simply as code without tests; he is of the Agile school of software development, for whom tests are central.

With this in mind I thought it would be a useful read for me to improve my own code with the application of better tests and perhaps incidentally picking up some object-oriented style, in which I am currently lacking.

Following the theme of my previous blog post on women authors, I note that there are two women authors in the 30 books on the dev-books list. It’s interesting that a number of books in the style of Working effectively explicitly reference women as project managers or testers in the text, i.e. part of the team – I take this as a recognition that there exists a problem which needs to be addressed, and this is pretty much the least you can do. However, beyond the family, friends and publishing team, the acknowledgements mention one woman in a lengthy list.

The book starts with a general overview of the techniques it will introduce, including the tools used to apply them. These come down to testing frameworks and the refactoring tools found in many IDEs. The examples in the book are typically written in C++ or Java. I particularly liked the introduction of the ideas of the “seam” – a place where behaviour can be changed without editing the code – and the “enabling point” – the place where a change can be made at that seam. A seam may be a class that can be replaced by another one, or a value that can be altered. In desperate cases (in C) the preprocessor can be used to invoke test-time changes in the executed code.
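A minimal sketch of a seam – written here in Scala rather than the book’s Java or C++, with my own toy classes – where the collaborator is the seam and the constructor parameter is the enabling point:

    // The seam: billing depends on an abstraction, not on a concrete gateway.
    trait PaymentGateway { def charge(amount: Double): Boolean }

    class RealGateway extends PaymentGateway {
      def charge(amount: Double): Boolean = true // would talk to the real service
    }

    // The enabling point: the constructor lets a test substitute the collaborator.
    class BillingService(gateway: PaymentGateway) {
      def bill(amount: Double): Boolean = gateway.charge(amount)
    }

    // A test replaces the gateway without editing BillingService at all.
    val fake = new PaymentGateway { def charge(amount: Double): Boolean = true }
    assert(new BillingService(fake).bill(9.99))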

There are then a set of chapters that answer questions that a legacy code-ridden developer might have such as:

  • I can’t get this class into a test harness
  • How do I know that I’m not breaking anything?
  • I need to make a change. What methods should I test?

This makes the book easy to navigate, if a bit inelegant. It seems to me that the book addresses two problems in getting suitably sized pieces of code into a test harness. One is breaking the code into suitably sized pieces by, for example, extracting methods. The other is gaining independence for those pieces of code, such that they can be tested without building a huge infrastructure to support them.
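A minimal sketch of the first of these, method extraction (my own toy example):

    // Before: one function mixes parsing and validation, so neither is testable alone.
    def processBefore(line: String): Boolean = {
      val fields = line.split(",").map(_.trim)
      fields.length == 3 && fields(2).forall(_.isDigit)
    }

    // After: each extracted method is small enough to test on its own.
    def parse(line: String): Array[String] = line.split(",").map(_.trim)
    def isValid(fields: Array[String]): Boolean =
      fields.length == 3 && fields(2).forall(_.isDigit)
    def process(line: String): Boolean = isValid(parse(line))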

Although I’ve not done any serious programming in Java or C++, I felt I generally understood the examples presented. My favoured language is Python, and the problems I tackle tend to be more amenable to a functional style of programming. Despite this I think many of the methods described are highly relevant – particularly those describing how to break down monster functions. The book is highly pragmatic: it accepts that the world is not full of applications in which beautiful structure diagrams are replicated by beautiful code.

There are differences between these compiled object-oriented languages and Python, though. C#, Java and C++ all have a collection of keywords (like public, private, protected, static and final) which control which methods on a class are visible and whether they can be overridden or replaced. These features present challenges for bringing legacy code under test. Python, on the other hand, has a “gentleman’s agreement” that method names starting with an underscore are private, but that’s it, and there are no mechanisms to prevent you using these “private” functions! Similarly, pretty much any method in Python can be overridden by monkey-patching; that’s to say, if you don’t like a function in an imported library you can simply overwrite it with your own version after you’ve imported the library. This is not necessarily a good thing. A second difference is that Python comes with a unit testing framework and a mocking library, rather than these being third-party additions – although, to be fair, the mocking library in Python was originally third-party.

I’ve often felt I should program in a more object-oriented style, but this book has made me reconsider. It’s quite clear that spaghetti code can be written in an object-oriented language as well as in any other. And I suspect the data processing for which I am normally coding fits very well with a functional style. The ideas of single-responsibility functions, and of testing, still fit well with more functional programming styles.

Working effectively is readable and pragmatic. I suspect the developer’s dirty secret is that actually we wrote the legacy code that we’re now trying to fix.

Book review: Weapons of Math Destruction by Cathy O’Neil

Obviously for any UK anglophone the title of Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy by Cathy O’Neil is going to be a bit grating. The book is an account of how algorithms can ruin people’s lives. To a degree the “Big Data” in the subtitle is incidental.

Cathy O’Neil started her career as a mathematician, then worked as a quant for the hedge fund D.E. Shaw before moving to Intent Media to work as a data scientist. It’s nice to know that I’m not the only person to have become a data scientist largely by writing “data scientist” on their CV! Nowadays she is an activist in the Occupy movement.

The book is the result of O’Neil’s revelation that algorithms were often used destructively and are responsible for gross injustices. Algorithms in this case are models that determine how companies, and sometimes governments, deal with their employees, customers and citizens: whether they are offered loans, adverts of a particular sort, employment, termination or a lengthy prison sentence.

The book starts with her experience at Shaw, where she saw the subprime mortgage crisis from quite close up. In a nutshell, the subprime mortgage crisis happened because it was in the interests of most of the players in the industry for the stated risk of these mortgages to be minimised. The ratings agencies were paid by the aggregators of these mortgages to rate their risk, and the purchasers of those ratings had an interest in them being low – so the ratings agencies duly obliged.

The book goes on to cover a number of other “Weapons of Math Destruction”, including models for recruitment, insurance, credit rating, scheduling (for work), politics and policing. So, for example, there are the predictive policing algorithms which direct the police to particular parts of town in an effort to reduce serious crime. The police consequently record more anti-social behaviour there, which leads the algorithm to send them there again, because it turns out that serious crime is quite rare but anti-social behaviour isn’t (so there’s more data to draw on). And the police in a number of countries follow the “zero-tolerance” model, which says that if you address minor misdemeanours then more serious crimes are fixed automatically. The problem with this approach in the US is that the police are sent to black neighbourhoods repeatedly (rather than, say, college campuses) and the model is self-reinforcing.

O’Neil identifies several systematic problems which are typical of Weapons of Math Destruction: the use of proxies rather than “real outcomes”, the lack of feedback from outcomes to the model, the scale on which the model impacts people, the lack of fairness built into the model, the opacity of the models, and the damage the models can do. The damage is extensive: these WMDs can lead to you being arrested, incarcerated for lengthy periods, denied a job, denied medical insurance, or offered loans at the most extortionate rates to complete courses at rather low-rated universities.

The book is focused almost entirely on the US; in fact the only mention of a place outside the US is of policing in the “city of Kent”. However, O’Neil does seem to rate the data and privacy legislation in Europe, where consumers should be told of the purposes to which data will be put when they supply it. Even in the States the law provides some limits on certain types of model (such as credit scoring), but these laws have not kept pace with new developments, nor are they necessarily easy to use. For example, if your credit score is wrong, fixing it – although legally mandated – is not quick and easy.

Perhaps her most telling comment is that computers don’t understand fairness, and certainly don’t exhibit fairness if they are not asked to optimise for it. Which does lead to the question: “How do you implement fairness?” In some cases it is obvious: you shouldn’t make use of algorithms which explicitly take into account gender, race or disability. But it’s easy to bring these parameters in inadvertently – postcode, for example, is correlated with race, and part-time working with gender or disability.

As a middle aged, middle class white man with a reasonably well-paid job, living in a nice part of town I am least likely to find myself on the wrong end of an algorithm and ironically the most likely to be writing such algorithms.

I found the book very thought-provoking; it will certainly lead me to ask whether the algorithms and data that I am generating are fair, and what the cost of any unfairness is.

Book review: Elasticsearch–The Definitive Guide by Clinton Gormley & Zachary Tong

Back to technology with this blog post and a review of Elasticsearch – The Definitive Guide by Clinton Gormley and Zachary Tong. The book is available for free online (here), and is probably more up to date there; that said, Elasticsearch seems to be quite stable now. I have a dead tree copy because I’m old-fashioned.

Elasticsearch is a full-text search engine based on the Apache Lucene project. I was first made aware of it when I was working at ScraperWiki, where we used it for a proof-of-concept system for analysing legislation from many countries (I wasn’t involved hands-on with this work). Recently, I used it to make a little auto-completion web form for company names using the Companies House dataset. From download to implementing a solution which was 1000 times faster than a naive SQL querying system took less than a day – the default configuration and setup are that good!

You can treat Elasticsearch like a SQL database to a fair degree: what it refers to as indexes are what would be separate databases on a SQL server. Elasticsearch has document types instead of tables, and what would be rows in a SQL database are called “documents”. There are no joins as such in Elasticsearch, but there are a number of workarounds, such as parent-child relationships, nested objects or plain old denormalisation. I suspect one needs to be a bit cautious of treating Elasticsearch as a funny-looking SQL database.

The preferred way to interact with Elasticsearch is via its HTTP API, which means that once it is installed you can prod away at your Elasticsearch database using curl from the command line, or the Sense plugin for Google Chrome. The book is liberally scattered with examples written as HTTP requests, and online these can be launched from the browser (given a bit of configuration). To my mind the only downside of this is that queries are written in JSON, which introduces a lot of extraneous brackets and quoting. For my experiments I moved quickly to using the Python interface, which seems well-supported and complete (as do the other language bindings).
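As a minimal sketch of the style of interaction – here in Scala using the JDK’s built-in HTTP client, against a hypothetical local index called companies; the query is my own illustration, not one of the book’s examples:

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}

    // A match query, as JSON – the extraneous brackets and quoting in question.
    val query = """{ "query": { "match": { "name": "acme" } } }"""

    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://localhost:9200/companies/_search"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(query))
      .build()

    // The response is JSON, with hits ranked by relevance score.
    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body())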

Elasticsearch: The Definitive Guide is divided into seven sections: Getting Started; Search in Depth; Dealing with Human Language; Aggregations; Geolocation; Modelling Your Data; and, finally, Administration, Monitoring and Deployment.

The Getting Started section of the book covers everything you need to get going, but no single topic in any depth; the subsequent sections are largely about filling in that detail. The query language is completely different from SQL, and queries come back with results ranked by a relevance score. I suspect this is where I’ll find myself working a lot in future: currently my queries give me a set of results which I then filter in Python. I suspect I could write better queries, which would return relevance scores that matched my application (and that I could trust). As it stands my queries always return *something*, which may or may not be what I want.

I found the material regarding analyzers (which are applied to searchable fields and, symmetrically, to search terms) very interesting, and applicable to wider search problems where Elasticsearch is not necessarily the technology to be used. There is an overlap here with natural language processing, in the sense that analyzers can include tokenizers, stemmers and synonym lookups, which are all part of the NLP domain. This is expanded on further in the “Dealing with Human Language” section.
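Elasticsearch exposes this pipeline directly through its _analyze endpoint, so a minimal sketch of what an analyzer does (reusing the hypothetical local node from the earlier snippet, and the built-in english analyzer) looks like this:

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}

    // Ask the english analyzer to tokenize, lowercase, stem and strip stopwords.
    val body = """{ "analyzer": "english", "text": "The quick foxes were jumping" }"""

    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://localhost:9200/_analyze"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()

    // The response lists the tokens produced: here "quick", "fox" and "jump".
    println(HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString()).body())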

The section on aggregations explains Elasticsearch’s “group by”-like functionality, and that on geolocation touches on spatial extension-like behaviour. Elasticsearch handles geohashes which are a relatively recent innovation in encoding spatial coordinates.

The book mentions very briefly the ELK stack, which is Elasticsearch, Logstash and Kibana (all available from the Elastic website). This is used to analyse log files: Logstash funnels the log data into Elasticsearch, where it is visualised using Kibana. I tried out Kibana briefly; it’s an easy-to-use visualisation frontend.

Elasticsearch is a Big Data technology from the start, which means it supports sharding, replication and distribution over nodes out of the box, but it runs fine on a simple single node such as my laptop.

Elasticsearch is a pretty big book, but the individual chapters are pretty short and to the point. As I’d expect from O’Reilly, Elasticsearch is well-edited and readable. I found it great for working out what all the parts of Elasticsearch are, and I now know what exists when it comes to solving live problems. The book is pretty good at telling you which things you can do, and which things you should do.