This review was first published at ScraperWiki.
Marginalia are an insight into the mind of another reader. This struck me as a I read Data Science for Business by Foster Provost and Tom Fawcett. The copy of the book had previously been read by two of my colleagues. One of whom had clearly read the introductory and concluding chapters but not the bit in between. Also they would probably not be described as a capitalist, “red in tooth and claw”! My marginalia have generally been hidden since I have an almost religious aversion to defacing a book in any way. I do use Evernote to take notes as I go though, so for this review I’ll reveal them here.
Data Science for Business is the book I wasn’t going to read since I’ve already read Machine Learning in Action, Data Mining: Practical Machine Learning Tools and Techniques, and Mining the Social Web. However, I gave in to peer pressure. The pitch for the book is that it is for people who will manage data scientists rather than necessarily be data scientists themselves. The implication here is that you’re paying these data scientists to increase your profits, so you better make sure that’s what they’ll do. You need to be able to understand what data science can and cannot do, ask reasonable questions of data scientists of their models and understand the environment the data scientist needs to thrive.
The book covers several key algorithms: decision trees, support vector machines, logistic regression, k-Nearest Neighbours and term frequency-inverse document frequency (TF-IDF) but not in any great depth of implementation. To my mind it is surprisingly mathematical in places, given the intended audience of managers rather than scientists.
The strengths of the book are in the explanations of the algorithms in visual terms, and in its focus on the expected value framework for evaluating data mining models. Diversity of explanation is always a good thing; read enough different explanations and one will speak directly to you. It also spends more of its time discussing practical applications than other books on data mining. An example on “churn” runs through the book. “Churn” is the loss of customers at the end of a contract, in this case the telecom industry is used as an illustration.
A couple of nuggets I picked up:
- You can think of different machine learning algorithms in terms of the decision boundary they produce and how that looks. Overfitting becomes a decision boundary which is disturbingly intricate. Support vector machines put the decision boundary as far away from the classes they separate as possible;
- You need to make sure that the attributes that you use to build your model will be available at the point of use. That’s to say there is no point in building a model for churn which needs an attribute from a customer which is only available just after they’ve left you. Sounds a bit obvious but I can easily see myself making this mistake;
- The expected value framework for evaluating models. This combines the probability of an event, i.e. the result of a promotion campaign with the value of the outcome. Again churn makes a useful demonstration. If you have the choice between a promotion which is successful with 10 users with an average spend of £10 per year or 1 user with an average spend of £200 then you should obviously go with the latter rather than the former. This reminds me of expectation values in quantum mechanics and in statistical physics.
The title of the book, and the related reading demonstrate that data science, machine learning and data mining are used synonymously. I had a quick look at the popularity of these terms over the last few years. You can see the results in the Google Ngram viewer here. Somewhat to my surprise data science still lags far behind other terms despite the recent buzz, this is perhaps because Google only expose data to 2008.
Which book should you read?
All of them!
If you must buy only one then make it Data Mining, it is encyclopaedic and covers high level business overview, toy implementation and detailed implementation in some depth. If you want to see the code, then get Machine Learning in Action – but be aware that ultimately you are most likely going to be using someone else’s implementation of the core machine learning algorithms. Mining the Social Web is excellent if you want to see the code and are particularly interested in social media. And read Data Science for Business if you are the intended managerial audience or one who will be doing data mining in a commercial environment.