Tag: data science

Book review: JavaScript Patterns by Stoyan Stefanov

More technology-related reviewing next: JavaScript Patterns by Stoyan Stefanov. This is part of my continuing effort to learn JavaScript.

For me this isn’t a question of learning the nuts and bolts of a language but rather one of learning to use it fluently and idiomatically.

I thought this book might be in the spirit of the original “Gang of Four” design patterns, but although it mentions those patterns it is more generally about good style in JavaScript. The book is divided into eight chapters, including an introduction.

The first substantive chapter, on “essentials”, talks mainly about variable declarations and some odds and ends. The most interesting of these was the behaviour of parseInt, which converts a string into an integer: if the string starts with a zero, as ISO 8601 days and months would, then parseInt assumes it is a number in base 8 (octal)! I can foresee many long hours trying to debug this problem without this forewarning. This chapter also discusses the importance of coding style conventions.

The second chapter talks about literals and constructors. It strikes me that much of this is about unwinding the habits of developers more used to statically-typed languages. The JavaScript way is to create objects by example, rather than to write a class definition and derive objects from that, although in the permissive manner of many languages it will let you do it either way. Since this book was written JavaScript has gained a “class” keyword which allows you to construct classes as you might in Java or C#.

Next up are functions. JavaScript shares Python’s view of functions as objects, allowing them to be passed as arguments. This is particularly important in JavaScript for providing “callback” functionality, which is very useful when doing asynchronous programming. I learn here that the “currying” of functions is named after Haskell Curry, who also has a whole language named for him. I always feel when passing functions as arguments that I am fiddling with the underpinnings of reality – it can make debugging difficult too.
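
For illustration, here is a Python sketch of my own (not code from the book, but taking advantage of the parallel with Python) of passing a function as a callback and partially applying it in the spirit of currying:

from functools import partial

def on_done(result):
    # an ordinary function object, passed around like any other value
    print("finished with", result)

def run_job(data, callback):
    # "callback" style: call whatever function we were handed
    callback(sum(data))

run_job([1, 2, 3], on_done)

def add(a, b):
    return a + b

add_five = partial(add, 5)  # partial application, in the spirit of currying
print(add_five(10))         # prints 15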

I found the idea of functions that redefine themselves on first run interesting; it sounds useful and dangerous at the same time.

The chapter on object creation patterns is all about introducing module-like behaviour and namespacing to JavaScript, which at the time the book was written were not part of the language. Also covered is making properties private by hiding them in function closures.

The code reuse chapter is largely about patterns for achieving inheritance-like behaviour. It introduces a range of patterns which build up to an almost exact replication of class-based inheritance.

Finally we meet some of the classic Gang of Four design patterns. Some of these, such as the Iterator pattern, have been absorbed entirely into the core of languages like Python and, more recently, JavaScript; the Observer pattern is implemented in web browsers as events, which are ubiquitous. Perhaps the lesson of this chapter is that we now use several of the Gang of Four patterns almost without thinking. The Strategy pattern, which selects algorithms at runtime, fits well with the chapter on functions and JavaScript’s view of functions as objects.

The book finishes with a chapter on patterns for the Document Object Model, or rather JavaScript in the browser. It includes well-known advice such as not testing for browser type but rather testing for functionality. It also has advice on optimising JavaScript for deployment.

There is minimal mention of specific tools or libraries in this regard, although Yahoo’s YUI library is mentioned a few times – Stefanov has worked on this library so this is unsurprising, and not unreasonable.

This book had more of the air of Douglas Crockford’s JavaScript: The Good Parts than the book on patterns I was expecting. Alternatively, perhaps it is “JavaScript for users of statically-typed languages”; as such it probably works pretty well for Python programmers too, although modules have always been built into Python and there is a “class” keyword for specifying classes.

JavaScript Patterns is readable though; I’m glad I picked it up.

Book review: Data Strategy by Bernard Marr

This is a review of Data Strategy by Bernard Marr. The proposition of the book is that all businesses are now data businesses and that they should have a strategy to exploit data. He envisages such a strategy operating through a Chief Data Officer and thus at the highest level of a company.

It is in the nature of things that to be successful you feel you have to be saying something new and interesting. The hook for this book is that big data, or the increasing availability of data, is a new and revolutionary thing. To be honest I don’t really buy this, but once we’re past the hook the advice contained within is rather good.

Marr sees data benefitting businesses in three ways, and covers these in successive chapters:

  1. It can support business decisions – that’s to say helping humans make decisions;
  2. It can support business operations – this is more the automated use of data, for example, a recommender algorithm you might come across at any number of retail sites is driven by data and falls into this category;
  3. It can be an asset in its own right.

This first benefit of supporting business decisions is further sub-divided into data about the following:

  1. Customers, markets and competition
  2. Finance
  3. Internal operations
  4. People

The chapter on supporting business operations contains quite a lot of material on using sensors in manufacturing and warehouse operations but also includes fraud detection.

Subsequent chapters cover how to source and collect data, how to provide the human and physical infrastructure needed to draw meaning from it, and some comments on data governance. In Europe this last topic has been the subject of enormous activity over the past couple of years with the introduction of the General Data Protection Regulation (GDPR), which determines the way in which personal information can be collected and processed.

Following the theme of big data, Marr’s view is that the past is represented by data in SQL tables whilst the future is in unstructured data sources.

My background is as a physical scientist, and as such I read this with a somewhat quizzical “You’re not doing this already?” face. Pretty much the whole point of being a physical scientist is to collect data to better understand the world. The physical sciences have never really had a big data moment; typically we have collected and analysed data to the limit of the currently available technology, but that has never been the thing itself. Philosophically, the physical sciences gave up on collecting “all of the data” long ago. One of the unappreciated features of the big detectors at CERN is their ability to throw away enormous quantities of data really fast. If you have what is effectively a building-sized CCD camera then that is the only strategy that works. This isn’t to say the physical sciences always do it right, or that they are directly relevant to businesses. The physical sciences work on the basis that there are universally applicable, immutable physical laws which data is used to establish. This is not true of businesses: what works for one business need not work for another, and what works now need not work in the future.

Reading the book I kept thinking of A Computer Called LEO by Georgina Ferry, which describes the computer built by the J. Lyons company (which ran a teashop and catering business) in the 1950s. Lyons had been doing large-scale data work since the 1920s; in the aftermath of the Second World War they turned to automated, electronic computation. From my review I see that Charles Babbage wrote about the subject in 1832, although he was writing more about prospects for the future. IBM started its growth in computing machinery in the late 19th century. So the idea of data being core to a business is by no means new.

The text is littered with examples of data collection for business good across a wide range of sectors. Rolls-Royce’s engine monitoring programme is one of my favourites: their engines send data back to Rolls-Royce four times during each flight. This can be used to support engine servicing and, I would imagine, product development. In the category of monetizing data, American Express and Acxiom are mentioned; they provide either personal or aggregate demographic information which can be used for targeted marketing.

Some entries might be a bit surprising: restaurant chains in the form of Domino’s and Dickey’s Barbecue Pit are big users of data, and Walmart also makes an appearance. This shouldn’t be surprising, since the importance of data is a matter more of business scale than sector, as the Lyons company shows.

Marr repeatedly tells us that we should collect the data which answers our questions rather than just trying to collect all the data. I don’t think this can be repeated too often! It seems that many businesses have been sold (or have built) Big Data infrastructure and only really started to think about how they would extract business value from the data collected once this had been done.

Definitely thought-provoking, and a well-structured guide as to how data can benefit your company.

Book review: Hands-on machine learning with scikit-learn & tensorflow by Aurélien Géron

I’ve recently started playing around with recurrent neural networks and tensorflow, which brought me to Hands-on machine learning with scikit-learn & tensorflow by Aurélien Géron; as a bonus it also includes material on scikit-learn, which I’ve been using for a while.

The book is divided into two parts. The first, “Fundamentals of Machine Learning”, focuses on the functionality found in the scikit-learn library. It starts with a big picture, running through the types of machine learning which exist (supervised / unsupervised, batch / online and instance-based / model-based) and then some of the pitfalls and problems with machine learning, before a section on testing and validation. Next comes a medium-sized example of machine learning in action, which demonstrates how the functionality of scikit-learn can be quickly used to develop predictions of house prices in California based on census data. This is a subject after my own heart; I’ve been working with property data in the UK for the past couple of years.

This example serves two purposes: firstly it demonstrates the practical steps you need to take when undertaking a machine learning exercise, and secondly it highlights just how concisely much of it can be executed in scikit-learn. The following chapters then go into more depth, first about how models are trained and scored, and then into the details of different algorithms such as Support Vector Machines and Decision Trees. This part finishes with a chapter on ensemble methods.

Although the chapters contain some maths, their strength is in the clear explanations of the methods described. I particularly liked the chapter on ensemble methods. They also demonstrate how consistent the scikit-learn library is in its interfaces. I knew that I could switch algorithms very easily with scikit-learn, but I hadn’t fully appreciated how seamlessly the library generally handles regression and multi-class classification.
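
As a rough illustration of that consistency, here is a sketch of my own (not code from the book, although the California housing data is the same dataset the book’s example uses); swapping one regressor for another changes nothing about the surrounding code:

# A minimal sketch of scikit-learn's consistent estimator interface; the
# algorithms and train/test split here are illustrative, not the book's code.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Downloads the California housing data on first use.
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Swapping algorithms only means swapping the estimator object;
# fit, predict and score stay the same.
for model in (DecisionTreeRegressor(), RandomForestRegressor(n_estimators=50)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))  # R^2 on held-out data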

I wonder whether, outside data science, it is perceived that data scientists write their own algorithms from scratch. In practice this is not the case, and hasn’t been the case since at least the early nineties, when I started doing data analysis that looks very similar to the machine-learning-based analysis I do today. In those days we used the NAG numerical library, Numerical Recipes in FORTRAN and libraries developed by a very limited number of colleagues in the wider academic community (probably shared by email attachment).

The second part of the book, “Neural networks and Deep Learning”, looks at the tensorflow library. Tensorflow has general applications for processing multi-dimensional arrays but it has been designed with deep learning and neural networks in mind. This means there are a whole bunch of functions to generate and train neural networks of different types and different architectures.

The section starts with an overview of tensorflow, with some references to other deep learning libraries, before providing an introduction to neural networks in general, which have been around for quite a while now. Following this there is a section on training deep learning networks, and the importance of the form of activation functions.

Tensorflow will run across multiple processors, GPUs and/or servers although configuring this looks a bit complicated. Typically a neural network layer can’t be distributed over multiple processing units.

There then follow chapters on convolutional neural networks (good for image applications), recurrent neural networks (good for sequence data), autoencoders (finding compact representations) and finally reinforcement learning (good for playing Pac-Man). My current interest is in recurrent neural networks, so it was nice to see a brief description of all of the potential input/output scenarios for recurrent neural networks and how to build them.
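
As an illustration of the sequence-to-one case, here is a minimal sketch of a recurrent model; it uses the present-day tf.keras API rather than the lower-level graph construction the book works through, and the toy task and layer sizes are my own:

# A minimal sketch of a sequence-to-one recurrent model using tf.keras;
# the toy task (predict the sum of a sequence) and all sizes are illustrative.
import numpy as np
import tensorflow as tf

X = np.random.rand(1000, 20, 1).astype("float32")  # 1000 sequences, 20 time steps each
y = X.sum(axis=1)                                   # one target value per sequence

model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32, input_shape=(20, 1)),  # recurrent layer over time steps
    tf.keras.layers.Dense(1),                            # single output: sequence-to-one
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)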

I spent a few years doing conventional image analysis, and convolutional neural networks feel quite similar to the convolution filters I used then, although they stack more layers (or filters) than are normally used in conventional image analysis. Furthermore, in conventional image analysis the kernels are typically handcrafted to perform certain tasks (such as detecting horizontal or vertical edges), whilst neural networks learn their kernels in training. In conventional image analysis convolution is often done in Fourier space, since it is more efficient, and I see there are experiments along these lines with convolutional neural networks.
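
To make the contrast concrete, here is a small sketch of my own (not from the book): a handcrafted Sobel kernel applied with scipy next to a convolutional layer whose kernels start random and are learned in training:

# A sketch contrasting a handcrafted kernel (conventional image analysis)
# with a convolutional layer whose kernels are learned; the kernel, image
# and layer sizes are illustrative only.
import numpy as np
import tensorflow as tf
from scipy.signal import convolve2d

image = np.random.rand(28, 28).astype("float32")

# Conventional approach: a fixed, handcrafted vertical-edge (Sobel) kernel.
sobel_vertical = np.array([[-1, 0, 1],
                           [-2, 0, 2],
                           [-1, 0, 1]], dtype="float32")
edges = convolve2d(image, sobel_vertical, mode="same")

# CNN approach: 32 kernels whose weights start random and are learned in training.
conv_layer = tf.keras.layers.Conv2D(32, kernel_size=3, padding="same")
features = conv_layer(image.reshape(1, 28, 28, 1))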

Developing and training neural networks has the air of an experimental science rather than a theoretical one. That’s to say that rather than thinking hard and coming up with an effective neural network and training scheme, one needs to tinker with different designs and training methods and see how well they work. It has more the air of training an animal than programming a computer. There are a number of standard training / test sets of images, and successful models trained against these by other people can be downloaded; such models can be used as-is, or just parts of them can be reused.

This section has many references to the original literature for the development of deep learning, highlighting how recent this new iteration of neural networks is.

Overall an excellent book, scikit-learn and tensorflow are the go-to libraries for Python programmers wanting to do machine learning and deep learning respectively. This book describes their use eloquently, with references to original literature where appropriate whilst providing a good overview of both topics. The code used in the book can be found on github, as a set of Jupyter Notebooks.

Book review: Fraud analytics by B. Baesens, V. Van Vlasselaer and W. Verbeke

This next book is rather work-oriented: Fraud Analytics using descriptive, predictive and social network techniques: A guide to data science for fraud detection by Bart Baesens, Veronique van Vlasselaer and Wouter Verbeke.

Fraud Analytics starts with an introductory chapter on the scale of the fraud problem and some examples of types of fraud; it also provides an overview of the chapters to come. In the UK fraud losses stand at about £73 billion per annum, and fraud losses typically run to anything up to 5%. There are many types of fraud: credit card fraud, insurance fraud, healthcare fraud, click fraud, identity theft and so forth.

There then follows a chapter on data preparation, sampling and preprocessing. This includes some domain-related elements, such as the importance of the so-called RFM attributes: Recency, Frequency and Monetary value, which are the core variables for financial transactions. Also covered are missing values and data quality, which are more general issues in statistics.
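
For illustration only (the column names, toy data and pandas approach are mine, not the book’s), deriving RFM attributes from a transaction-level table might look something like:

# Illustrative sketch: deriving RFM attributes from a transaction-level table.
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "date": pd.to_datetime(["2018-01-03", "2018-02-11", "2018-01-20",
                            "2018-02-01", "2018-02-28"]),
    "amount": [120.0, 35.5, 10.0, 220.0, 15.0],
})

snapshot = transactions["date"].max()
rfm = transactions.groupby("customer_id").agg(
    recency=("date", lambda d: (snapshot - d.max()).days),  # days since last transaction
    frequency=("date", "count"),                            # number of transactions
    monetary=("amount", "sum"),                             # total spend
)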

The core of the book is three long chapters on descriptive statistics, predictive analysis and social networks.

Descriptive statistics covers classical statistical techniques, from the detection of outliers using the z-score (the deviation from the mean, normalised by the standard deviation) through to clustering techniques such as k-means and its relatives. These clustering techniques fall into the category of unsupervised machine learning. The idea here is that fraudulent transactions are different to non-fraudulent ones; this may be a temporal separation (i.e. a change in customer behaviour may indicate that an account has been compromised and used nefariously) or it might be a snapshot across a population where fraudulent actors behave differently to non-fraudulent ones. Clustering techniques and outlier detection seek to identify these “different” transactions, usually for further investigation – that’s to say automated methods are used as a support for human investigators, not a replacement. This means that ranking transactions for potential fraud is key. Obviously fraudsters are continually adapting their behaviour to avoid standing out, and so fraud analytics is an arms race.
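
A toy sketch of the two unsupervised ideas mentioned above, with an illustrative threshold and cluster count of my own choosing rather than anything recommended by the book:

# Toy sketch: z-score outlier flagging and k-means clustering on transaction
# amounts; the data, threshold and number of clusters are illustrative.
import numpy as np
from sklearn.cluster import KMeans

amounts = np.array([12.0, 15.5, 11.2, 14.8, 13.1, 950.0])

# z-score: how many standard deviations each value sits from the mean.
z = (amounts - amounts.mean()) / amounts.std()
outliers = amounts[np.abs(z) > 2]  # threshold is a tuning choice, often 3 in practice

# k-means: cluster transactions; small or distant clusters are candidates for review.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(amounts.reshape(-1, 1))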

Predictive analysis is more along the lines of regression, classification and machine learning. The idea here is to develop rules for detecting fraud from training sets containing example transactions which are known to be fraudulent or non-fraudulent. Whilst not providing an in-depth implementation guide, Fraud Analytics gives a very good survey of the area. It discusses different machine learning algorithms, including their strengths and weaknesses, particularly with regard to model “understandability”. Also covered are a wide range of model evaluation methods, and the importance of an appropriate training set. A particular issue here is that fraud is relatively uncommon, so care needs to be taken in sampling training sets such that algorithms have a chance to identify fraud. These are perennial issues in machine learning and it is good to see them summarised here.
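
By way of illustration (a sketch of my own, not the book’s code), two common responses to this imbalance are stratified splitting, so the rare fraud class appears in both training and test sets, and class weighting, so a classifier cannot score well by simply predicting “not fraud”:

# Illustrative sketch of handling class imbalance with synthetic data
# standing in for real transactions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, weights=[0.99, 0.01], random_state=0)  # ~1% "fraud"

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000)  # upweight the rare class
clf.fit(X_train, y_train)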

The chapter on social networks clearly presents an active area of research in fraud analytics. It is worth highlighting that the term “social” is meant very broadly here; it is only marginally about social networks like Twitter and Facebook. It is much more about networks of entities such as the claimant, the loss adjustor, the law enforcement official and the garage carrying out repairs. Also relevant are networks of companies, and their directors, set up to commit corporate frauds. Network (aka graph) theory is the appropriate, efficient way to handle such systems. In this chapter, network analytic ideas such as “betweenness” and “centrality” are combined with machine learning involving non-network features.
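
A small sketch of my own (the entities are invented) showing how such a network might be built and scored with networkx; a garage shared by several otherwise unrelated claims stands out immediately:

# Sketch with invented entities: a small claims network in networkx, with
# centrality measures that could feed a fraud model as features.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("claim_1", "claimant_A"), ("claim_1", "garage_X"),
    ("claim_2", "claimant_B"), ("claim_2", "garage_X"),
    ("claim_3", "claimant_C"), ("claim_3", "garage_X"),
])

betweenness = nx.betweenness_centrality(G)  # nodes sitting on many shortest paths
degree = dict(G.degree())                   # garage_X is shared by every claim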

The book finishes with chapters on fraud analytics in operation, and a wider view. How do you use these models in production? When do you update them? How do you update them? The wider view includes some discussion of anonymising data prior to handing it over to data scientists. This is an important area: data protection regulations across the EU are tightening up, and breaches of personal data can have serious consequences for the companies involved. Anonymisation may also provide some protection against producing biased models, i.e. those that discriminate unfairly against people on the basis of race, gender or economic circumstances, although this area should attract more active concern.

A topic not covered but mentioned a couple of times is natural language processing, for example analysing the text of claims against insurance policies.

It is best to think of this book as a guide to various topics in statistics and data science as applied to the analysis of fraud. The coverage is more in the line of an overview rather than an in-depth implementation guide. It is pitched at the level of the practitioner rather than the non-expert manager. Aside from some comments at the end on label-based security access control (relating to SQL) and some screenshots from SAS products, it is technology-agnostic.

Occasionally the English in this book slips from being fully idiomatic, but it is still fully comprehensible – it simply reads a little oddly. Not a fun read, but an essential starter if you’re interested in fraud and data science.

Scala – installation behind a workplace web proxy

I’ve been learning Scala as part of my continuing professional development. Scala is a functional language which runs primarily on the Java Runtime Environment. It is a first-class citizen for working with Apache Spark – an important platform for data science. My intention in learning Scala is to get myself thinking in a more functional programming style and to gain easy access to Java-based libraries and ecosystems; typically I program in Python.

In this post I describe how to get Scala installed and functioning on a workplace laptop, along with its dependency manager, sbt. The core issue here is that my laptop at work puts me behind a web proxy so that sbt does not Just Work™. I figure this is a common problem so I thought I’d write my experience down for the benefit of others, including my future self.

The test system in this case was a relatively recent (circa 2015) Windows 7 laptop. I like using bash as my shell on Windows rather than the Windows Command Prompt – I install this using the Git for Windows SDK.

Scala can be installed from the Scala website https://www.scala-lang.org/download/. For our purposes we will use the Windows binaries, since the sbt build tool requires additional configuration to work. Scala needs the Java JDK version 1.8 to install, and the JAVA_HOME environment variable needs to point to the appropriate place. On my laptop this is:

JAVA_HOME=C:\Program Files (x86)\Java\jdk1.8.0_131

The Java version can be established using the command:

javac -version

My Scala version is 2.12.2, obtained using:

scala -version

sbt is the dependency manager and build tool for Scala; it is a separate install, available from:

http://www.scala-sbt.org/0.13/docs/Setup.html

It is possible the PATH environment variable will need to be updated manually to include the sbt executables (:/c/Program Files (x86)/sbt/bin).

I am a big fan of Visual Studio Code, so I installed the Scala helper for Visual Studio Code:

https://marketplace.visualstudio.com/items?itemName=dragos.scala-lsp

This requires a modification to the sbt config file which is described here:

http://ensime.org/build_tools/sbt/

Then we can write a trivial Scala program like:

object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
  }
}

And run it at the command line with:

scala first.scala

Using sbt in my workplace requires proxies to be configured. The symptom of a failure to do this is that the sbt compile command fails on first run to download the appropriate dependencies, as defined in a build.sbt file, producing a line in the log like this:

[error] Server access Error: Connection reset url=https://repo1.maven.org/maven2/net/sourceforge/htmlcleaner/htmlcleaner/2.4/htmlcleaner-2.4.pom

In my case I established the appropriate proxy configuration from the Google Chrome browser:

chrome://net-internals/#proxy

This shows a link to the pacfile, something like:

http://pac.madeupbit.com/proxy.pac?p=somecode

The PAC file can be inspected to identify the required proxy; in my case there is a statement towards the end of the pacfile which contains the URL and port required for the proxy:

if (url.substring(0, 5) == 'http:' || url.substring(0, 6) == 'https:' || url.substring(0, 3) == 'ws:' || url.substring(0, 4) == 'wss:')
{
    return 'PROXY longproxyhosturl.com:80';
}

 

The proxy host and port are added to an SBT_OPTS environment variable, which can either be set in a bash-like .profile file or using the Windows environment variable setup.

export SBT_OPTS="-Dhttps.proxyHost=longproxyhosturl.com -Dhttps.proxyPort=80 -Dhttps.proxySet=true"

As a bonus, if you want to use Java’s Maven dependency management tool you can use the same proxy settings but put them in a MAVEN_OPTS environment variable.

Typically to start a new project in Scala one uses the sbt new command with a pointer to a g8 template. In my workplace this does not work as normally described because it uses the git protocol, which is blocked by default (it runs on port 9418). The normal new command in sbt looks like:

sbt new scala/scala-seed.g8

The workaround for this is to specify the g8 repo in full including the https prefix:

sbt new https://github.com/scala/scala-seed.g8

This should initialise a new project, creating a whole bunch of standard directories.

So far I’ve completed one small project in Scala. Having worked mainly in dynamically typed languages, it was nice that, once I had properly defined my types and got my program to compile, it ran without obvious error. I was a bit surprised to find no standard CSV reading / writing library as there is for Python. My Python has become a little more functional as a result of my Scala programming; I’m now a bit more likely to map a function over a list rather than loop over the list explicitly.
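
A trivial illustration of that shift in my Python style (my own example, with an arbitrary multiplier):

# Explicit loop versus mapping a function over a list; the 1.2 factor is arbitrary.
prices = [1.0, 2.5, 4.0]

# loop style
adjusted_loop = []
for p in prices:
    adjusted_loop.append(p * 1.2)

# functional style
adjusted_map = list(map(lambda p: p * 1.2, prices))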

I’ve been developing intensively in Python over the last couple of years, and this seems to have helped me in configuring my Scala environment, in terms of getting to grips with modules/packaging, dependency managers and automated documentation building, and also in finding my test library (http://www.scalatest.org/) at an early stage.