This next book is rather work-oriented: Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection by Bart Baesens, Veronique van Vlasselaer and Wouter Verbeke.
Fraud Analytics starts with an introductory chapter on the scale of the fraud problem, with some examples of types of fraud. It also provides an overview of the chapters to come. In the UK, fraud losses stand at about £73 billion per annum; typically, fraud losses are anything up to 5%. There are many types of fraud: credit card fraud, insurance fraud, healthcare fraud, click fraud, identity theft and so forth.
There then follows a chapter on data preparation, sampling and preprocessing. This includes some domain-related elements, such as the importance of the so-called RFM attributes (Recency, Frequency and Monetary), which are the core variables for financial transactions. Also covered are missing values and data quality, which are more general issues in statistics.
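As a rough illustration of what RFM attributes look like in practice, here is a minimal sketch in Python. It is not taken from the book: the transaction tuples, the `rfm` function and the choice of total spend (rather than, say, average spend) for the Monetary value are all assumptions made for the example.

```python
from datetime import date

# Hypothetical transactions: (customer_id, transaction_date, amount)
transactions = [
    ("alice", date(2016, 1, 5), 20.0),
    ("alice", date(2016, 3, 1), 35.0),
    ("bob", date(2016, 2, 10), 500.0),
]

def rfm(transactions, as_of):
    """Return {customer: (recency_days, frequency, monetary)}.

    Recency is days since the last transaction, frequency is the number
    of transactions, and monetary is (by assumption) the total spend.
    """
    summary = {}
    for customer, when, amount in transactions:
        last, count, total = summary.get(customer, (when, 0, 0.0))
        summary[customer] = (max(last, when), count + 1, total + amount)
    return {
        customer: ((as_of - last).days, count, total)
        for customer, (last, count, total) in summary.items()
    }

features = rfm(transactions, as_of=date(2016, 3, 31))
```

Each customer ends up with three numbers that can be fed into the outlier detection or classification methods described later in the book.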
The core of the book is three long chapters on descriptive statistics, predictive analysis and social networks.
Descriptive statistics concerns classical statistical techniques such as the detection of outliers using the z-score (the number of standard deviations a value lies from the mean), through to clustering techniques such as k-means and its relatives. These clustering techniques fall into the category of unsupervised machine learning. The idea here is that fraudulent transactions are different from non-fraudulent ones. This may be a temporal separation (i.e. a change in customer behaviour may indicate that their account has been compromised and used nefariously), or it might be a snapshot across a population in which fraudulent actors behave differently from non-fraudulent ones. Clustering techniques and outlier detection seek to identify these “different” transactions, usually for further investigation. That is to say, automated methods are used as a support for human investigators, not a replacement, which means that ranking transactions for potential fraud is key. Obviously fraudsters are continually adapting their behaviour to avoid standing out, and so fraud analytics is an arms race.
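To make the z-score idea concrete, here is a minimal sketch (my own, not the book's) that scores a handful of made-up transaction amounts and ranks them for investigation. The `z_scores` helper and the sample amounts are hypothetical, and a real system would score many attributes rather than a single one.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Number of standard deviations each value lies from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # all values identical: no outliers
    return [(v - mu) / sigma for v in values]

# Hypothetical transaction amounts; the last one is unusually large.
amounts = [20.0, 25.0, 22.0, 19.0, 24.0, 400.0]
scores = z_scores(amounts)

# Rank transaction indices by how unusual they are, most unusual first,
# so that an investigator can work down the list.
ranked = sorted(range(len(amounts)), key=lambda i: abs(scores[i]), reverse=True)
```

Ranking rather than applying a fixed cut-off fits the point above: the output is a prioritised list for a human investigator, not a verdict.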
Predictive analysis is more along the lines of regression, classification and machine learning. The idea here is to develop rules for detecting fraud from training sets containing example transactions which are known to be fraudulent or non-fraudulent. Whilst not providing an in-depth implementation guide, Fraud Analytics gives a very good survey of the area. It discusses different machine learning algorithms, including their strengths and weaknesses, particularly with regard to model “understandability”. Also covered are a wide range of model evaluation methods and the importance of an appropriate training set. A particular issue here is that fraud is relatively uncommon, so care needs to be taken in sampling training sets such that algorithms have a chance to identify fraud. These are perennial issues in machine learning and it is good to see them summarised here.
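One common way of handling the rarity of fraud in a training set is to undersample the non-fraudulent cases. The sketch below is my own illustration of that general idea rather than the book's specific recipe; the `undersample` function, its `ratio` parameter and the synthetic rows are all hypothetical.

```python
import random

def undersample(rows, label_of, ratio=1.0, seed=0):
    """Keep every fraud row plus a random subset of non-fraud rows.

    ratio is the number of non-fraud rows kept per fraud row.
    """
    fraud = [r for r in rows if label_of(r)]
    legit = [r for r in rows if not label_of(r)]
    keep = min(len(legit), int(len(fraud) * ratio))
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return fraud + rng.sample(legit, keep)

# Synthetic data: 1% of 1,000 transactions are labelled as fraud.
rows = [{"id": i, "fraud": i % 100 == 0} for i in range(1000)]
balanced = undersample(rows, lambda r: r["fraud"], ratio=2.0)
```

Resampling of this kind is normally applied only to the training data; evaluation should still be carried out on data with the original, unbalanced mix of cases.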
The chapter on social networks clearly presents an active area of research in fraud analytics. It is worth highlighting here that the term “social” is meant very broadly; it is only marginally about social networks like Twitter and Facebook. It is much more about networks of entities such as the claimant, the loss adjustor, the law enforcement official and the garage carrying out repairs. Also relevant are networks of companies and their directors, set up to commit corporate frauds. Network (aka graph) theory is the appropriate, efficient way to handle such systems. In this chapter, network analytic ideas such as “betweenness” and “centrality” are combined with machine learning involving non-network features.
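As a small illustration of a network feature, the sketch below computes degree centrality, the simplest centrality measure, for a made-up network of insurance claims and the parties attached to them. This is my own example: the edge list and the `degree_centrality` function are hypothetical, and betweenness centrality would need a more involved shortest-path calculation.

```python
from collections import defaultdict

# Hypothetical links between claims and the entities involved in them.
edges = [
    ("claim_1", "garage_A"),
    ("claim_2", "garage_A"),
    ("claim_3", "garage_A"),
    ("claim_3", "adjustor_X"),
    ("claim_4", "garage_B"),
]

def degree_centrality(edges):
    """Fraction of the other nodes that each node is directly linked to."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}

centrality = degree_centrality(edges)
most_connected = max(centrality, key=centrality.get)
```

A score like this can then sit alongside ordinary, non-network attributes as one more input column to a classifier.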
The book finishes with chapters on fraud analytics in operation, and a wider view. How do you use these models in production? When do you update them? How do you update them? The wider view includes some discussion of data anonymisation prior to handing data over to data scientists. This is an important area: data protection regulations across the EU are tightening up, and breaches of personal data can have serious consequences for the companies involved. Anonymisation may also provide some protection against producing biased models, i.e. those that discriminate unfairly against people on the basis of race, gender and economic circumstances, although this area should attract more active concern.
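One simple step in this direction is to replace direct identifiers with keyed hashes before data is handed over. The sketch below is my own illustration, not the book's method. Strictly speaking it is pseudonymisation rather than full anonymisation, since the remaining attributes may still identify someone. The `pseudonymise` function, the key and the record are hypothetical.

```python
import hashlib
import hmac

def pseudonymise(identifier, secret_key):
    """Replace an identifier with a keyed SHA-256 hash (HMAC).

    The same input always maps to the same token, so records can still
    be linked, but the original value is not recoverable without the key.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()

# Assumption: the key is held by the data owner, not by the analysts.
key = b"replace-with-a-secret-held-by-the-data-owner"
record = {"customer": "alice@example.com", "amount": 42.0}
safe_record = {**record, "customer": pseudonymise(record["customer"], key)}
```

Because the mapping is consistent, RFM-style attributes and network links can still be built on the pseudonymised identifiers.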
A topic not covered but mentioned a couple of times is natural language processing, for example analysing the text of claims against insurance policies.
It is best to think of this book as a guide to various topics in statistics and data science as applied to the analysis of fraud. The coverage is more in the line of an overview than an in-depth implementation guide. It is pitched at the level of the practitioner rather than the non-expert manager. Aside from some comments at the end on label-based security access control (relating to SQL) and some screenshots from SAS products, it is technology agnostic.
Occasionally the English in this book slips from being fully idiomatic. It is still fully comprehensible; it simply reads a little oddly. Not a fun read, but an essential starter if you’re interested in fraud and data science.