Book review: Fraud analytics by B. Baesens, V. Van Vlasselaer and W. Verbeke

This next book is rather work-oriented: Fraud Analytics using descriptive, predictive and social network techniques: A guide to data science for fraud detection by Bart Baesens, Veronique van Vlasselaer and Wouter Verbeke.

Fraud Analytics starts with an introductory chapter on the scale of the fraud problem, and some examples of types of fraud. It also provides an overview of the chapters that are to come. In the UK fraud losses stand at about £73 billion per annum; typically an organisation’s losses to fraud amount to anything up to 5% of its revenue. There are many types of fraud: credit card fraud, insurance fraud, healthcare fraud, click fraud, identity theft and so forth.

There then follows a chapter on data preparation, sampling and preprocessing. This includes some domain-related elements such as the importance of the so-called RFM attributes – Recency, Frequency and Monetary – which are the core variables for financial transactions. Also covered are missing values and data quality, which are more general issues in statistics.

The core of the book is three long chapters on descriptive statistics, predictive analysis and social networks.

Descriptive statistics concerns classical statistical techniques, from the detection of outliers using the z-score (the number of standard deviations a value lies from the mean) through to clustering techniques such as k-means and its relatives. These clustering techniques fall into the category of unsupervised machine learning. The idea here is that fraudulent transactions are different to non-fraudulent ones. This may be a temporal separation (i.e. a change in customer behaviour may indicate that their account has been compromised and used nefariously) or it might be a snapshot across a population where fraudulent actors behave differently from non-fraudulent ones. Clustering techniques and outlier detection seek to identify these “different” transactions, usually for further investigation – that’s to say automated methods are used as a support for human investigators, not a replacement. This means that ranking transactions by how likely they are to be fraudulent is key. Obviously fraudsters are continually adapting their behaviour to avoid standing out, and so fraud analytics is an arms race.
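As a minimal sketch of the z-score idea (not code from the book – the transaction amounts are invented, and the threshold of 3 is a common rule of thumb rather than anything the authors prescribe):

object ZScoreOutliers {
  def main(args: Array[String]): Unit = {
    // Invented transaction amounts: a dozen routine payments and one anomaly
    val amounts = List(12.0, 15.0, 11.0, 14.0, 13.0, 12.5, 14.5, 11.5, 13.5, 12.0, 15.5, 10.5, 950.0)
    val mean = amounts.sum / amounts.length
    val stdDev = math.sqrt(amounts.map(a => math.pow(a - mean, 2)).sum / amounts.length)
    // Rank by |z| so the most anomalous transactions come first for human review
    val ranked = amounts.map(a => (a, (a - mean) / stdDev)).sortBy { case (_, z) => -math.abs(z) }
    ranked.foreach { case (a, z) =>
      val flag = if (math.abs(z) > 3) "  <- candidate for investigation" else ""
      println(f"amount=$a%8.2f z=$z%6.2f$flag")
    }
  }
}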

Predictive analysis is more along the lines of regression, classification and machine learning. The idea here is to develop rules for detecting fraud from training sets containing example transactions which are known to be fraudulent or not fraudulent. Whilst not providing an in-depth implementation guide, Fraud Analytics gives a very good survey of the area. It discusses different machine learning algorithms, including their strengths and weaknesses, particularly with regard to model “understandability”. Also covered are a wide range of model evaluation methods, and the importance of an appropriate training set. A particular issue here is that fraud is relatively uncommon, so care needs to be taken in sampling training sets such that algorithms have a chance to identify fraud. These are perennial issues in machine learning and it is good to see them summarised here.
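The class-imbalance point is easy to illustrate with a toy sketch (my own, not the book’s – the 1% fraud rate and the 5:1 undersampling ratio are assumptions for illustration):

object BalanceTrainingSet {
  case class Txn(amount: Double, isFraud: Boolean)

  def main(args: Array[String]): Unit = {
    val rng = new scala.util.Random(42)
    // Simulate 1,000 labelled transactions of which 1% are fraudulent
    val txns = List.tabulate(1000)(i => Txn(rng.nextDouble() * 100, isFraud = i % 100 == 0))
    val (fraud, legit) = txns.partition(_.isFraud)
    // Undersample the legitimate class to five legitimate examples per fraud,
    // so a learner sees fraud often enough to model it
    val keptLegit = rng.shuffle(legit).take(5 * fraud.length)
    val trainingSet = rng.shuffle(fraud ++ keptLegit)
    println(s"fraud=${fraud.length} legitimate kept=${keptLegit.length} training set=${trainingSet.length}")
  }
}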

The chapter on social networks clearly presents an active area of research in fraud analytics. It is worth highlighting here that the term “social” is meant very broadly; it is only marginally about social networks like Twitter and Facebook. It is much more about networks of entities such as the claimant, the loss adjuster, the law enforcement official and the garage carrying out repairs. Also relevant are networks of companies and their directors, set up to commit corporate frauds. Network (aka graph) theory is the appropriate, efficient way to handle such systems. In this chapter, network analytic ideas such as “betweenness” and “centrality” are combined with machine learning involving non-network features.
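A minimal sketch of the network idea – computing the degree (number of connections) of each entity over some invented claim relationships; entities linked to unusually many claims are candidates for a closer look:

object ClaimNetwork {
  def main(args: Array[String]): Unit = {
    // Invented edges: pairs of entities appearing on the same insurance claim
    val edges = List(
      ("claimant A", "garage X"), ("claimant B", "garage X"),
      ("claimant C", "garage X"), ("claimant A", "adjuster P"))
    // Degree centrality: count how many relationships each entity appears in
    val degree = edges
      .flatMap { case (a, b) => List(a, b) }
      .groupBy(identity)
      .map { case (entity, appearances) => entity -> appearances.size }
    degree.toList.sortBy { case (_, d) => -d }.foreach { case (entity, d) => println(s"$entity: $d") }
  }
}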

The book finishes with chapters on fraud analytics in operation, and a wider view. How do you use these models in production? When do you update them? How do you update them? The wider view includes some discussion of anonymising data prior to handing it over to data scientists. This is an important area: data protection regulations across the EU are tightening up, and breaches of personal data can have serious consequences for the companies involved. Anonymisation may also provide some protection against producing biased models, i.e. those that discriminate unfairly against people on the basis of race, gender and economic circumstances, although this area should attract more active concern.

A topic not covered but mentioned a couple of times is natural language processing, for example analysing the text of claims against insurance policies.

It is best to think of this book as a guide to various topics in statistics and data science as applied to the analysis of fraud. The coverage is more along the lines of an overview than an in-depth implementation guide. It is pitched at the level of the practitioner rather than the non-expert manager. Aside from some comments at the end on label-based security access control (relating to SQL) and some screenshots from SAS products, it is technology agnostic.

Occasionally the English in this book slips from being fully idiomatic; it is still fully comprehensible – it simply reads a little oddly. Not a fun read, but an essential starter if you’re interested in fraud and data science.

Review of the year: 2017

As I finish work for the year, and we await Christmas Day, it is time for me to start writing my “Review of the year”. This is a somewhat partial view of the world, as seen through the pages of my blog, which these days is almost entirely book reviews – you can see a list of my blog posts for the year here. My Goodreads account tells me I have read 32 books this year.

Linked to reading, I wrote a post on Women Writers – I’ve been making an effort to read more books written by women over the last couple of years. This has worked out really well for my fiction reading, where I’ve found some new sci-fi authors to enjoy, and some, like Ursula Le Guin, who have been around a while. Le Guin’s The Left Hand of Darkness is certainly in contention for my favourite novel ever. On non-fiction I’ve not had as much success – a chunk of my non-fiction reading is in technology, and the number of women published in this area is tiny. I found the acknowledgements sections of books by men a useful place to find women to follow on Twitter.

This year I read Pandora’s Breeches by Patricia Fara – about women in science from about 1600 to 1850. I also read Hidden Figures by Margot Lee Shetterly, about the African-American women who worked as “human computers” for the organisation which was to become NASA. I think this told me more about being an African-American than about being a woman. I hadn’t appreciated previously the sheer effort and determination required for African-Americans to progress; changing the laws to end legally-sanctioned discrimination was simply the first step (resisted at every turn by white supremacists).

I read some fairly academic history of science too: Inventing Temperature and Leviathan and the air-pump. Inventing Temperature is about the history of the measurement of temperature. Temperature is important to most physical scientists in one way or another, perhaps more so for the sort of physical scientist I once was. This book covers the less-told history, and re-surfaces some of the assumptions that these days are no longer taught, or certainly don’t stick in the mind. Leviathan and the air-pump is about the foundation of the experimental method as it is (roughly) seen today. I liked these two books because they didn’t follow the “great man” narrative which is what you get from reading scientific biographies – a much more common genre in the wider history of science.

I also read a few books on the history of Chester, following on from reading about Roman Chester last year. Two things struck me in this. One was the image of post-Roman Britons living in the ruins of the Roman occupation; evidence from the period immediately following the Roman occupation is scant – in Chester it amounts to a thin dark layer of material in the Roman barracks which could well be pigeon droppings! The second standout was the fact that Chester’s mint/money making operation was bigger than London’s in the 9th century. I was also interested in the “Pentice”, a curious timber structure attached to St Peter’s church by the cross in the centre of town, which appears to have been Chester’s administrative centre from the medieval period (it was demolished in the early 19th century).

In news outside the world of books, we had an election in the UK; the result was a bit of a surprise, but we can probably agree we are not in a great position now politically, with a weak government steadfastly refusing even to countenance ending the Brexit process and an official “opposition” in the Labour Party supporting them in this.

Surprise hit of the year was the ARK exhibition of sculpture at Chester Cathedral. I wouldn’t describe myself as a connoisseur of art, particularly not sculpture, but I loved this exhibition. The exhibits were scattered through the cathedral and its grounds: a life-sized ceramic horse, and three very large egg-shaped objects making a very public sign of what lay within. It turns out that sculpture works really well in an old cathedral – there are so many shapes and textures to pick up on.

On the technology front I read about Scala, and I also wrote a post about setting up my work PC to use Scala, which requires a bit of wrangling. I read about behaviour-driven testing, the potential downsides of data science from a social point of view, and game theory.

A final mention goes to Ed Yong’s “I contain multitudes”, one of the first books I read this year, which is all about the interaction between microbes and the hosts they live with – including you and me. Possibly this is my favourite book of the year, but looking down the list I don’t think there was any book I regretted reading and a fair few of them were thoroughly excellent.

No holiday post this year; we were back in Portinscale, on the outskirts of Keswick, again – notable achievement: getting Thomas (5) up several peaks, starting with Cat Bells! Embarrassment prevents me from writing much about my Pokemon Go obsession; in my defence I will say that it is educational for Thomas and encourages him to walk places!

Book review: Leviathan and the air-pump by Steven Shapin & Simon Schaffer

Leviathan and the air-pump by Steven Shapin & Simon Schaffer has been recommended to me by a number of people. The book discusses the dispute between Thomas Hobbes, author of Leviathan, published in 1651, and Robert Boyle, who published the first of his scientific works involving the air-pump in 1660. It is about the foundation of the scientific experimental method.

Leviathan and the air-pump was first published in 1985; I read the 2011 second edition, which has a lengthy introduction discussing reactions to the first edition of the book.

The aim of the book is to use this quite narrow case study to learn more about the rise of the “experiment” as a central activity in the way science is done. The book also explores a way of doing the history of science that was different, certainly when it was originally published in 1985.

I feel I am falling amongst philosophers and sociologists in reading this book; the ideas of Wittgenstein on “language-games” and “forms of life” are familiar to Mrs SomeBeans from her study for a doctorate in education.

Leviathan and the air-pump focuses on two of Boyle’s experiments in particular: his recreation of Torricelli’s experiment, which sees what we now know to be a partial vacuum form above mercury in an upturned, closed cylinder, and an experiment on the adhesion of smooth surfaces in a vacuum. The word “vacuum” turns out to be pivotal in the dispute with Hobbes. Hobbes held the philosophical view that there could be no such thing as a vacuum, whilst Boyle held the more mechanistic view that he did a thing which produced a space devoid of air (or much reduced in it), which he would call a “vacuum”. The book could do with a little more explanation of the modern view of these experiments. The adhesion of smooth surfaces experiment, in particular, I believe is probing a different phenomenon from the one Boyle believed.

Shapin and Schaffer’s account of Boyle’s work covers both the mechanics of the experiments and the role such experiments played in generating “matters of fact”. This rests on three pillars: doing the experiments in public, a goal of replication, and an experimental write-up along the lines of the modern form. The air-pump was a relatively early scientific instrument, which allows some dissociation between the experimenter and the audience: criticism of the device is not criticism of the experimenter.

Hobbes attacked Boyle on various fronts; fundamentally he did not hold with experimentation as a route to discovering the underlying causes of things. That role fell to philosophising and pure, rational thought. Geometry was Hobbes’ model for that manner of discovery. Shapin & Schaffer discuss, briefly, other critics of Boyle. Franciscus Linus gets a somewhat patronising treatment; he is in favour of experimentation, and actually does some himself, but Boyle is not impressed. Henry More believes in experiments, but only to demonstrate the need for God in explaining the world.

Hobbes and the Royal Society, of which Boyle was a key figure, bore the scars of the recent English Civil War; they were desperate for peace but they sought it in different ways. The Royal Society were collegiate and sought discussion followed by agreement over matters of fact. Hobbes, on the other hand, wanted peace by authority – there was a correct answer and it should be accepted on authority. Boyle and the Royal Society wanted to demonstrate that the experimental method they were developing allowed the generation of beneficial knowledge without rancour. I wonder whether reports of the extreme disputatiousness of Isaac Newton are a continuation of the Hobbes/Boyle argument.

It is easy to believe that this discussion between Boyle and Hobbes is long in the past but visit a physics department and see the interaction between experimental and theoretical physicists. There is a strong whiff of the Hobbesian about some theoretical physicists. Some theories pass because they are considered too beautiful to be wrong, deviations between theory and experiment are sometimes seen as a problem with the experiment (that’s not to say the experiments are perfect!). Experimentalists are seen, to a degree, as crude mechanicals.

Replication, discussed in this book, is a still-present issue. In the early years of the air-pump, replication was achieved, principally by Huygens, only by those who had visited London and seen the original in action. No one replicated the air-pump based solely on written reports. This is, to a degree, still true today. A secondary issue here is that the rewards of replication are minimal, particularly in the biological sciences, where so-called p-hacking means that any experiment can produce a “significant” result that won’t be replicable.

I enjoyed Leviathan and the air-pump, for me as a modern scientist, the detail of the dispute is fascinating. I can see the book being somewhat controversial amongst historians of science since it likely gives Hobbes more of a hearing, and more impact than previously. It also gives the political climate of the time a leading role in the creation of the experimental method, and by its narrow focus makes Boyle feel like the “inventor” of the modern experimental method. Overall, the book is pretty readable although it stretched my vocabulary in places – I found the preface to the second edition less readable than the original book.

Book review: The Art of Strategy by Avinash K. Dixit and Barry J. Nalebuff

Next up, some work-related reading: The Art of Strategy: A Game Theorist’s Guide to Success in Business and Life by Avinash K. Dixit and Barry J. Nalebuff.

The Art of Strategy is about game theory, a branch of economics / mathematics which considers such things as the “ultimatum game”, where one player chooses how to split $100 (e.g. keeping $60 and giving away $40) and a second player decides to accept or reject the split; if the split is rejected neither of them gets any money, otherwise they get the offered split.

In the “prisoner’s dilemma” two prisoners are each offered the opportunity to give evidence against the other. If one of them does this, and the other doesn’t, then the betrayer will be set free whilst their fellow prisoner serves a sentence. If both betray the other then they will both serve a longer sentence than if they had both kept quiet.

These examples represent the two main types of game at their simplest: the ultimatum game is an example of a sequential game (where one player makes a decision followed by the other), whilst the prisoner’s dilemma is an example of a simultaneous game (where players make their decisions simultaneously). In real life, chess is an example of a sequential game and a sealed-bid auction is a simultaneous game. Games are rarely played as a single instance: simultaneous games may be repeated (“the best out of 3”), and sequential games may involve many moves. This repetition enables the development of strategies such as “tit for tat” and punishment.

The ultimatum game and the prisoner’s dilemma provide a test bed for game theory, normally illustrating that real humans don’t act as the rational agents that economics intends! For example, in the ultimatum game players really should accept any non-zero offer, since the alternative is getting nothing; in practice players will reject offers even as high as $10 or $20 as unjust.

Sequential games are modelled using “game trees”, which are like “decision trees”. Simultaneous games are modelled with payoff tables. The complexity of real sequential games, such as chess, means we cannot inspect all possible paths in the game tree, even with high-powered computing.

The first part of The Art of Strategy finishes with some strategies for simultaneous games. These are: look for dominant strategies where they are available, i.e. strategies which are best regardless of what the other players do; if this isn’t possible, eliminate dominated strategies, i.e. those of your own strategies which are always beaten by another of your strategies. Nash equilibria are combinations of strategies from which neither player can improve their position by changing strategy unilaterally, even given knowledge of the opponent’s strategy. There can be multiple Nash equilibria in a game, which means that if strategies are not explicitly stated then the players must guess which strategy the other player is using and act accordingly. This section also covers how social context influences play, and ideas of “punishment”.
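As a concrete sketch of the equilibrium idea, here is a minimal brute-force check over a prisoner’s dilemma payoff table (the sentence lengths are my own illustrative numbers, not the book’s):

object NashEquilibria {
  def main(args: Array[String]): Unit = {
    val strategies = List("quiet", "betray")
    // Payoffs are (years in prison for row player, years for column player); fewer is better
    val payoff = Map(
      ("quiet", "quiet") -> (1, 1),
      ("quiet", "betray") -> (10, 0),
      ("betray", "quiet") -> (0, 10),
      ("betray", "betray") -> (5, 5))
    // A strategy pair is a Nash equilibrium if neither player can cut their
    // sentence by changing strategy while the other player's choice stays fixed
    for (r <- strategies; c <- strategies) {
      val (yearsR, yearsC) = payoff((r, c))
      val rowStays = strategies.forall(alt => payoff((alt, c))._1 >= yearsR)
      val colStays = strategies.forall(alt => payoff((r, alt))._2 >= yearsC)
      if (rowStays && colStays) println(s"Nash equilibrium: ($r, $c), sentences ($yearsR, $yearsC)")
    }
  }
}

Running this prints only (betray, betray) – the classic result that mutual betrayal is the unique equilibrium even though both keeping quiet would leave both players better off.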

The second part of the book looks at how the strategies described in the first part are used in action, although these examples are sometimes somewhat hypothetical. This part also introduces randomness (called “mixed strategies”) as a component of strategies.

The final part of the book covers applications of game theory in the real world, including auctions, bargaining and voting. I was interested to learn of the several sorts of auction: the English, Dutch, Japanese and Vickrey. The English auction is perhaps the one we are most familiar with: participants signal when they wish to make a bid, and the bid rises with time. The Japanese auction is similar in that the bid is always rising, but in this case all bidders start the auction with their hands raised (indicating they are bidding) and put their hands down when the price is too high. A Dutch auction is one in which the price starts high and drops; the winner is the one who first makes a bid. Finally, a Vickrey auction is a sealed-bid auction where the winner is the one who makes the highest bid, but they pay the second-highest bid.
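The Vickrey rule is compact enough to state in code; a tiny sketch with invented bids:

object VickreyAuction {
  def main(args: Array[String]): Unit = {
    // Sealed bids: the highest bidder wins but pays the second-highest bid
    val bids = Map("Alice" -> 120.0, "Bob" -> 95.0, "Carol" -> 110.0)
    val ranked = bids.toList.sortBy { case (_, bid) => -bid }
    val (winner, _) = ranked.head
    val (_, price) = ranked(1)
    println(s"$winner wins and pays $price") // Alice wins and pays 110.0
  }
}

The attraction of this design is that bidding your true valuation is the dominant strategy – shading your bid only risks losing an auction you would have been happy to win.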

Auctions are big money: the UK 3G spectrum auction in 2000 raised £22.5 billion from the participants. It’s worth spending some money to get the very best game theorists to help if you are participating in such an auction. The section on bargaining is relevant in the UK at the moment, given the Brexit negotiations, particularly the idea of the Best Alternative to a Negotiated Agreement (BATNA). Players must determine their payoff relative to the BATNA, and must convince their opponents that their BATNA is as good as possible.

I found the brief descriptions of concrete applications of game theory, such as the various “spectrum” auctions for mobile phone systems and the formation of price-fixing cartels, the most compelling part of the book.

Game theory is a central topic in at least parts of economics, as witnessed by the award of the pseudo-Nobel Prize for Economics in this area – there is a handy list here (http://lcm.csa.iisc.ernet.in/gametheory/nobel.html), if you are interested.

The Art of Strategy has some overlap with books I have read previously: the game trees have some relevance to the decision trees in Risk Assessment and Decision Analysis with Bayesian Networks by Fenton and Neil (which uses the Monty Hall problem as an illustration). The Undercover Economist by Tim Harford discusses game theory and its relevance to the mobile frequency auctions in the UK, as well as the role of information in buying second-hand cars. The Signal and the Noise by Nate Silver has some discussion of gaming statistics.

Scala – installation behind a workplace web proxy

I’ve been learning Scala as part of my continuing professional development. Scala is a functional language which runs primarily on the Java Runtime Environment. It is a first-class citizen for working with Apache Spark – an important platform for data science. My intention in learning Scala is to get myself thinking in a more functional programming style and to gain easy access to Java-based libraries and ecosystems; typically I program in Python.

In this post I describe how to get Scala installed and functioning on a workplace laptop, along with its dependency manager, sbt. The core issue here is that my laptop at work puts me behind a web proxy so that sbt does not Just Work™. I figure this is a common problem so I thought I’d write my experience down for the benefit of others, including my future self.

The test system in this case was a relatively recent (circa 2015) Windows 7 laptop. I like using bash as my shell on Windows rather than the Windows Command Prompt – I install this using the Git for Windows SDK.

Scala can be installed from the Scala website https://www.scala-lang.org/download/. For our purposes we will use the Windows binaries, since the sbt build tool requires additional configuration to work. Scala needs version 1.8 of the Java JDK to install, and the JAVA_HOME environment variable needs to point to the appropriate place. On my laptop this is:

JAVA_HOME=C:\Program Files (x86)\Java\jdk1.8.0_131

The Java version can be established using the command:

javac -version

My Scala version is 2.12.2, obtained using:

scala -version

Sbt is the dependency manager and build tool for Scala; it is a separate install, available from:

http://www.scala-sbt.org/0.13/docs/Setup.html

It is possible the PATH environment variable will need to be updated manually to include the sbt executables (/c/Program Files (x86)/sbt/bin).
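In Git Bash that might look something like the following (the path assumes sbt’s default install location):

export PATH="$PATH:/c/Program Files (x86)/sbt/bin"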

I am a big fan of Visual Studio Code, so I installed the Scala helper for Visual Studio Code:

https://marketplace.visualstudio.com/items?itemName=dragos.scala-lsp

This requires a modification to the sbt config file which is described here:

http://ensime.org/build_tools/sbt/

Then we can write a trivial Scala program like:

object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
  }
}

And run it at the command line with:

scala first.scala

To use sbt in my workplace requires proxies to be configured. The symptom of a failure to do this is that, on first run, the sbt compile command fails to download the dependencies defined in the build.sbt file, producing a line in the log like this:

[error] Server access Error: Connection reset url=https://repo1.maven.org/maven2/net/
sourceforge/htmlcleaner/htmlcleaner/2.4/htmlcleaner-2.4.pom

In my case I established the appropriate proxy configuration from the Google Chrome browser:

chrome://net-internals/#proxy

This shows a link to the pacfile, something like:

http://pac.madeupbit.com/proxy.pac?p=somecode

The PAC file can be inspected to identify the required proxy; in my case there is a statement towards the end of the pacfile which contains the URL and port required for the proxy:

if (url.substring(0, 5) == 'http:' || url.substring(0, 6) == 'https:' || url.substring(0, 3) == 'ws:' || url.substring(0, 4) == 'wss:')
{
    return 'PROXY longproxyhosturl.com:80';
}

These are added to an SBT_OPTS environment variable, which can either be set in a bash-like .profile file or using the Windows environment variable setup.

export SBT_OPTS="-Dhttps.proxyHost=longproxyhosturl.com -Dhttps.proxyPort=80 -Dhttps.proxySet=true"

As a bonus, if you want to use Java’s Maven dependency management tool you can use the same proxy settings but put them in a MAVEN_OPTS environment variable.
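For example, reusing the made-up proxy host from above:

export MAVEN_OPTS="-Dhttps.proxyHost=longproxyhosturl.com -Dhttps.proxyPort=80"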

Typically, to start a new project in Scala, one uses the sbt new command with a pointer to a g8 template. In my workplace this does not work as normally stated because it uses the git protocol, which is blocked by default (it runs on port 9418). The normal new command in sbt looks like:

sbt new scala/scala-seed.g8

The workaround for this is to specify the g8 repo in full including the https prefix:

sbt new https://github.com/scala/scala-seed.g8

This should initialise a new project, creating a whole bunch of standard directories.
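For the scala-seed template the layout is roughly as follows (from memory, so treat the exact names as indicative):

build.sbt
project/build.properties
src/main/scala/example/Hello.scala
src/test/scala/example/HelloSpec.scala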

So far I’ve completed one small project in Scala. Having worked mainly in dynamically typed languages it was nice that, once I had properly defined my types and got my program to compile, it ran without obvious error. I was a bit surprised to find no standard CSV reading / writing library as there is for Python. My Python has become a little more functional as a result of my Scala programming, I’m now a bit more likely to map a function over a list rather than loop over the list explicitly.
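A trivial illustration of that stylistic shift:

// Functional style: map a function over the list rather than looping explicitly
val counts = List(1, 2, 3, 4)
val doubled = counts.map(_ * 2)
println(doubled) // prints List(2, 4, 6, 8)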

I’ve been developing intensively in Python over the last couple of years, and this seems to have helped me in configuring my Scala environment, in terms of getting to grips with modules/packaging, dependency managers, automated documentation building, and also in finding my test library (http://www.scalatest.org/) at an early stage.