Apr 26 2018
Book review: Hands-on machine learning with scikit-learn & tensorflow by Aurélien Géron
I’ve recently started playing around with recurrent neural networks and tensorflow, which brought me to Hands-on machine learning with scikit-learn & tensorflow by Aurélien Géron; as a bonus, it also includes material on scikit-learn, which I’ve been using for a while.
The book is divided into two parts. The first, “Fundamentals of Machine Learning”, focuses on the functionality found in the scikit-learn library. It starts with the big picture, running through the types of machine learning that exist (supervised / unsupervised, batch / online and instance-based / model-based) and then some of the pitfalls and problems with machine learning, before a section on testing and validation. Next comes a medium-sized example of machine learning in action, which demonstrates how the functionality of scikit-learn can be quickly used to develop predictions of house prices in California based on census data. This is a subject after my own heart: I’ve been working with property data in the UK for the past couple of years.
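To give a flavour of how concise this can be, here is a minimal sketch along the lines of the book’s example, using the California census-based housing data bundled with scikit-learn (the model choice here is mine, purely for illustration):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Load the California census-based housing data shipped with scikit-learn
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale the features, then fit a simple linear regression
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```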
This example serves two purposes: firstly, it demonstrates the practical steps you need to take when undertaking a machine learning exercise, and secondly, it highlights just how concisely much of it can be executed in scikit-learn. The following chapters then go into more depth, first on how models are trained and scored, and then on the details of different algorithms such as Support Vector Machines and Decision Trees. This part finishes with a chapter on ensemble methods.
Although the chapters contain some maths, their strength is in the clear explanations of the methods described. I particularly liked the chapter on ensemble methods. They also demonstrate how consistent the scikit-learn library is in its interfaces. I knew that I could switch algorithms very easily with scikit-learn, but I hadn’t fully appreciated how seamlessly the library generally handles regression and multi-class classification.
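To illustrate what that consistency looks like in practice, a hedged sketch: every estimator exposes the same fit / predict / score interface, so swapping algorithms is a one-line change, and each handles the multi-class case without extra work (dataset and models chosen purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)  # a three-class problem

# The uniform estimator interface means these are interchangeable
for model in (SVC(), DecisionTreeClassifier(), RandomForestClassifier()):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean())
```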
I wonder whether, outside data science, it is perceived that data scientists write their own algorithms from scratch. In practice this is not the case, and hasn’t been the case since at least the early nineties, when I started doing data analysis which looks very similar to the machine learning based analysis I do today. In those days we used the NAG numerical library, Numerical Recipes in FORTRAN and libraries developed by a very limited number of colleagues in the wider academic community (probably shared by email attachment).
The second part of the book, “Neural networks and Deep Learning”, looks at the tensorflow library. Tensorflow has general applications for processing multi-dimensional arrays but it has been designed with deep learning and neural networks in mind. This means there are a whole bunch of functions to generate and train neural networks of different types and different architectures.
The section starts with an overview of tensorflow with some references to other deep learning libraries, before providing an introduction to neural networks in general, which have been around quite a while now. Following this there is a section on training deep learning networks, and the importance of the form of activation functions.
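For flavour, a minimal sketch in the TensorFlow 1.x style the book uses: you first define a computation graph, then execute it in a session (the shapes and values here are arbitrary):

```python
import tensorflow as tf  # TensorFlow 1.x, as used in the book

# Build a graph: a placeholder fed at run time, a variable, and an op
x = tf.placeholder(tf.float32, shape=(None, 3), name="x")
w = tf.Variable(tf.random_normal((3, 1)), name="w")
y = tf.matmul(x, w)

# Nothing has been computed yet; the session executes the graph
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))
```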
Tensorflow will run across multiple processors, GPUs and/or servers, although configuring this looks a bit complicated. Typically a single neural network layer can’t be distributed over multiple processing units.
There then follow chapters on convolutional neural networks (good for image applications), recurrent neural networks (good for sequence data), autoencoders (finding compact representations) and finally reinforcement learning (good for playing Pac-Man). My current interest is in recurrent neural networks; it was nice to see a brief description of all of the potential input/output scenarios for recurrent neural networks and how to build them.
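As a hedged sketch of the sequence-in, states-out case using the TensorFlow 1.x APIs the book covers (sizes are arbitrary; depending on the exact 1.x version the cell class may live under tf.contrib.rnn instead):

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x APIs

# Inputs: batches of sequences of feature vectors;
# outputs: one hidden state per time step
n_steps, n_inputs, n_neurons = 20, 5, 32
X = tf.placeholder(tf.float32, shape=(None, n_steps, n_inputs))
cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
outputs, final_state = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = np.random.rand(4, n_steps, n_inputs).astype(np.float32)
    out = sess.run(outputs, feed_dict={X: batch})
    print(out.shape)  # (4, 20, 32): a hidden state per step per sequence
```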
I spent a few years doing conventional image analysis, and convolutional neural networks feel quite similar to the convolution filters I used then, although they stack more layers (or filters) than are normally used in conventional image analysis. Furthermore, in conventional image analysis the kernels are typically handcrafted to perform certain tasks (such as detecting horizontal or vertical edges), whilst neural networks learn their kernels in training. In conventional image analysis convolution is often done in Fourier space since it is more efficient, and I see there are experiments along these lines with convolutional neural networks.
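For contrast with the learned kernels of a CNN, a sketch of the handcrafted approach: a classic Sobel-style vertical-edge kernel applied by direct convolution (the image here is just random numbers standing in for real data):

```python
import numpy as np
from scipy.signal import convolve2d

# A handcrafted vertical-edge kernel of the kind used in conventional
# image analysis; a CNN would learn kernels like this during training
sobel_vertical = np.array([[-1, 0, 1],
                           [-2, 0, 2],
                           [-1, 0, 1]], dtype=float)

image = np.random.rand(64, 64)  # stand-in for a real image
edges = convolve2d(image, sobel_vertical, mode="same")
print(edges.shape)  # same spatial size as the input
```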
Developing and training neural networks has the air of an experimental science rather than a theoretical one. That’s to say, rather than thinking hard and coming up with an effective neural network and training scheme, one needs to tinker with different designs and training methods and see how well they work. It has more the air of training an animal than programming a computer. There are a number of standard training / test sets of images, and successful models trained against these by other people can be downloaded. Such models can be used as-is, but alternatively just parts can be reused, as sketched below.
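A sketch of that reuse, here via the pre-trained VGG16 weights bundled with Keras; this is one common way of doing it rather than the book’s exact recipe, and the input shape and number of classes are illustrative (the weights are downloaded on first use):

```python
from tensorflow import keras

# Fetch a model pre-trained on ImageNet, keep its convolutional layers,
# freeze them, and bolt a new classifier head on top
base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False  # reuse the learned features as-is

model = keras.Sequential([
    base,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(10, activation="softmax"),  # 10 illustrative classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```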
This section has many references to the original literature for the development of deep learning, highlighting how recent this new iteration of neural networks is.
Overall, an excellent book: scikit-learn and tensorflow are the go-to libraries for Python programmers wanting to do machine learning and deep learning respectively. This book describes their use eloquently, with references to the original literature where appropriate, whilst providing a good overview of both topics. The code used in the book can be found on GitHub, as a set of Jupyter Notebooks.
Mar 18 2018
Book review: The Philosophical Breakfast Club by Laura J. Snyder
The Philosophical Breakfast Club by Laura J. Snyder is an ensemble biography of William Whewell (pronounced: who-ell), Charles Babbage, Richard Jones and John Herschel who were all born towards the end of the 18th century and died in the later half of the 19th century. Their joint project was the professionalisation of science.
The pattern for their reform was Francis Bacon’s New Atlantis published in 1626 which fictionalised a government-funded science institution whose work was for the public good, and whose philosophical basis was the systematic collection of facts, including experimentation, from which scientific theories would be gleaned by induction. As such they follow in the footsteps of the founding fathers of the Royal Society who also took Bacon as their guiding light.
By the early years of the 19th century the Royal Society had drifted in its purpose since its founding; it was more a gentlemen’s dining club than a scientific society, with one’s position in society a more important factor than scientific achievement in gaining entry.
Prior to reading this book I recognised the names of Whewell, famous for coining the term “scientist”, Herschel and Babbage – the former the son of William Herschel, the astronomer, the latter the inventor of the Difference Engine. I also knew that Herschel and Babbage had been involved in attempts to reform the Royal Society.
Richard Jones was unknown to me. His contributions were in the foundations, or at least the building, of the field of economics. In particular he proposed an economics based on induction: that is to say, one should go out and collect facts about the economy and infer rules about the operation of economies from the data. The alternative is to hypothesise some simple rules and elaborate the consequences of those rules – this is known as deduction. In economics Ricardo and Malthus had been early proponents of this deductive method. Jones went on to become one of the commissioners under the Tithe Commutation Act 1836, which converted the payments in kind of the old tithe system into what was effectively a local tax.
Babbage, Jones and Herschel all came from moderately wealthy backgrounds for whom the path to Cambridge University was relatively smooth. Whewell, on the other hand, was the son of a carpenter, which, although a respectable trade, would not fund attendance at the university. Whewell was educated at a grammar school in Lancaster, his home town, as a result of being spotted by the local gentry, who also smoothed his path into Cambridge. This appears to be the route by which the lower middle class entered university – chance encounters.
The four men met at Cambridge University where they formed the Philosophical Breakfast Club. It was at a time when gathering together and discussing politics was seen as borderline seditious. It was not long after the French revolution and the Great Reform Act was yet to come. They corresponded throughout the rest of their lives but there is no feeling from the book that their collaboration to change the face of science was at all formal (or even subject to an overall plan).
At Cambridge Babbage and Whewell were responsible for driving the use of Leibniz’s notation for calculus, in place of Newton’s notation to which the university had adhered for some time. Leibniz’s notation is the one in use today, generally it is seen as clearer than the Newtonian version and more amenable to extension.
Post-Cambridge, Babbage started work on mechanical computing, managing to extract large quantities of money from the government for this work, exceptional at the time, although he did not deliver a working device. The first Difference Engine was designed to calculate mathematical tables. The later Analytical Engine was very much like modern computers in its architecture. Neither of these devices was ever fully constructed. Babbage could best be described as a mathematician, which put him into some conflict with others in the Breakfast Club, since mathematics is rather more deductive than inductive in its basis. Later in his life he seems to have become involved in codebreaking, quite possibly for the government, although the evidence for this is circumstantial.
Babbage also led a ferocious attack on the Royal Society in his book Reflections on the Decline of Science in England. The British Association for the Advancement of Science (BAAS) followed on from this, although Babbage, Herschel and Whewell did not attend its first meeting. The BAAS annual meetings became rather large, and there was muttering at the time about the attendees’ penchant for fine dining. Unlike the Royal Society, it was open to all, even women! I was interested to read about the foundation of my own professional society, the Royal Statistical Society. It started as a section of the British Association for the Advancement of Science, where it proved contentious because it was concerned with the collection and analysis of social data, which surely leads to politics. Babbage and Jones set up the London Statistical Society, which was to become the Royal Statistical Society.
After Cambridge, Herschel spent some time in South Africa measuring the location of stars in the southern skies, following on from the family business. He became president of the Royal Astronomical Society and published several books on astronomy as well as star catalogues. He was also involved in the development of photography; he was an enthusiastic chemical experimenter and appears to have guided Henry Fox Talbot in fixing his early photographic images.
Whewell remained at Cambridge University for the rest of his life, where he later became the Master of Trinity College. As well as his efforts in changing the teaching of calculus he introduced the Natural Sciences Tripos (parts of which I have taught). His publications were mainly in the history and philosophy of science. He was involved in some scientific endeavours – the measurement and analysis of the tides, for example. Although he coined the term “scientist” in 1833 it wasn’t to gain much currency until much later in the century.
Snyder identifies the period 1820-70 as one where there was a great transition in science from being a gentleman’s hobby to a (sort of) mass participation activity with at least some regard for practical application, a defined career path at least for a few and some more regular government funding.
I found The Philosophical Breakfast Club very readable. It covers a period of great transition in science in the UK, and makes a nice companion to Henrietta Heald’s biography of William Armstrong.
Feb 10 2018
Book review: William Armstrong–Magician of the North by Henrietta Heald
A return to industrial history with William Armstrong: Magician of the North by Henrietta Heald. Armstrong was a 19th century industrialist who spent his life in the north-east of England around Newcastle. His great industrial innovation was the introduction of hydraulic power to cranes and the like. His great wealth, and honours (a knighthood and then a baronetcy), derived from his work in the invention and sale of armaments, principally artillery and ships. His home, Cragside near Rothbury, some 30 miles north of Newcastle upon Tyne, was the first to feature electric lighting, amongst many other technical innovations.
Armstrong was a contemporary of Robert Stephenson, Isambard Kingdom Brunel and Joseph Whitworth – they were all born near the beginning of the 19th century. Armstrong, dying in 1900, outlasted them all, with Brunel and Robert Stephenson both dying in 1859.
Armstrong was born in 1810, and his parents started him on a career in the law. However, he had always been fascinated by water, and this led to his realisation of the power that could be extracted from a head of water in a sealed system. A water wheel extracts energy from water falling the height of the wheel, a matter of a few metres. A sealed iron pipe, such as could now be manufactured, allowed you to capture the energy from a fall of tens of metres or more. In Newcastle upon Tyne the local landscape could provide this head of pressure, but with a little ingenuity the head of pressure could be created with a steam engine or other mechanical means. This energy could be used to drive all manner of machinery; Armstrong initially used it to power cranes and lock gates, to be used in docks and the many factories springing up around the country. Ultimately his hydraulic mechanisms drove London’s Tower Bridge.
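To put rough numbers on the comparison: the energy available per unit mass of water is proportional to the height of the fall, so the gain from a sealed pipe over a wheel is simply the ratio of their heads (the specific heights below are illustrative):

\[
E = mgh \quad\Rightarrow\quad \frac{E_{\text{pipe}}}{E_{\text{wheel}}} = \frac{h_{\text{pipe}}}{h_{\text{wheel}}} \approx \frac{40\ \text{m}}{4\ \text{m}} = 10
\]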
In the aftermath of the Crimean War, Armstrong switched his attention to building artillery. During the war the British artillery had been found wanting in terms of accuracy, destructive power and firing rate. His innovations were to move from cannonballs to shells (shaped like bullets), and from muzzle loading to breech loading. He gave up the patents for his artillery pieces to the government but made a fair business on them. His activities with ordnance led to his knighthood and baronetcy, although ultimately he withdrew from his close relationship with the British government in armaments as a result of political manoeuvrings by competitors.
The manufacture of artillery led to the manufacture of warships, which incidentally also carried the artillery. The Japanese Navy was a particularly important customer.
He was a leading light of the Literary and Philosophical Society of Newcastle upon Tyne (Lit & Phil), and contributed to founding what is now Newcastle University. Late in his life, in 1897, he published Electric movement in air and water based on his experiments and featuring cutting-edge photographs of the phenomena he described. From a scientific point of view, Armstrong is not a name you will hear in physics classrooms (at any level) today – I don’t know if the same holds for his engineering innovations. Also late in his life he bought Bamburgh Castle, and spent a fair amount of money refurbishing it.
Magician of the North is a somewhat sympathetic view of Armstrong, along the lines of Man of Iron, Julian Glover’s book about Thomas Telford. This contrasts with Samuel Smiles’ biography of George Stephenson and Rolt’s of Brunel, which are much more effusive about their subjects. Armstrong’s arms trading is discussed in some detail; it seems the company sailed somewhat close to the wind legally in supplying both sides in the American Civil War. A second blemish on Armstrong’s reputation came from industrial disputes with his, and other, workers on the Tyne, who were asking for shorter working hours. That said, he was clearly a pillar of the Newcastle and north-eastern community and highly regarded by most of the people most of the time. Many buildings in Newcastle bore his name as a result of his donations, both while he was alive and after he died.
As usual, the author of this biography bemoans the limited attention her subject has received. In the case of Armstrong she puts this down to his extensive involvement in the arms trade which, never the most popular, was to fall further out of favour following the Great War. I’ve never seen a quantitative analysis of what makes the right amount of attention for figures in the history of science and technology.
William Armstrong died in 1900, and after his death his company went into a slow decline. The Great War led to a distaste for the arms trade, and then came the Great Depression. With Armstrong gone there was no strong, capable leader for the company. The Armstrong name lived on in various spin-off companies such as Armstrong Siddeley, and in various amalgamations with Whitworths and Vickers.
Jan 08 2018
Book reviews: Christmas Extravaganza!
I’ve given up writing my full-length posts for my Christmas book haul which, as I show below, was rather fine. This is in part a result of the type of books one gets for Christmas, and in part the conditions under which they are read – in a Christmas-cake-induced haze, for my part.
A Philosophy of Walking by Frédéric Gros
A Philosophy of Walking by Frédéric Gros was, like the best presents, something I wouldn’t have got for myself but nevertheless enjoyed.
The book interleaves chapters on various walking-related thoughts with some walking-oriented biographical content. This covers Nietzsche, Rimbaud, Rousseau, Nerval, Kant and Gandhi. The predominant feeling from these biographies is a bit grim: several of the protagonists died young or after prolonged illness – Nerval committed suicide. Their walking feels compulsive. Gandhi lived to a ripe old age but was ultimately assassinated. Kant took the same walk every day; he lived to 80 but it sounds pretty dull!
The chapter on pilgrimage struck a chord with me: I’ve been meditating for a while, which often involves focusing on a mantra or a physical manifestation, like breathing. Some pilgrims take a similar approach, combining walking with a prayer-like mantra.
Somehow the author has missed our family favourite walking habit – humming the Imperial March from Star Wars as a rhythm to walk to along broad well-made paths in the Lake District.
The book is translated from French; I learned this on reading in a footnote that the French word témoin, which I knew meant “witness”, also refers to the baton in a relay race.
New Views by Alastair Bonnett
New Views by Alastair Bonnett is a different manner of Christmas book, a coffee table book – as are the rest of the books in this post.
New Views is a collection of world maps illustrating different data in three broad areas which could be described as physical, human and animal, and trade. The pattern is the same in each case – a double page contains a map with key, and on the following double page is some text describing the context of the map and another, different graphic. The maps are very much on the global scale, cities may be mentioned here and there but the overwhelming impression is of the world as a whole, not individual countries.
I liked the map of lightning strikes which highlights odd areas, particularly in the east of the Democratic Republic of Congo which has the highest rate of lightning strikes in the world. The maps of amphibian and bird diversity are fun too – they map out features of the underlying geography like rivers and mountains but in different ways.
I was surprised to learn just how big an exporter of nuts the US is; I should have known this, since a constant in my last job was the monthly scrape of the reports of the Almond Board of California for a customer. I also learned that Brazil exports no Brazil nuts because they don’t grow there!
Sometimes the colour keys are a bit cryptic, that’s to say I couldn’t distinguish between two categories on the scale. On another map, countries where there is no data are omitted completely, which makes the map difficult to parse unless you have photographic recall of the shapes of the countries of the world. I was puzzled to learn that the viper was the only poisonous snake in the United Kingdom – I always called them “adders”.
This is a creditable work of this genre.
Bird by Andrew Zuckerman
Bird by Andrew Zuckerman is an immense book, comprised entirely of photographs of birds shown against a pure white background. There are a few words, and a pictorial index which names the birds, at the back, but the main body of the book is completely wordless.
The pictures are gorgeous, but I found myself wanting more, having flicked through to the end of the book. The style is intentional and is contrasted with that of Audubon, who included much more context in his famous paintings of birds. Technically, the photographs are very good to excellent.
Zuckerman has produced a number of books in this style; I’m most interested to see his works on flowers and creatures.
Vermeer: The Complete Works by Karl Schütz
Vermeer: The Complete Works by Karl Schütz. I was surprised to read that there are only 35 works attributed to Vermeer. This may be because he fell out of popularity after his death in the 17th century and interest was not revived until the 19th century.
Canaletto & the art of Venice by Rosie Razzall and Lucy Whitaker
Canaletto & the art of Venice by Rosie Razzall and Lucy Whitaker. The authors’ names are very discreetly displayed on this volume. I’m a fan of Canaletto – I love the almost CAD-like precision of his architectural paintings.
Dec 28 2017
Book review: Fraud analytics by B. Baesens, V. Van Vlasselaer and W. Verbeke
This next book is rather work oriented: Fraud Analytics using descriptive, predictive and social network techniques: A guide to data science for fraud detection by Bart Baesens, Veronique van Vlasselaer and Wouter Verbeke.
Fraud analytics starts with an introductory chapter on the scale of the fraud problem and some examples of types of fraud; it also provides an overview of the chapters that are to come. In the UK fraud losses stand at about £73 billion per annum, and typically organisations lose anything up to 5% of revenue to fraud. There are many types of fraud: credit card fraud, insurance fraud, healthcare fraud, click fraud, identity theft and so forth.
There then follows a chapter on data preparation, sampling and preprocessing. This includes some domain-related elements, such as the importance of the so-called RFM attributes – Recency, Frequency and Monetary value – which are the core variables for financial transactions. Also covered are missing values and data quality, which are more general issues in statistics.
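As a hedged sketch of what computing RFM attributes from a transaction log might look like in pandas (the column names and data are invented for illustration):

```python
import pandas as pd

# Toy transaction log; in practice this would come from a database
tx = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "b"],
    "date": pd.to_datetime(["2017-01-05", "2017-03-01", "2017-02-10",
                            "2017-02-20", "2017-03-15"]),
    "amount": [120.0, 35.0, 10.0, 55.0, 20.0],
})

now = tx["date"].max()
rfm = tx.groupby("customer").agg(
    recency=("date", lambda d: (now - d.max()).days),  # days since last transaction
    frequency=("date", "count"),                       # number of transactions
    monetary=("amount", "sum"),                        # total spend
)
print(rfm)
```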
The core of the book is three long chapters on descriptive statistics, predictive analysis and social networks.
Descriptive statistics covers classical statistical techniques, from the detection of outliers using the z-score (the number of standard deviations a value lies from the mean) through to clustering techniques such as k-means. These clustering techniques fall into the category of unsupervised machine learning. The idea here is that fraudulent transactions are different to non-fraudulent ones; this may be a temporal separation (i.e. a change in customer behaviour may indicate that their account has been compromised and used nefariously) or it might be a snapshot across a population where fraudulent actors behave differently to non-fraudulent ones. Clustering techniques and outlier detection seek to identify these “different” transactions, usually for further investigation – that’s to say, automated methods are used as a support for human investigators, not a replacement. This means that ranking transactions for potential fraud is key. Obviously fraudsters are continually adapting their behaviour to avoid standing out, and so fraud analytics is an arms race.
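A minimal sketch of both ideas on synthetic data: z-score outlier detection, and ranking points by their distance from the nearest k-means cluster centre (the threshold and cluster count are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
amounts = rng.normal(50, 10, size=1000)  # stand-in transaction amounts
amounts[::100] = 500                     # inject a few anomalies

# Outlier detection via z-score: distance from the mean in standard deviations
z = (amounts - amounts.mean()) / amounts.std()
outliers = np.where(np.abs(z) > 3)[0]

# Unsupervised clustering: points far from every cluster centre are
# candidates for human investigation, so rank by that distance
X = amounts.reshape(-1, 1)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
distances = km.transform(X).min(axis=1)
suspicious = np.argsort(distances)[-10:]  # the 10 most isolated points
print(outliers, suspicious)
```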
Predictive analysis is more along the lines of regression, classification and machine learning. The idea here is to develop rules for detecting fraud from training sets containing example transactions which are known to be fraudulent or not fraudulent. Whilst not providing an in-depth implementation guide, Fraud Analytics gives a very good survey of the area. It discusses different machine learning algorithms, including their strengths and weaknesses, particularly with regard to model “understandability”. Also covered are a wide range of model evaluation methods, and the importance of an appropriate training set. A particular issue here is that fraud is relatively uncommon, so care needs to be taken in sampling training sets such that algorithms have a chance to identify fraud. These are perennial issues in machine learning and it is good to see them summarised here.
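One common way of handling that rarity, sketched here on synthetic data, is to re-weight the minority class; the book discusses several sampling strategies, of which this is only one:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Synthetic, highly imbalanced data: roughly 2% "fraud"
X, y = make_classification(n_samples=10000, weights=[0.98, 0.02],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" up-weights the rare class so the classifier
# doesn't simply predict "not fraud" everywhere
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```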
The chapter on social networks clearly presents an active area of research in fraud analytics. It is worth highlighting here that the term “social” is meant very broadly: it is only marginally about social networks like Twitter and Facebook. It is much more about networks of entities such as the claimant, the loss adjustor, the law enforcement official and the garage carrying out repairs. Also relevant are networks of companies and their directors, set up to commit corporate frauds. Network (aka graph) theory is the appropriate, efficient way to handle such systems. In this chapter, network analytic ideas such as “betweenness” and “centrality” are combined with machine learning involving non-network features.
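A hedged sketch of extracting such network features with networkx (the entities and edges are invented; in practice these scores would be joined onto transaction-level features and fed to a classifier):

```python
import networkx as nx

# Toy network of entities involved in insurance claims; edges link
# entities that appear on the same claim
G = nx.Graph()
G.add_edges_from([
    ("claimant_1", "garage_A"), ("claimant_2", "garage_A"),
    ("claimant_3", "garage_A"), ("claimant_1", "adjustor_X"),
    ("claimant_2", "adjustor_X"), ("claimant_4", "garage_B"),
])

# Betweenness and degree centrality: entities that sit on many paths
# or have many connections may merit a closer look
betweenness = nx.betweenness_centrality(G)
degree = nx.degree_centrality(G)
print(sorted(betweenness.items(), key=lambda kv: -kv[1])[:3])
```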
The book finishes with chapters on fraud analytics in operation, and a wider view. How do you use these models in production? When do you update them? How do you update them? The wider view includes some discussion of data anonymisation prior to handing it over to data scientists. This is an important area: data protection regulations across the EU are tightening up, and breaches of personal data can have serious consequences for the companies involved. Anonymisation may also provide some protection against producing biased models, i.e. those that discriminate unfairly against people on the basis of race, gender or economic circumstances, although this area should attract more active concern.
A topic not covered but mentioned a couple of times is natural language processing, for example analysing the text of claims against insurance policies.
It is best to think of this book as a guide to various topics in statistics and data science as applied to the analysis of fraud. The coverage is more in the line of an overview, rather than an in depth implementation guide. It is pitched at the level of the practitioner rather than the non-expert manager. Aside from some comments at the end on label-based security access control (relating to SQL) and some screenshots from SAS products it is technology agnostic.
Occasionally the English in this book slips from being fully idiomatic, but it is still fully comprehensible – it simply reads a little oddly. Not a fun read, but an essential starter if you’re interested in fraud and data science.