Author's posts
Dec 30 2015
Book review: Risk Assessment and Decision Analysis with Bayesian Networks by N. Fenton and M. Neil
As a new member of the Royal Statistical Society, I felt I should learn some more statistics. Risk Assessment and Decision Analysis with Bayesian Networks by Norman Fenton and Martin Neil is certainly a mouthful but, despite its dry title, it is a remarkably readable book on Bayes' Theorem and how it can be used in risk assessment and decision analysis via Bayesian networks.
This is the “book of the software”: the reader gets access to the “lite” version of the authors’ AgenaRisk software, the website is here. The book makes heavy use of the software, both in presenting Bayesian networks and in the features discussed. This is no bad thing: the book is about helping people who analyse risk or build models to do their job, rather than providing a deeply technical presentation for those who might be building tools or doing research in the area of Bayesian networks. With access to AgenaRisk the reader can play with the examples provided and make a rapid start on their own models.
The book is divided into three large sections. The first six chapters provide an introduction to probability, and the assessment of risk (essentially working out the probability of a particular outcome). The writing is pretty clear; I think it's the best explanation of the null hypothesis and p-values that I've read. The notorious “Monty Hall” problem is introduced. It then goes into Bayes' theorem in more depth.
Bayes' Theorem originates in the writings of the Reverend Bayes, published posthumously in 1763. It concerns conditional probability: the likelihood that a hypothesis H is true given evidence E, written P(H|E). The core point is that we often have the inverse of what we want: an understanding of the likelihood of the evidence given a hypothesis, P(E|H). Bayes' Theorem gives us a route to calculate P(H|E) from P(E|H), P(E) and P(H). The second benefit here is that we can codify our prejudices (or not) using priors; other techniques deny the existence of such priors.
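The arithmetic is small enough to show in a few lines. Here is a minimal sketch using made-up numbers for a hypothetical diagnostic test (none of these figures come from the book):

```python
# Bayes' Theorem: P(H|E) = P(E|H) * P(H) / P(E)
# Hypothetical diagnostic-test numbers, purely for illustration.
p_h = 0.01              # prior P(H): 1% of people have the condition
p_e_given_h = 0.95      # likelihood P(E|H): test positive in 95% of true cases
p_e_given_not_h = 0.05  # false positive rate P(E|not H)

# Total probability of the evidence (a positive test), P(E)
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Posterior probability of the hypothesis given the evidence
p_h_given_e = p_e_given_h * p_h / p_e

print(round(p_h_given_e, 3))  # 0.161 - far lower than intuition suggests
```

The counter-intuitively low posterior, driven by the low prior, is exactly the sort of reasoning the book argues should be made explicit.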
Bayesian statistics are often put in opposition to “frequentist” statistics. This division is sufficiently pervasive that starting to type frequentist, Google autocompletes with vs Bayesian! There is also an xkcd cartoon. Fenton and Neil are Bayesians and put the Bayesian viewpoint. As a casual observer of this argument I get the impression that the Bayesian view is prevailing.
Bayesian networks are structures (graphs) in which we connect together multiple “nodes” of Bayes' theorem. That's to say we have multiple hypotheses with supporting (or not) evidence which lead to a grand “outcome” or hypothesis. Such a grand outcome might be the probability that someone is guilty in a criminal trial or that your home might flood. These outcomes are conditioned on multiple pieces of evidence, or events, that need to be combined. The neat thing about Bayesian networks is that we can plug in what data we have to make estimates of things we don't know – regardless of whether or not they are the “grand outcome”.
The “Naive Bayesian Classifier” is a special case of a Bayesian network where the nodes are all independent leading to a simple hub and spoke network.
Bayesian networks were relatively little used until computational developments in the 1980s meant that arbitrary networks could be “solved”. I was interested to see David Spiegelhalter's name appear in this context; arguably he is one of the few publicly recognisable statisticians in the UK.
The second section, covering four chapters, goes into some practical detail on how to construct Bayesian networks. This includes recurring idioms in Bayesian networks, which they name the cause-consequence idiom, measurement idiom, definitional/synthesis idiom and induction idiom. The idea is that when you address a problem, rather than starting with a blank sheet of paper, you select the appropriate idiom as a starting point. The typical problem is that the “node probability tables” can quickly become very large for a carelessly constructed Bayesian network; the idioms help reduce this complexity.
Along with idioms this section also covers how ranked and continuous scales are handled, and in particular the use of dynamic discretization schemes for continuous scales. There is also a discussion of confidence levels which highlights the difference in thinking between Bayesians and frequentists, essentially the Bayesians are seeking the best answer given the circumstances whilst the frequentists are obsessing about the reliability of the evidence.
The final section of three chapters gives some concrete examples in specific fields: operational risk, reliability and the law. Of these I found the law examples the most pertinent. Bayes analysis fits very comfortably with legal cases: in theory, a legal case is about assigning a probability to the guilt or otherwise of a defendant by evaluating the strength (or probability that they are true) of evidence. In practice one gets the impression that faulty “commonsense” can prevail in emotive cases, and experts in Bayesian analysis are only brought in at appeal.
I don’t find this surprising, you only have to look at the amount of discussion arising from the Monty Hall problem to see that even “trivial” problems in probability can be remarkably hard to reason clearly about. I struggle with this topic myself despite substantial mathematical training.
Overall, a readable book on a complex topic. If you want to know about Bayesian networks and want to apply them then it is definitely worth getting, but it is not an entertaining book for a casual reader.
Dec 24 2015
Book review: Spark GraphX in Action by Michael S. Malak and Robin East
I wrote about Spark not so long ago when I reviewed Learning Spark, at the time I noted that Learning Spark did not cover the graph processing component of Spark, GraphX. Spark GraphX in Action by Michael S. Malak and Robin East fills that gap.
I read the book via Manning's Early Access Program (MEAP); they approached me and gave me access to the whole book for free. This meant I read it on my Kindle, which I tend not to do these days for technical books because I still find paper a more flexible medium. Early Access means the book is still a little rough around the edges, but it is complete.
The authors suggest that readers should be comfortable reading Scala code to enjoy the book. Scala is the language Spark is written in, and the best way to access GraphX. In fact access via Python (my favoured route) is impossible, and using Java it sounds ugly. Scala is a functional language which runs on the Java virtual machine. It seems to be motivated by a desire to remove Java's verbosity but perhaps goes a little too far. A `return` keyword is not needed to identify the return value of a function – the value of the last expression is returned. Its affectation is to overload the meaning of the underscore `_`. As it was I felt comfortable enough reading Scala code. I was interested to read that the two “variable” definitions are `val` and `var`: `val` is immutable and is preferred, `var` is mutable. This is probably a lesson for my Python programming – immutable “variables” can provide higher performance (and using immutable for things that you intend to be immutable aids clarity and debugging).
From the point of view of someone who has read about Spark and graph theory in the past the book is pitched at the right level, there is some introductory material about Spark and also about graph theory and then a set of examples. The book finishes with some material on inspecting running jobs in Spark using the Spark web interface. If you have never heard of Spark, then this book probably isn’t a good place to start.
The examples start with basic algorithms on measuring shortest paths across a graph, connectedness and the PageRank algorithm on which Google was originally built. These are followed by simple implementations of some further algorithms including shortest paths with weighted edges (essential for route finding) and the travelling salesman problem. There then follows a chapter on some machine learning algorithms including recommendation engines, spam detection, and document clustering. Where appropriate the authors cite the original papers for algorithms including PageRank, Pregel (Google's graph processing framework) and SVD++ (which was a key component of the winning entry for the Netflix recommendation prize), which is very welcome. The examples are outlines rather than full implementations of these sophisticated algorithms.
Finally, there is a chapter titled “The Missing Algorithms”, this is more a discussion of utility functions for GraphX in terms of import from other schemes such as RDF, operations such as merging two graphs or trimming away stray vertices.
The book gives the impression that GraphX is not ready for the big time yet, in a couple of places the authors said “this bit has only just started working”, and when they move on to talking about using SVD++ in GraphX they explain how the algorithm is only half implemented in GraphX. Full implementations are available in other languages.
It seemed to me on my original reading about Spark that the big benefit was that you could write machine learning systems in a familiar language which ran on a single machine in Spark, and then scale up effortlessly to a computing cluster, if required. Those benefits are not currently present in GraphX: you need to worry about coding in a foreign language and about the quality of the underlying implementation. It feels like the appropriate approach (for me) would be to prototype using Python/Neo4j, and likely discover that this is all that is needed. Only if you have a very large graph do you need to consider switching to a Spark-based solution, and I'm not convinced GraphX is how you would do it even then.
The code samples are poorly formatted but you can fix this by downloading the source code and viewing it in the editor of your choice with nice syntax highlighting and consistent indenting – this makes things much clearer. The figures are clear enough but I find the Kindle approach of embedding thumbnail scale figures unhelpful – you need to double click them to make them readable. A reasonable solution would be to make figures full page by default, if that is possible.
This is one of the better “* in Action” books I’ve read, it’s not convinced me to use GraphX – quite the reverse – but that’s no bad thing and I’ve learnt a little about recommender algorithms and Scala.
Dec 17 2015
Book review: The Invention of Nature by Andrea Wulf
The Invention of Nature by Andrea Wulf is subtitled The Adventures of Alexander von Humboldt – this is his biography.
Alexander von Humboldt was born in Berlin in 1769 and died in 1859, the year in which On the Origin of Species was published. He was a naturalist of a Romantic tendency, born into an aristocratic family, giving him access to the Prussian court.
He made a four-year journey to South America in 1800 which he reported (in part) in his book Personal Narrative, which was highly influential – inspiring Charles Darwin amongst many others. On this South American trip he made a huge number of observations across the natural and social sciences and was sought after by the newly formed US government as the Spanish colonies started to gain independence. Humboldt was a bit of a revolutionary at heart, looking for the liberation of countries, and also of slaves. This was one of his bones of contention with his American friends.
His key scientific insight was to see nature as an interconnected web, a system, rather than a menagerie of animals created somewhat arbitrarily by God. As part of this insight he saw the impact that man made on the environment, and in some ways inspired what was to become the environmentalist movement.
For Humboldt the poetry and art of his observations were as important as the observations themselves. He was a close friend of Goethe who found him a great inspiration, as did Henry David Thoreau. This was at the time when Erasmus Darwin was publishing his “scientific poems”. This is curious to the eye of the modern working scientist, modern science is not seen as a literary exercise. Perhaps a little more effort is spent on the technical method of presentation for visualisations but in large part scientific presentations are not works of beauty.
Humboldt was to go voyaging again in 1829, conducting a whistle-stop 15,000-mile, 25-week journey across Russia sponsored by the Russian government. On this trip he built on his earlier observations in South America as well as carrying out some mineral prospecting observations for his sponsors.
Despite a paid position in the Prussian court in Berlin he much preferred to spend his time in Paris, only pulled back to Berlin as the climate in Paris became less liberal and his paymaster more keen to see value for money.
Personally he seemed to be a mixed bag: he was generous in his support of other scientists, but in conversation seems to have been a force of nature – Darwin came away from a meeting with him rather depressed, having failed to get a word in edgewise!
I’m increasingly conscious of how the climate of the time influences the way we write about the past. This seems particularly the case with The Invention of Nature. Humboldt’s work on what we would now call environmentalism and ecology are highly relevant today. He was the first to talk so explicitly about nature as a system, rather than a garden created by God. He pre-figures the study of ecology, and the more radical Gaia Hypothesis of James Lovelock. He was already alert to the damage man could do to the environment, and potentially how he could influence the weather if not the climate. There is a brief discussion of his potential homosexuality which seems to me another theme in keeping with modern times.
The Invention of Nature is sub-subtitled “The Lost Hero of Science”, this type of claim is always a little difficult. Humboldt was not lost, he was famous in his lifetime. His name is captured in the Humboldt Current, the Humboldt Penguin plus many further plants, animals and geographic features. He is not as well-known as he might be for his theories of the interconnectedness of nature, in this area he was eclipsed by Charles Darwin. In the epilogue Wulf suggests that part of his obscurity is due to anti-German sentiment in the aftermath of two World Wars. I suspect the area of the “appropriate renownedness of scientific figures of the past” is ripe for investigation.
The Invention of Nature is very readable. There are seven chapters illustrating Humboldt’s interactions with particular people (Johann Wolfgang von Goethe, Thomas Jefferson, Simon Bolivar, Charles Darwin, Henry David Thoreau, George Perkins Marsh, Ernst Haeckel and John Muir). Marsh was involved in the early environmental movement in the US, Muir in the founding of the Yosemite National Park (and other National Parks). At first I was a little offended by this: I bought a book on Humboldt, not these other chaps! However, then I remembered I actually prefer biographies which drift beyond the core character and this approach is very much in the style of Humboldt himself.
Nov 24 2015
Parsing XML and HTML using xpath and lxml in Python
For the last few years my life has been full of the processing of HTML and XML using the lxml library for Python and the xpath query language. xpath is a query language designed specifically to search XML, unlike regular expressions which should definitely not be used to process XML related languages. Typically this has involved a lot of searching my own code to remind me how to do stuff. This blog post captures some handy snippets to avoid the inevitable Googling, and solidify for me exactly what I’ve been doing for the last few years!
But what does it {xml, html} look like?
xml and html are made up of “elements”, delimited by pointy brackets, which can carry attributes with values:
<element1 attribute1="thing">content</element1>
Elements can be nested inside other elements to make a tree structure. A wrinkle to be aware of is the so-called “tail” of an element. This is most often seen with <br/> tags (although any element can have a tail):
<element1 attribute1="thing">content<br/>tail</element1>
The “content” is accessed using the `.text` property (or `text()` in an XPath query), whilst the tail is accessed using `.tail`.
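A minimal sketch of this, parsing the example element above with lxml (the element and attribute names are just the placeholders from the snippet):

```python
from lxml import etree

# The example element from the text, as well-formed XML
root = etree.fromstring('<element1 attribute1="thing">content<br/>tail</element1>')

print(root.text)                # 'content' - text before the first child element
br = root[0]                    # the <br/> child element
print(br.tail)                  # 'tail' - text following the <br/> element
print(root.get('attribute1'))   # 'thing' - attribute value
```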
Web pages are made from HTML which is a “relaxed” XML format. XML is the basis of many other file formats found in the wild (such as GPX and GML). Dealing with XML is very similar to dealing with HTML except for namespaces, which I discuss in more detail at the end of this post.
XPath Helper
Before I get onto xpath I should introduce xpath helper – which is a plugin for Google Chrome which helps you develop xpath queries.
You can find XPath Helper in the Chrome Store, it is free. I use it in combination with the Google Chrome Developer tools, particularly the “Inspect Element” functionality. XPath Helper allows you to see the results of an xpath query live. You open up the XPath console (Ctrl+Shift+X), type in your xpath and you see the results both in the XPath Helper console and as highlighting on the page.
You can get automatically generated xpath queries, however typically I have used these just as inspiration since they tend to be rather long and “brittle”.
Loading up the data
My Python scripts nearly always start with the following imports:
import lxml.html
import requests
import requests_cache

requests_cache.install_cache('demo_cache')
requests and requests_cache to access data on the web and lxml.html to parse the HTML. Then I can get a webpage using:
r = requests.get(url)
root = lxml.html.fromstring(r.content)
You might want to make any URLs absolute rather than relative:
root.make_links_absolute(base_url)
If I’m dealing with XML rather than HTML then I might do:
from lxml import etree
And then when it came to loading in a local XML file:
with open(input_file, "rb") as f:
    root = etree.XML(f.read())
XPath queries
With your root element in hand you can now get on with querying. XPath queries are designed to extract a set of elements or attributes from an XML/HTML document by the name of the element, the value of an attribute on an element, by the relationship an element has with another element or by the content of an element.
Quite often xpath will return elements or lists of elements which, when printed in Python, don't show you the content you want to see. To get the text content of an element you need to use `.text`, `text_content()`, or `.tail`, and make sure you ask for an element of the list rather than the whole list.
The following examples show the key features of xpath. I'm using this blog (http://www.ianhopkinson.org.uk/) as an example website so you can play along with xpath:
Specifying a complete path with / as separator
title = root.xpath('/html/body/div/div/div[2]/h1')
is the full path to my blog title. Notice how we request the second of the div elements at the third level of nesting using div[2] – xpath indexing is one-based, not zero-based.
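A self-contained sketch of the same query, run against a stand-in page (the markup here is invented to mimic the blog's structure, not copied from it):

```python
from lxml import html

# Hypothetical page mimicking the blog's nesting: body > div > div > [div, div]
page = html.fromstring("""
<html><body>
  <div><div>
    <div><h1>first heading</h1></div>
    <div><h1>My Blog Title</h1></div>
  </div></div>
</body></html>""")

# div[2] picks the *second* div at that level: xpath indexing is one-based
title = page.xpath('/html/body/div/div/div[2]/h1')
print(title[0].text)  # 'My Blog Title'
```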
Specifying a path with wildcards using //
This expression also finds the title but the preamble of /html/body/div/div is absorbed by the // wildcard match:
title = root.xpath('//div[2]/h1')
To obtain the text of the title in Python, rather than an element object, we would do:
title_text = title[0].text.strip()
or maybe title_text = title[0].text_content().strip()
text_content() would pick up any tail content, and any text in child elements. I use strip() here to remove leading and trailing whitespace.
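The difference between `.text` and `text_content()` is easiest to see on a heading that contains a child element (again, invented markup for illustration):

```python
from lxml import html

# A heading with a child <span> and some tail text after it
page = html.fromstring(
    '<html><body><h1> SomeBeans <span>extra</span> tail </h1></body></html>')
h1 = page.xpath('//h1')[0]

# .text stops at the first child element
print(h1.text.strip())            # 'SomeBeans'

# text_content() concatenates child text and tails as well
print(h1.text_content().strip())  # 'SomeBeans extra tail'
```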
Selecting attribute values
We've seen that //element selects all of the elements of type “element”. We select attribute values like this:
ids = root.xpath('//li/@id')
which selects the id attribute from the list item elements (li) on my blog.
Specifying an element by attribute
We can select elements which have particular attribute values:
tagcloud = root.xpath('//*[@class="tagcloud"]')
this selects the tag cloud on my blog by selecting elements which have the class attribute “tagcloud”.
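Both attribute tricks together in one runnable sketch (the li ids and class names here are made up, standing in for the blog's real markup):

```python
from lxml import html

# Hypothetical fragment with id attributes and a classed element
page = html.fromstring("""
<html><body>
  <ul>
    <li id="post-1">first post</li>
    <li id="post-2">second post</li>
  </ul>
  <div class="tagcloud">tags here</div>
</body></html>""")

# //li/@id selects attribute *values*; xpath returns them as strings
ids = page.xpath('//li/@id')
print(ids)  # ['post-1', 'post-2']

# [@class="tagcloud"] selects *elements* by attribute value
tagcloud = page.xpath('//*[@class="tagcloud"]')
print(tagcloud[0].text)  # 'tags here'
```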
Select an element containing some specified text
We can do something similar with the text content of an element:
title = root.xpath('//h1[contains(., "SomeBeans")]')
This selects h1 elements which contain the text “SomeBeans”.
Select via a parent or sibling relationship
Sometimes we want to select elements by their relationship to another element, for example:
subtitle = root.xpath('//h1[contains(@class,"header_title")]/../h2')
this selects the h1 title of my blog (SomeBeans), then navigates to the parent with .. and selects the sibling h2 element (the subtitle “the makings of a small casserole”).
The same effect can be achieved with the following-sibling keyword:
subtitle = root.xpath('//h1[contains(@class,"header_title")]/following-sibling::h2')
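Both navigation routes, demonstrated on a small stand-in for the blog's header (markup invented for the example):

```python
from lxml import html

page = html.fromstring("""
<html><body><div>
  <h1 class="header_title">SomeBeans</h1>
  <h2>the makings of a small casserole</h2>
</div></body></html>""")

# Up to the parent with .. then down to the sibling h2
subtitle = page.xpath('//h1[contains(@class,"header_title")]/../h2')
print(subtitle[0].text)  # 'the makings of a small casserole'

# following-sibling:: reaches the same element without going via the parent
subtitle2 = page.xpath(
    '//h1[contains(@class,"header_title")]/following-sibling::h2')
print(subtitle2[0].text == subtitle[0].text)  # True
```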
XML Namespaces
When dealing with XML, we need to worry about namespaces. In principle the elements of an XML document are described in a schema which can be looked up and is universally unique. In practice the use of namespaces in XML documents can lead to much banging head against wall! This is largely because trivial examples of XML wrangling don’t use namespaces, except as a “special” example.
Here is a fragment of XML defining two namespaces:
<foo:Results xmlns:foo="http://www.foo.com" xmlns="http://www.bah.com">
xmlns:foo defines a namespace whose short form is “foo”, we select elements in this space using a namespace parameter to the xpath query:
records = root.xpath('//foo:Title', namespaces = {"foo": "http://www.foo.com"})
The “catch” here is that we also define a default namespace, xmlns="http://www.bah.com", which means that elements which don't have a prefix cannot be selected unless we define the namespace in our xpath:
records = root.xpath('//bah:Title', namespaces = {"bah": "http://www.bah.com"})
Worse than that we need to include our namespace prefix in the query, even though it doesn’t appear in the file!
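Here is the whole namespace dance in one runnable sketch. The Title elements are invented to complete the fragment from the text into a well-formed document:

```python
from lxml import etree

# The fragment from the text, completed with two hypothetical Title elements
xml = '''<foo:Results xmlns:foo="http://www.foo.com" xmlns="http://www.bah.com">
  <foo:Title>in the foo namespace</foo:Title>
  <Title>in the default (bah) namespace</Title>
</foo:Results>'''
root = etree.fromstring(xml)

# Prefixed elements: map the prefix to its namespace URI in the query
foo_titles = root.xpath('//foo:Title',
                        namespaces={"foo": "http://www.foo.com"})
print(foo_titles[0].text)  # 'in the foo namespace'

# Unprefixed elements sit in the *default* namespace - we must still invent
# a prefix for the query, even though none appears in the file
bah_titles = root.xpath('//bah:Title',
                        namespaces={"bah": "http://www.bah.com"})
print(bah_titles[0].text)  # 'in the default (bah) namespace'
```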
Conclusion
These snippets cover the majority of the xpath queries I’ve needed over the past few years, I’ll add any others as I find them. I’ve put all the code used here in a GitHub gist.
XPath is the right tool for the job of extracting information from XML documents, including HTML – do not accept inferior alternatives!
Nov 14 2015
Book Review: Canals: The making of a nation by Liz McIvor
Canals: The making of a Nation by Liz McIvor is a tie-in with a BBC series of the same name, presented by the author. It is about canals in England from the mid-18th century through to the present day although most of the action takes place before the end of the 19th century.
The chapters of the book match the episodes of the series which are thematic, rather than chronological. Each chapter introduces a different topic, loosely tied to a particular canal.
The book starts with a discussion of the growth of London, and the Grand Junction canal linking it to Birmingham. The guild system was a factor in limiting the growth of the capital until the mid-18th century. The “Bubble Act” of 1720, enacted in the aftermath of the South Sea Bubble, likely also had an impact: it prevented the formation of any joint stock company without an act of parliament to approve it. It was repealed in 1825, before the railways saw their enormous growth. The Grand Junction canal was built as Birmingham became a manufacturing hub and London a great city with many requirements for daily life, and also a showroom to at least the United Kingdom, if not the world.
I was chastened to discover that the Bridgewater canal, one of the earliest of the “canal boom” projects of the 18th century, is only just up the road from me in Chester. I'd always assumed it was close to the town of a very similar name in Somerset! Bridgewater is named for the Duke of Bridgewater, Francis Egerton, for whom the pub just over the road from me is presumably also named. The Bridgewater canal was built around 1760, linking the Duke's coal mines at Worsley to Manchester. With this revelation I realise that the Bridgewater canal and the Liverpool to Manchester railway, the first exclusively steam railway, are sited very close to each other.
Support for manufacture was the theme of canal building in the North of England, and also around Birmingham, with canals built to move bulky raw materials to factories placed to benefit from hydraulic power, and benevolent climates for the processing of materials such as cotton. Manufacturers such as Josiah Wedgwood were keen to see their fragile wares safely make the outward journey to the showrooms of London.
The Kennet and Avon canal was built to provide navigable water access from Bristol to London. William Smith, who produced the first geological map of Great Britain, is introduced in this chapter; I read more about him in The Map that Changed the World by Simon Winchester. The digging of canal cuts and tunnels revealed the local geology. Nowadays we see canals as bucolic thoroughfares, but when they were built they were raw cuts indicating industrialisation.
The Manchester Ship Canal was opened in 1894 to bypass the port of Liverpool, these were the dying days of canal building. 154 died in its construction and 1404 were seriously injured from a workforce of 16,361. For comparison, projects such as the 2012 London Olympics and the close-to-completion Crossrail project are of similar scale yet have casualty numbers hovering around zero although these are best-in-class projects for health and safety. In this chapter McIvor talks more of the Irish “navigators” who built the canals, and something of the early trade union movement.
The families that worked the canals were seen as outsiders, once the long networks were set up they led an itinerant lifestyle with no fixed church or school for their children. The Victorian moralists arguing for improved conditions for the boat families seem to do so from the point of view of pointing out how bloody awful they were!
It’s interesting to see the likes of Thomas Telford and John Rennie cropping up repeatedly in this book. They have the air of rockstar engineers, not a niche found these days. Perhaps this is a result of the work of the Victorian writer, Samuel Smiles, who was very keen on self-improvement and wrote biographies of these men to promote his ideas.
To me the book lacks a little prehistory, the great boom for canal building in the UK was at the end of the 18th century but the very first “pound lock” in England was built in 1566 on the Exeter canal. What went on between these two times? And what was happening elsewhere in the world? Perhaps the answer here is that the canals in Britain never represented a technological revolution, they were always about the social and commercial climate being right.
Canals: The Making of a Nation is an unchallenging read, well-suited to a holiday. If you’re on a canal boat it won’t tell you much about the particular bridges and tunnels you pass over but it will give you a strong feeling for the lives of the people that built and used the canals, and why they were built in the first place.