Category: Book Reviews

Reviews of books featuring a summary of the book and links to related material

Book review: Mining the Social Web by Matthew A. Russell


This review was first published at ScraperWiki.

The Twitter search and follower tools are amongst the most popular on the ScraperWiki platform, so we are looking to provide more value in this area. To this end I’ve been reading “Mining the Social Web” by Matthew A. Russell.

In the first instance the book looks like a run through the APIs for various social media services (Twitter, Facebook, LinkedIn, Google+, GitHub etc.) but after the first couple of chapters, on Twitter and Facebook, it becomes obvious that it is more subtle than that. Each chapter also includes material on a data mining technique; for Twitter it is simply counting things. The Facebook chapter introduces graph analysis, a theme extended in the chapter on GitHub. Google+ is used as a framework to introduce term frequency-inverse document frequency (TF-IDF), an information retrieval technique and a basic, but effective, way to process natural language. Web page scraping is used as a means to introduce some more ideas about natural language processing and summarisation. Mining mailboxes uses a subset of the Enron mail corpus to introduce MongoDB as a document storage system. The final chapter is a Twitter cookbook which includes lots of short recipes for simple Twitter-related activities but no further analysis. The coverage of each topic isn’t deep but it is practical, introducing the key libraries to do the tasks. And it’s alive with suggestions for further work, and references to help with that.
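Russell uses off-the-shelf libraries for TF-IDF; as a rough from-scratch sketch of the idea (the toy documents below are mine, not the book’s): a term is weighted up when it is frequent within a document and weighted down when it appears in many documents.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each term in each tokenised document by term frequency
    times (log) inverse document frequency."""
    n = len(docs)
    df = Counter()                      # number of documents containing each term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({term: (count / len(doc)) * math.log(n / df[term])
                       for term, count in counts.items()})
    return scores

# Toy corpus: a term spread across documents ("social") scores lower
# than a term confined to a single document ("scraping").
docs = [["social", "web", "mining"],
        ["social", "graph", "analysis"],
        ["web", "scraping"]]
scores = tf_idf(docs)
```

A term appearing in every document would get an IDF of log(1) = 0, which is exactly the “stop word” suppression that makes the technique effective despite its simplicity.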

The examples in the book are provided as IPython Notebooks which are supplied, along with a Notebook server on a virtual machine, from a GitHub repository. IPython notebooks are interactive Python sessions run through a browser interface. Content is divided into cells which can either be code or simple descriptive text. A code cell can be executed and the output from the code appears in an output cell. These notebooks are a really nice way to present example code since the code has some context. The virtual machine approach is also a great innovation since configuring Python libraries and the IPython server itself, in a platform-agnostic manner, is really difficult, and this solution bypasses most of those problems. The system makes it incredibly easy to run the example code for yourself – almost too easy, in fact: I found myself clicking blindly through some of the example code. Potentially the book could have been presented simply as an IPython notebook; this is likely not economically practical, but it would be nice to collect the links to further reading there, where they would be more usable. The GitHub repository also provides a great place for interaction with the author: I filed a couple of issues regarding setting the system up and he responded unerringly quickly – as he did for many other readers. Also I discovered incidentally, through being subscribed to the repository, that one of the people I follow on Twitter (and a guest blogger here) was also reading the book. An interesting example of the social web in action!

Mining the Social Web covers some material I had not come across in my earlier machine learning/data mining reading. There are a couple of chapters containing material on graph theory, using data from Facebook and GitHub. One benefit of reading about the same material in different places: Russell highlights that clustering and de-duplication are, of course, facets of the same subject.

I read with interest the section on using a MongoDB database as a store for tweets and other data in the form of JSON objects. Currently I am bemused by MongoDB. The ScraperWiki platform uses it to store user profile information, and I have occasional recourse to try to look things up there. I’ve struggled to see the benefit of MongoDB over a SQL database, particularly having watched two of my colleagues spend a morning working out how to do what would be a simple SQL join in MongoDB. Mining the Social Web has made me wonder about giving MongoDB another chance.
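For what it’s worth, the kind of join my colleagues struggled with is a one-liner in SQL, whereas with a document store of that era you typically fetched both collections and joined them in application code. A minimal sketch with made-up users and tweets (the table and field names are mine):

```python
import sqlite3

# In SQL the join is declarative:
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT);
    CREATE TABLE tweets (user_id INTEGER, text TEXT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO tweets VALUES (1, 'hello'), (1, 'world'), (2, 'hi');
""")
sql_result = conn.execute("""
    SELECT u.name, t.text FROM users u
    JOIN tweets t ON t.user_id = u.id
""").fetchall()

# The document-store equivalent: pull both collections into the
# application and join by hand via a lookup dictionary.
users = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
tweets = [{"user_id": 1, "text": "hello"},
          {"user_id": 1, "text": "world"},
          {"user_id": 2, "text": "hi"}]
names = {u["id"]: u["name"] for u in users}
doc_result = [(names[t["user_id"]], t["text"]) for t in tweets]

assert sorted(sql_result) == sorted(doc_result)
```

Both routes produce the same rows; the difference is where the work lives – in the query engine or in your own code.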

The penultimate chapter is a discussion of the semantic web, introducing both microformats and RDF technology, although the discussion is much less concrete than in earlier chapters. Microformats are HTML elements which hold semantic information about a page using an agreed schema; to give an example, the geo microformat encodes geographic information. In the absence of such a microformat, geographic information such as latitude and longitude could be encoded in pretty much any way, making it necessary to either use custom scrapers on a page-by-page basis or complex heuristics to infer the presence of such information. RDF is one of the underpinning technologies for the semantic web: a shorthand for a worldwide web marked up such that machines can understand the meaning of webpages. This touches on the EU Newsreader project, on which we are collaborators, and which seeks to generate this type of semantic mark-up for news articles using natural language processing.
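To see why an agreed schema helps a scraper, here is a minimal sketch of extracting the geo microformat with Python’s standard-library HTML parser (the example markup is mine; real pages carry more structure):

```python
from html.parser import HTMLParser

class GeoParser(HTMLParser):
    """Pull latitude/longitude out of the geo microformat, where
    coordinates sit in elements with class "latitude"/"longitude"."""
    def __init__(self):
        super().__init__()
        self._field = None
        self.coords = {}

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if "latitude" in classes:
            self._field = "latitude"
        elif "longitude" in classes:
            self._field = "longitude"

    def handle_data(self, data):
        # Record the text content of the element we are inside.
        if self._field and data.strip():
            self.coords[self._field] = float(data.strip())
            self._field = None

html = ('<span class="geo"><span class="latitude">52.4862</span>; '
        '<span class="longitude">-1.8904</span></span>')
parser = GeoParser()
parser.feed(html)
```

Because every page using the microformat labels its coordinates the same way, one small parser covers them all – the whole point of the scheme.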

Overall, definitely worth reading. We’re interested in extending our tools for social media and with this book in hand I’m confident we can do it and be aware of more possibilities.

Book review: Data Mining – Practical Machine Learning Tools and Techniques by Witten, Frank and Hall


This review was first published at ScraperWiki.

I’ve been doing more reading on machine learning, this time in the form of Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten, Eibe Frank and Mark A. Hall. This comes by recommendation of my academic colleagues on the Newsreader project, who rely heavily on machine learning techniques to do natural language processing.

Data mining is about finding structure in data, and the algorithms for doing this are found in the field of machine learning. The classic example is the Iris flower dataset. This dataset contains measurements of parts of a flower for three different species of Iris; the challenge is to build a system which classifies a flower to its species by its measurements. More practical examples are in the diagnosis of machine faults, credit assessment, detection of oil slicks, customer support analysis, marketing and sales.
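As a toy illustration of the classification task, here is a one-nearest-neighbour classifier built from scratch: classify a flower as the species of the most similar training example. The measurements below are made up for the sketch, merely in the style of the Iris data:

```python
def nearest_neighbour(train, query):
    """Classify query by the label of the closest training point
    (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, label = min(train, key=lambda row: dist(row[0], query))
    return label

# (sepal length, petal length) -> species; values invented for illustration.
train = [((5.1, 1.4), "setosa"),
         ((7.0, 4.7), "versicolor"),
         ((6.3, 6.0), "virginica")]

species = nearest_neighbour(train, (5.0, 1.5))  # closest to the setosa example
```

Real systems use many training points per class and worry about scaling the features, but the shape of the problem – measurements in, species out – is exactly this.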

Previously I’ve reviewed Machine Learning in Action by Peter Harrington. Data Mining is a somewhat different book. The core contents are quite similar: background to machine learning, evaluating your results and a run through the core algorithms. Machine Learning in Action is a pretty quick run through the field touching on many subjects, with toy demonstrations built from scratch in Python. Data Mining, running to almost 600 pages, is a much more thorough reference. There is a place for both types of book, even on the same bookshelf.

Data Mining is written by three members of the University of Waikato’s Computer Science Department and is based around the Weka machine learning system developed there. Weka is a complete framework, written in Java, which implements the algorithms described in the book as well as some others. Weka can be accessed via the command-line or using a GUI. As well as the machine learning algorithms there are systems for preparing data, evaluating and visualising results. A collection of well-known demonstration data sets is included. I’ve no reason to doubt the quality of the implementations in Weka; the GUI interface is functional, if occasionally puzzling and not particularly slick. The book stands alone from the Weka framework, but the framework provides a good playground in which to try out the techniques discussed, and Weka seems entirely suitable for conducting serious analysis – in contrast to the toy Python implementations Harrington provides in Machine Learning in Action.

The first two parts of the book provide an overview of machine learning, followed by a more detailed look at how the key algorithms are implemented. The third section is dedicated to Weka, whilst the first two sections refer to it but do not rely on it. The third section is divided into a discussion of Weka, covering all its key features and then a tutorial. I found this a bit confusing since the first part has the air of a tutorial, but isn’t, and the tutorial part keeps referring back to the overview section for its screenshots.

Coming to the book with some machine learning knowledge already, the things I learned from it were:

  • better methods, and subtleties in measuring the performance of machine learning algorithms;
  • the success of the one-rule (1R) algorithm, essentially a decision tree which gets the maximum benefit from a single rule. It turns out such an approach is surprisingly effective, and only bettered a little, if at all, by more sophisticated algorithms;
  • getting enough, clean data to do machine learning is often a problem;
  • where to learn more!
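The 1R idea is simple enough to sketch in a few lines: for each attribute, predict the majority class for each of its values, then keep the attribute whose rule makes the fewest training errors. A minimal version, with made-up weather-style data rather than the book’s:

```python
from collections import Counter, defaultdict

def one_rule(rows, labels):
    """1R: build a one-attribute rule per attribute, keep the best."""
    best = None
    for attr in range(len(rows[0])):
        # Count class labels for each value of this attribute.
        by_value = defaultdict(Counter)
        for row, label in zip(rows, labels):
            by_value[row[attr]][label] += 1
        # The rule predicts the majority class for each value.
        rule = {value: counts.most_common(1)[0][0]
                for value, counts in by_value.items()}
        errors = sum(label != rule[row[attr]]
                     for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, attr, rule)
    _, attr, rule = best
    return attr, rule

# Toy data: (outlook, temperature) -> play yes/no
rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"),
        ("rain", "cool"), ("overcast", "hot")]
labels = ["no", "no", "yes", "yes", "yes"]
attr, rule = one_rule(rows, labels)
```

Here the outlook attribute alone classifies the training data perfectly, which is the point the book makes: one well-chosen rule often goes a surprisingly long way.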

The first edition of this book was published in 1999; my review is of the third edition. The book does show some signs of age. Machine Learning in Action was written as a response to a poll published at the International Conference on Data Mining 2006 on the 10 most important machine learning algorithms (see the paper here). Whilst Data Mining mentions this survey, it does so as something of an afterthought, and the authors seem bemused by the inclusion of the PageRank algorithm used by Google to rank web pages in search results. They mention the MOA framework for data stream mining, which focuses on techniques for large datasets, although they do not discuss it in any detail.

In summary: a well-written, well-structured and readable book on machine learning algorithms with demonstrations based on an extensive machine learning framework. Definitely one to read and come back to for reference.

Book review: Tableau 8 – the official guide by George Peck

This review was first published at ScraperWiki.

A while back I reviewed Larry Keller’s book The Tableau 8.0 Training Manual; at the same time I ordered George Peck’s book Tableau 8: The Official Guide, which has just arrived. The book comes with a DVD containing example workbooks and bonus videos featuring George Peck’s warm, friendly tones. I must admit to being mildly nonplussed at receiving optical media, my ultrabook lacking an appropriate drive, but I dug out a USB optical drive to load them up. Providing an online link would have allowed the inclusion of up-to-date material, perhaps covering the version 8.1 announcement.

Tableau is a data visualisation application, aimed at the business intelligence area and optimised to look at database shaped data. I’m using Tableau on a lot of the larger datasets we get at ScraperWiki for sense checking and analysis.

Colleagues have noted that analysis in Tableau looks like me randomly poking buttons in the interface. From Peck’s book I learn that the order in which I carry out random clicking is important since Tableau will make a decision on what you want to see based both on what you have clicked and also its current state.

To my mind the heavy reliance on the graphical interface is one of the drawbacks of Tableau but, clearly, to business intelligence users and journalists it’s the program’s greatest benefit. It’s a drawback because capturing what you’ve done in a GUI is tricky. Some scripting and version control capability is retained, though, since most Tableau files are in plain XML format, and a little fiddling with that XML is tacitly approved by Tableau – although you won’t find such information in The Official Guide. I’ve been experimenting with using git source control on workbook files, and it works.
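Because workbooks are plain XML, they are also straightforward to inspect programmatically, which is part of why version control on them is useful. A sketch using a heavily trimmed stand-in for a workbook file (the elements shown are a simplification of mine, not the full .twb schema):

```python
import xml.etree.ElementTree as ET

# A stand-in for a Tableau workbook; real files carry far more
# structure, but the point is that it is plain, diffable XML.
twb = """<workbook>
  <datasources>
    <datasource name='orders.csv'/>
  </datasources>
  <worksheets>
    <worksheet name='Sales by region'/>
    <worksheet name='Monthly trend'/>
  </worksheets>
</workbook>"""

root = ET.fromstring(twb)
sheets = [ws.get("name") for ws in root.iter("worksheet")]
```

A diff of two versions of such a file shows meaningfully which worksheet or datasource changed, in a way a binary format never could.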

If you’re interested in these more advanced techniques then the Tableau Knowledgebase is worth a look. See this article, for example, on making a custom colour palette. I also like the Information Lab blog posts 5 things I wish I knew about Tableau when I started and UK Area Polygon Mapping in Tableau. The second post covers one of the bugbears for non-US users of Tableau: the mapping functionality is quite US-centric.

Peck covers most of the functionality of Tableau, including data connections, making visualisations, a detailed look at mapping, dashboards and so forth. I was somewhat bemused to see the scatter plot described as “esoteric”. This highlights the background of those typically using Tableau: business people not physical scientists, and not necessarily business people who understand database query languages. Hence the heavy reliance on a graphical user interface.

I particularly liked the chapters on data connections which also described the various set, group and combine operations. Finally I understand the difference between data blending and data joining: joining is done at source between tables on the same database whilst blending is done on data from different sources by Tableau, after it has been loaded. The end result is not really different.

I now understand the point of table calculations – they’re for the times when you can’t work out your SQL query. Peck uses different language from Tableau in describing table calculations. He uses “direction” to refer to the order in which cells are processed and “scope” to refer to the groups over which cell calculations are performed. Tableau uses the terms “addressing” and “partitioning” for these two concepts, respectively.
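The partitioning/addressing distinction can be pictured in plain Python: partition the rows into groups (Peck’s “scope”), then accumulate along an ordering within each group (his “direction”). A running-total table calculation on made-up sales data, for illustration only:

```python
from itertools import groupby

# Rows of (region, month, sales), already sorted by region then month.
rows = [("North", 1, 10), ("North", 2, 15), ("North", 3, 5),
        ("South", 1, 20), ("South", 2, 10)]

# Partitioning ("scope"): group by region.
# Addressing ("direction"): accumulate along months within each group.
running = []
for region, group in groupby(rows, key=lambda r: r[0]):
    total = 0
    for _, month, sales in group:
        total += sales
        running.append((region, month, total))
```

The running total resets at each new region – exactly what choosing a partition does to a table calculation in Tableau.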

Peck isn’t very explicit about the deep connections between SQL and Tableau but makes sufficient mention of the underlying processes to be useful.

It was nice to see a brief, clear description of the options for publishing Tableau workbooks. Public is handy and free if you want to publish to all. Tableau Online presents a useful halfway house for internal publication whilst Tableau Server gives full flexibility in scheduling updates to data and publishing to a range of audiences with different permission levels. This is something we’re interested in at ScraperWiki.

The book ends with an Appendix of functions available for field calculations.

In some ways Larry Keller’s and George Peck’s books complement each other: Larry’s book (which I reviewed here) contains the examples that George’s lacks, and George’s contains some of the more in-depth discussion missing from Larry’s.

Overall: a nicely produced book with high production values, and good, but not encyclopedic, coverage.

Book Review: Backroom Boys by Francis Spufford

Electronic books bring many advantages, but for a lengthy journey to Trento a paper book seemed more convenient. So I returned to my shelves to pick up Backroom Boys: The Secret Return of the British Boffin by Francis Spufford.

I first read this book quite some time ago. It tells six short stories of British technical innovation, in the same vein as Empire of the Clouds and A Computer Called LEO: perhaps a little nationalistic, and regretful of opportunities lost.

The first of the stories is of the British space programme after the war. It starts with the disturbing picture of members of the British Interplanetary Society celebrating the fall of a V2 rocket on London. This leads on to a brief discussion of Blue Streak – Britain’s ICBM, scrapped in favour of the American Polaris missile system. As part of the Blue Streak programme a rocket named Black Knight was developed to test re-entry technology; from this grew the Black Arrow – a rocket to put satellites into space.

In some ways Black Arrow was a small, white elephant from the start. The US had offered the British free satellite launches. Black Arrow was run on a shoestring budget, kept strictly as an extension of the Black Knight rocket and hence rather small. The motivation for this was nominally that it could be used to gain experience for the UK satellite industry and provide an independent launch system for the UK government, perhaps for things they wished to keep quiet. Ultimately it launched a single test satellite into space, which is still orbiting the Earth now. However, it was too small to launch the useful satellites of the day, and growing it would have required complete redevelopment. The programme was cancelled in 1971.

Next up is Concorde, which could probably better be described as a large, white elephant. Developed in a joint Anglo-French programme into which the participants were mutually locked, it burned money for nearly two decades before the British part was taken on by British Airways, who used it to enhance the prestige of their brand. As a workhorse commercial jet it was a poor choice: too small, too thirsty, and too loud.

But now for something more successful! Long ago there existed a home computer market in the UK, populated by many and various computers. First amongst these early machines was the BBC Micro, for which the first blockbuster game, Elite, was written by two Cambridge undergraduates (David Braben and Ian Bell). I played Elite in one of its later incarnations – on an Amstrad CPC464. Elite was a space trading and fighting game with revolutionary 3D wireframe graphics and complex gameplay. And it all fitted into 22KB – the absolute maximum memory available on the BBC Micro. The cunning required to build multiple universes in such a small space, and the battles to gain a byte here and a byte there to add another feature, are alien to the modern programmer’s eyes. At the time Acornsoft were publishing quite a few games, but Elite was something different: they’d paid for the development, which took an unimaginable 18 months or so, and when it was released there was a launch event at Alton Towers and the game came out in a large box stuffed with supporting material. All of this was a substantial break with the past. Ultimately the number of copies of Elite sold for the BBC Micro approximately matched the number of BBC Micros sold – an apparent market saturation.

Success continues with the story of Vodafone – one of the first two players in the UK mobile phone market. The science here is in radio planning: choosing where to place your masts for optimal coverage. Vodafone bought handsets from Panasonic and base stations from Ericsson. Interestingly, Europe and the UK had a lead over the US in digital mobile networks: agreeing the GSM standard gave instant access to a huge market, whilst in the US 722 franchises were awarded with no common digital standard.

Moving out of the backroom a little is the story of the Human Genome Project, principally the period after Craig Venter announced he was going to sequence the human genome faster than the public effort – then sell it! This effort was stymied by the Wellcome Trust, who put a great deal of further money into the public effort. Genetic research has a long history in the UK, but the story here is one of industrial-scale sequencing, quite different from conventional lab research, and of the power of the world’s second largest private research funder (the largest is currently the Bill & Melinda Gates Foundation).

The final chapter of the book is on the Beagle 2 Mars lander, built quickly, cheaply and with the huge enthusiasm and (unlikely) fund-raising abilities of Colin Pillinger. Sadly, as the Epilogue records, the lander became a high-velocity impactor – nothing was heard from it after it left the Mars orbiter which had brought it from Earth.

The theme of the book is the innate cunning of the British, but if there’s a lesson to be learnt it seems to be that thinking big is a benefit. Elite, the mobile phone network and the Human Genome Project were the successes of this book; Concorde was a technical wonder but an economic disaster, and Black Arrow and Beagle 2 suffered from being done on a shoestring budget.

Overall I enjoyed Backroom Boys; it reminded me of my childhood, with Elite and the coming of mobile phones. It’s more a celebration than a dispassionate view, but there’s no harm in that.

Book Review: Georgian London–Into the Streets by Lucy Inglis

I saw the gestation of Georgian London: Into the Streets by Lucy Inglis, so now it is born – I had to buy it!

Lucy Inglis has been blogging about Georgian London for much of the last four years, and I have been reading since then. Her focus is the stories of everyday folk, little snippets from contemporary records surrounded by her extensive knowledge of the period.

The book starts with some scene-setting, in particular the end of the Restoration (1660), the Plague (1665), the Great Fire of London (1666) and the Glorious Revolution (1688). These events set the stage for the Georgian period, which covers the years 1714 to 1837: named for the succession of King Georges who reigned through it, with its end marked by the death of William IV (don’t ask me).

London is then covered geographically, using John Rocque’s rather fabulous 1746 map as ornamentation. What is obvious, even to those such as myself who are broadly ignorant of the geography of London, is how much smaller London was then. Areas such as Islington, which I consider to be in the heart of London, were on the edge of the city at the time – rural locations with farming and so forth. The period saw a huge expansion in the city, from a population of 500,000 at the beginning of the period to 1.5 million by 1831, with much of the growth occurring in the second half of the 18th century.

Georgian London is somewhat resistant to my usual style of “review”, which combines the usual elements of review with a degree of summary to remind me of what I read. Essentially there is just too much going on for summarising to work! So I will try some sort of vague, impressionistic view:

It struck me how the nature of poverty changed with urbanisation; prior to a move to the city the poor could rely to some extent on the support of their parish, but moving to London broke these ties and, particularly for women supporting children, this led to destitution. Men could easily travel to find work, either back home or elsewhere – a woman with a child couldn’t do this.

The role of the state was rather smaller than it is now: when the time came to build Westminster Bridge there was no government funding, but rather a series of lotteries. The prize for one of these was the Jernegan cistern, a wine container made from a quarter of a ton of silver, with a capacity of 60 gallons! Another indicator of the smaller size of the state is that in 1730 a quarter of state income was from tax on alcohol, much of it on gin. Currently alcohol duties account for about £10 billion per year, which is about 1.5% of total government spending.

Businesswomen make regular appearances through the book, such as Elinor James, who was the widow of a printer, Thomas James, but published under her own name. She was both a speaker and a pamphleteer, working at the beginning of the 18th century. At the end of the century the younger Eleanor Coade was running a thriving business making artificial stone (Coade stone). She’d first come to London in 1769, with her mother, also Eleanor, following the death of her father.

At the same time that a quarter of all government revenue came from alcohol duties, a quarter of all gin distillers were women. Alcohol caused many social problems, particularly in the second quarter of the 18th century, as recorded in Hogarth’s “Gin Lane”. The vice of the upper classes in the second half of the 18th century was gambling.

The Tower of London housed exotic animals for many years, providing a money-raising visitor attraction through the Georgian period, only losing its status in 1835 on the creation of London Zoo in Regent’s Park. A few years earlier, in 1832, the Tower of London hosted 280 beasts of varying types, but it was becoming clear it was an unsuitable location to keep animals. The British were also becoming more aware of animal cruelty, with animal baiting becoming less popular through the Georgian period – culminating in the Act to Prevent the Cruel and Improper Treatment of Cattle in 1822, and the formation of the RSPCA a couple of years later.

It seems useful to know that London’s first street numbers were introduced in 1708.

The voice of the book is spot-on, conversational but authoritative, providing colour without clumsiness. There are no footnotes but there are extensive notes at the end of the book, along with a bibliography. For someone trying to write a blog post like this, the index could do with extension!

It’s difficult to write a review of a book by someone you know, all I can say is that if I didn’t like it I would have not written this. Don’t just take it from me – see what the Sunday Sport thought!