This post was first published at ScraperWiki.
I bought Natural Language Processing with Python by Steven Bird, Ewan Klein & Edward Loper for a couple of reasons. Firstly, ScraperWiki are part of the EU Newsreader Project, which seeks to make a “history recorder” using natural language processing to convert large streams of news articles into a more structured form. ScraperWiki’s role in this project is to scrape open sources of news-related material, such as parliamentary records, and to drive exploitation of the results of this work both commercially and through our contacts in the open source community. Although we’re not directly involved in the natural language processing work, it seems useful to get a better understanding of the area.
Secondly, I’ve recently given a talk at Data Science London, and my original interpretation of the brief was that I should talk a bit about natural language processing. I know little of this subject so thought I should read up on it. As it turned out, no natural language processing was required on my part.
This is the book of the Natural Language Toolkit Python library, which contains a wide range of linguistic resources, methods for processing those resources, methods for accessing new resources and small applications to give a user-friendly interface for various features. In this context “resources” means the full text of various books, corpora (large collections of text which have been marked up to varying degrees with grammatical and other data) and lexicons (dictionaries and the like).
Natural Language Processing is didactic: it is intended as a text for undergraduates, with extensive exercises at the end of each chapter. As well as teaching the fundamentals of natural language processing, it also seeks to teach readers Python. I found this second theme quite useful. I’ve been programming in Python for quite some time but my default style is FORTRANIC. The authors are a little scornful of this approach; they present some code I would have been entirely happy to write and describe it as little better than machine code! Their presentation of Python starts with list comprehensions, which is unconventional, but goes on to cover the language more widely.
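To give a flavour of the difference in style, here is a minimal sketch of my own (not code from the book): the same calculation written as an index-driven loop and as a list comprehension. The word list is invented for illustration.

```python
# A hypothetical word list, used only for illustration.
words = ["The", "quick", "brown", "fox", "jumps"]

# Index-driven loop, roughly how a Fortran programmer might write it.
lengths_loop = []
for i in range(len(words)):
    lengths_loop.append(len(words[i]))

# The same result expressed as a list comprehension.
lengths_comp = [len(w) for w in words]

print(lengths_loop == lengths_comp)
```

Both versions build the same list of word lengths; the comprehension simply says what is wanted without the bookkeeping of an index variable.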
The natural language processing side of the book progresses from the smallest language structures (the structure of words), to part of speech labelling, then to phrases and sentences, and ultimately to deriving logical statements from natural language.
Perhaps surprisingly, tokenization and segmentation, the processes of dividing text into words and sentences respectively, are not trivial. For example, acronyms may contain full stops which are not sentence terminators. Less surprisingly, part of speech (POS) tagging (i.e. labelling words as verb, noun, adjective etc.) is more complex, since words become different parts of speech in different contexts. Even experts sometimes struggle with part of speech labelling. The process of chunking, that is identifying noun and verb phrases, is of a similar character.
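The full stop problem is easy to demonstrate. The following is a rough sketch using only the standard library rather than the Natural Language Toolkit; the function names, the abbreviation list and the example text are all my own assumptions, chosen to show why splitting on every full stop goes wrong.

```python
import re

# Hypothetical abbreviation list; a real segmenter would need far more than this.
ABBREVIATIONS = {"u.s.", "dr.", "e.g.", "i.e."}


def naive_split(text):
    """Split wherever '.', '!' or '?' is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def abbreviation_aware_split(text):
    """Split on sentence-final punctuation unless the token is a known abbreviation."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        ends_with_stop = token[-1] in ".!?"
        if ends_with_stop and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that lacks a terminator
        sentences.append(" ".join(current))
    return sentences


text = "Dr. Smith flew to the U.S. embassy. She arrived on Monday."
print(naive_split(text))
print(abbreviation_aware_split(text))
```

The naive version breaks the text into four fragments, while the abbreviation-aware version recovers the two sentences. It is still fragile: a sentence that genuinely ends with an abbreviation will be merged with the one that follows, which is part of why the problem is not trivial.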
Both chunking and part of speech labelling are tasks which can be handled by machine learning. The zero order POS labeller assumes everything is a noun. The next simplest method is a majority voting one, which assigns the current word the tag it most frequently carries in an already labelled body of text, optionally taking the tags of the previous word(s) into account. Beyond this are the machine learning algorithms which take feature sets, including the tags of neighbouring words, to provide a best estimate of the tag for the word of interest. These algorithms include Bayesian classifiers, decision trees and the like, as discussed in Machine Learning in Action, which I have previously reviewed. Natural Language Processing covers these topics fairly briefly but provides pointers to take things further, in particular highlighting that for performance reasons one may call external libraries from within the Natural Language Toolkit.
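The two simplest taggers can be sketched in a few lines of plain Python. This is my own illustration rather than the toolkit's implementation: the tiny labelled corpus, the tag names and the function names are invented, and the majority-vote tagger here looks only at the word itself, ignoring the tags of preceding words.

```python
from collections import Counter, defaultdict

# Tiny hand-labelled corpus, invented purely for illustration.
tagged_corpus = [
    ("the", "DET"), ("dog", "NOUN"), ("runs", "VERB"),
    ("the", "DET"), ("run", "NOUN"), ("was", "VERB"), ("long", "ADJ"),
    ("dogs", "NOUN"), ("run", "VERB"), ("fast", "ADV"),
    ("they", "PRON"), ("run", "VERB"),
]


def default_tagger(words, tag="NOUN"):
    """Zero order baseline: label every word with the same tag."""
    return [(w, tag) for w in words]


def train_most_frequent(tagged):
    """Record, for each word, the tag it carries most often in the labelled text."""
    counts = defaultdict(Counter)
    for word, tag in tagged:
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def most_frequent_tagger(words, model, fallback="NOUN"):
    """Majority vote per word, falling back to the default tag for unseen words."""
    return [(w, model.get(w, fallback)) for w in words]


model = train_most_frequent(tagged_corpus)
sentence = ["the", "cat", "run"]
print(default_tagger(sentence))
print(most_frequent_tagger(sentence, model))
```

In this toy corpus “run” appears twice as a verb and once as a noun, so the majority-vote tagger labels it a verb, while the unseen word “cat” falls back to the noun default.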
The final few chapters on context free grammars exceeded the limits of my understanding for casual reading, although the toy example of using grammars to translate natural language queries to SQL clarified the intention of these grammars for me. The book also provides pointers to additional material, and to where the limits of the field of natural language processing lie.
I enjoyed this book and recommend it. It’s well written, with a style which is just the right level of formality. I read it on the train so didn’t try out as many of the code examples as I would have liked; more of this in future. You don’t have to buy this book, since it is available online in its entirety, but I think it is well worth the money.