
The Logging module in Python

In the spirit of improving my software engineering practices I have been trying to make more use of the Python logging module. In common with many programmers, my first instinct when debugging is to use print statements (or their local equivalent) to get an insight into what my program is up to. Obviously I should be making use of whatever debugger is provided, but there is something reassuring about the immediacy and simplicity of print.

A useful evolution of the print statement in Python is the logging module. It can be used as a simple substitute for print, but it can do so much more: you can configure loggers for different packages and modules whose behaviour can be controlled centrally, and you can vary the verbosity of your logging messages. If you decide to log to a file rather than the terminal, that can be achieved too, and you can even post your log messages to a website using HTTPHandler. Clearly logging is about much more than debugging.

I am writing this blog post because, as most of us have discovered, using logging is not quite as straightforward as we were led to believe. In particular, you might find yourself in the situation where you feel you have set up your logging, yet when you run your code nothing appears in your terminal window. Print doesn’t do this to you!

Loggers are arranged in a hierarchy. Loggers have handlers, which are the things that cause a logger to generate output on a device. If no logger is specified then a default logger, the root logger, is used. A logger has a name, and the hierarchy is defined by the dots in the name, all the way “up” to the root logger. Any logger can have a handler attached to it; if no handler is attached then log messages are passed to the parent logger.
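To make the hierarchy concrete, here is a minimal sketch (the “myapp” names are invented): a logger with no handler of its own passes its records up to its parent, which emits them.

import logging

parent = logging.getLogger("myapp")      # a child of the root logger
child = logging.getLogger("myapp.db")    # a child of "myapp", via the dot

parent.addHandler(logging.StreamHandler())  # only the parent gets a handler
parent.setLevel(logging.INFO)

# The child has no handler, so this record propagates up and is
# emitted by the handler attached to "myapp".
child.info("hello from myapp.db")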

A log record has a message (the thing you would have printed) and a “level” which indicates the severity of the message. Levels are specified by integers, for which the logging module provides convenient labels. The levels, in order of increasing severity, are logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR and logging.CRITICAL. A handler will output a message if the level of the message is equal to or higher than the level the handler has been set to. So a handler set to WARNING will show messages at the WARNING, ERROR and CRITICAL levels but not at the INFO and DEBUG levels.
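A minimal sketch of this filtering, with an invented logger name:

import logging

handler = logging.StreamHandler()
handler.setLevel(logging.WARNING)  # the handler only emits WARNING and above

logger = logging.getLogger("levels_demo")
logger.setLevel(logging.DEBUG)     # the logger itself passes everything on
logger.addHandler(handler)

logger.info("not shown: INFO is below the handler's WARNING threshold")
logger.error("shown: ERROR is above the WARNING threshold")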

The simplest way to use the logging module is to import the library:

import logging

Then carry out some minimal configuration,

logging.basicConfig(level=logging.INFO)

and then put logging.info statements in our code, just as we would have done with print statements:

logging.info("This is a log message that takes a parameter = {}".format(a_parameter_value))

logging.debug, logging.warning, logging.error and logging.critical are used to publish log messages with different levels of severity. These are all convenience methods which remove the need to explicitly give the level as found in the logging.log function:

logging.log(logging.INFO, "This is a log message")

If we are writing a module, or other code that we anticipate others importing and running, then we should create a logger using logging.getLogger(__name__) but leave configuring it to the caller. In this instance we publish messages via the logger we have created, rather than via the module-level logging functions. So to publish a message we would do:

logger = logging.getLogger(__name__)
logger.info("Hello")

In the module importing this library you would do something like:

import logging
import some_library
logging.basicConfig(level=logging.INFO)
# if you wanted to tweak the level of another logger
logger = logging.getLogger("some other logger")
logger.setLevel(logging.DEBUG)

basicConfig() configures the root logger, which is where all messages end up in the absence of any other handler. The behaviour of logging.basicConfig() is downright obstructive at times. The core of the problem is that it can only be invoked once in a session; any subsequent invocations are ignored. Worse than this, it can be invoked implicitly. So if, for example, you do:

import logging
logging.warning("Hello")

You’ll see a message because secretly logging has effectively run logging.basicConfig(level=logging.WARNING) for you (or something similar). This means that if you were to then naively go ahead and run basicConfig yourself:

logging.basicConfig(level=logging.INFO)

You would see no message when you subsequently ran logging.info("Hello"), because the “second” invocation of logging.basicConfig is ignored.
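The whole gotcha fits in one runnable snippet:

import logging

logging.warning("Hello")                 # implicitly configures the root logger
logging.basicConfig(level=logging.INFO)  # silently ignored: a handler already exists
logging.info("Hello again")              # lost: the root level is still WARNING

From Python 3.8 onwards basicConfig accepts a force=True argument, which removes any existing handlers before applying the new configuration and so defuses this trap.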

We can explicitly set the properties of the root logger by doing:

root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)

You can debug issues like this by checking the handlers attached to a logger. If you do:

import logging
lgr = logging.getLogger()
lgr.handlers

You get the empty list []. Issue a logging.warning() message and you will see that a handler has been added to the root logger: lgr.handlers now returns something like [<logging.StreamHandler at 0x44327f0>].

If you want to see a list of all the loggers in the hierarchy then do:

logging.Logger.manager.loggerDict
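Note that this dictionary only contains loggers created so far (the root logger is not included), and ancestor loggers appear as placeholders. A quick illustration, with invented names:

import logging

logging.getLogger("myapp.db")                     # also creates a placeholder "myapp"
print(sorted(logging.Logger.manager.loggerDict))  # ['myapp', 'myapp.db']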

So there you go, the logging module is great – you should use it instead of print. But beware of the odd behaviour of logging.basicConfig(), which I’ve spent most of this post griping about. I wrote this post mainly so that I have all my knowledge of logging in one place, rather than trying to remember in which piece of code I pulled off a particular trick.

I used the logging documentation here, blog posts by Fang (here) and Praveen Gollakota (here) and tab completion in the ipython REPL in the preparation of this post.

Book review: Beautiful JavaScript edited by Anton Kovalyov

I have approached JavaScript in a crabwise fashion. A few years ago I managed to put together some visualisations by striking randomly at the keyboard. I then read Douglas Crockford’s JavaScript: The Good Parts, and bought JavaScript Bible by Danny Goodman, Michael Morrison, Paul Novitski and Tia Gustaff Rayl, which I used as a monitor stand for a couple of years.

Working at ScraperWiki (now The Sensible Code Company), I wrote some more rational code whilst pair-programming with a colleague. More recently I have been building demonstration and analytical web applications in JavaScript which access databases and display layered maps; some of the effects I achieve are even intentional! The importance of JavaScript for me is that nowadays, when I come to make a GUI for my analysis (usually in Python), the natural thing to do is to build a web interface using JavaScript/CSS/HTML, because the “native” GUI toolkits for Python are looking dated and unloved. As my colleague pointed out, every decent web browser now comes with a pretty complete IDE for JavaScript which allows you to run and inspect your code, profile network activity, add breakpoints and emulate a range of devices in both display and network bandwidth capabilities. Furthermore, there are a large number of libraries to help with almost any task. I’ve used d3 for visualisations, jQuery for just about everything, OpenLayers for maps, and three.js for high performance 3D rendering.

This brings me to Beautiful JavaScript: Leading Programmers Explain How They Think edited by Anton Kovalyov. The book is an edited volume featuring chapters from 15 experienced JavaScript programmers. The style varies dramatically, as you might expect, but the chapters are well-edited and readable. Overall the book is only 150 pages. My experience is that learning a programming language is about much more than the brute detail of the language syntax; reading this book is my way of finding out what I should do, rather than what it is possible to do.

It’s striking that several of the authors write about introducing class inheritance into JavaScript. To me this highlights the flexibility of programming languages, and possibly the inflexibility of programmers. Despite many years of abstract learning about object-oriented programming I persistently fail to do it, even if the features are available in the language I am using. I blame some of this on a long association with FORTRAN and then Matlab, which only introduced object-oriented features later in their lives. “Proper” developers, it seems, are taught to use class inheritance, and when the language they select does not offer it natively they improvise to re-introduce it. Since Beautiful JavaScript was published JavaScript has gained a class keyword, but this simply provides a prettier way of accessing JavaScript’s prototype inheritance mechanism.

Other chapters in Beautiful JavaScript are about coding style. For teams, a consistent and unflashy style is more important than using the language to its limits. Some chapters demonstrate just what those limits can be; for example, Graeme Roberts’ chapter “JavaScript is Cutieful” introduces us to some very obscure code. Other chapters offer practical implementations of a maths parser, a domain-specific language parser and some notes on proper error handling.

JavaScript is an odd sort of a language: at first it seemed almost like a toy language designed to do minor tasks on web pages, yet twenty years after its birth it is everywhere and multiple billion-dollar businesses are built on top of it. If you like, you can now code in JavaScript on your server, as well as in the client web browser, using node.js. You can write in CoffeeScript, which compiles to JavaScript (I’ve never seen the point of this). Chapters by Jonathan Barronville on node.js and Rebecca Murphey on Backbone highlight this growing maturity.

Anton Kovalyov writes on how JavaScript can be used as a functional language. It’s illuminating to see this discussion alongside those looking at class inheritance-like behaviour; it highlights the risks of treating JavaScript as a language with class inheritance, or as a “true” functional language. The risk is that although JavaScript might look like these things, ultimately it isn’t, and this may cause problems. For example, functional languages rely on data structures being immutable; they aren’t in JavaScript, so although you might decide in your functional programming mode that you will not modify the input arguments to a function, JavaScript will not stop you from doing so.

The authors are listed with brief biographies in the dead zone beyond the index, which is a pity because the biographies could very usefully have been presented at the beginning of each chapter. They are: Anton Kovalyov, Jonathan Barronville, Sara Chipps, Angus Croll, Marijn Haverbeke, Ariya Hidayat, Daryl Koopersmith, Rebecca Murphey, Daniel Pupius, Graeme Roberts, Jenn Schiffer, Jacob Thornton, Ben Vinegar, Rick Waldron and Nicholas Zakas. They have backgrounds with Twitter, Medium, Yahoo and diverse other places.

Beautiful JavaScript is a short, readable book which gives the relatively new JavaScript programmer something to think about.

Book review: Essential SQLAlchemy by Jason Myers and Rick Copeland

Essential SQLAlchemy by Jason Myers and Rick Copeland is a short book about the Python library SQLAlchemy which provides a programming interface to databases using the SQL query language. As with any software library there is ample online material on SQLAlchemy but I’m old-fashioned and like to buy a book.

SQL was one of those (many) things I was never taught as a scientific programmer, so I went off and read a book and blogged about it (rather more extensively than usual). It’s been a useful skill as I’ve moved away from the physical sciences to more data-oriented software development. As it stands I have a good theoretical knowledge of SQL and databases, and can write fairly sophisticated single table queries but my methodology for multi-table operations is stochastic.

I’m looking at using SQLAlchemy because it’s something I feel I should use, and people with far more experience than me recommend using it, or a similar system. Django, the web application framework, has its own ORM.

Essential SQLAlchemy is divided into three sections, on SQLAlchemy Core, SQLAlchemy ORM and Alembic. The first two represent the two main ways SQLAlchemy interacts with databases. The Core model is very much a way of writing SQL queries, but with Pythonic syntax. I can see this having pros and cons. On the plus side, I’ve seen SQLAlchemy used to write, succinctly, rather complex join queries. In addition, SQLAlchemy Core allows you to build queries conditionally, which is possible with string manipulation on standard queries but requires some careful thought that SQLAlchemy has done for you. SQLAlchemy allows you to abstract away the underlying database so that, in principle, you can switch from SQLite to PostgreSQL seamlessly. In practice this is likely to be a bit fraught, since different databases support different functionality. This becomes a problem when it becomes a problem. SQLAlchemy gives your Python programme a context for its queries, which I can see being invaluable in checking queries for correctness and in documenting the database the programme accesses. On the con side: I usually know what SQL query I want to write, so I don’t see great benefit in adding a layer of Python to write that query.
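As a sketch of what that conditional building looks like (the table and column names are invented, and I am using the select() style of SQLAlchemy 1.4 onwards rather than the older select([...]) form the book uses):

from sqlalchemy import (MetaData, Table, Column, Integer, String,
                        create_engine, select)

engine = create_engine("sqlite:///:memory:")
metadata = MetaData()
users = Table("users", metadata,
              Column("id", Integer, primary_key=True),
              Column("name", String),
              Column("age", Integer))
metadata.create_all(engine)

def user_query(min_age=None):
    # Build the query up conditionally - no string manipulation required
    stmt = select(users)
    if min_age is not None:
        stmt = stmt.where(users.c.age >= min_age)
    return stmt

with engine.connect() as conn:
    rows = conn.execute(user_query(min_age=18)).fetchall()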

SQLAlchemy Object Relational Mapper (ORM) is a different way of doing things. Rather than explicitly writing SQL-like statements, we are invited to create classes which map to the database via SQLAlchemy. This leaves us to think about what we want our classes to do rather than worry too much about the database. This sounds fine in principle, but I suspect the experienced SQL-user will know exactly what database schema they want to create.
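The ORM equivalent, again an illustrative sketch for SQLAlchemy 1.4 onwards, maps a class to the same sort of table:

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:  # we think in objects, not SQL statements
    session.add(User(name="Ada"))
    session.commit()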

Both the Core and ORM models allow the use of “reflection” to build the Pythonic structures from a pre-existing database.
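Reflection looks something like this (the filename and table name are hypothetical; autoload_with is the SQLAlchemy 1.4+ spelling):

from sqlalchemy import MetaData, Table, create_engine

engine = create_engine("sqlite:///existing.db")  # a database that already exists
metadata = MetaData()
# Column definitions are read from the database rather than declared in Python
users = Table("users", metadata, autoload_with=engine)
print([c.name for c in users.columns])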

The third part of the book is on Alembic, a migrations manager for SQLAlchemy, which is installed separately. This automates the process of upgrading your database to a new schema (or downgrading it). You’d want to do this to preserve customer data in a transactional database storing orders or something like that. Here I learnt that SQLite does not have full ALTER TABLE functionality.

A useful pattern in both this book and in Test-driven Development is to wrap database calls in their own helper functions. This helps in testing but it also means that if you need to switch database backend or the library you are using for access then the impact is relatively low. I’ve gone some way to doing this in my own coding.
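A sketch of the pattern, using the standard library’s sqlite3 module and invented names:

import sqlite3

DB_PATH = "data.db"  # hypothetical database file

def fetch_user_name(user_id):
    """The rest of the programme calls this helper, never sqlite3 directly,
    so swapping the backend or access library touches only this layer."""
    with sqlite3.connect(DB_PATH) as conn:
        row = conn.execute("SELECT name FROM users WHERE id = ?",
                           (user_id,)).fetchone()
    return row[0] if row else None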

The sections on Core and ORM are almost word-for-word repeats, with only small variations to account for the different syntax of the two methods. Although it may have didactic value, this is frustrating in a book already so short.

Reading this book has made me realise that the use I put SQL to is a little unusual. I typically use a database to provide convenient access to a dataset I’m interested in, so I do a one off upload of the data, apply indexes and then query. Once loaded the data doesn’t change. The datasets tend to be single tables with limited numbers of lookups which I typically store outside of the database. Updates or transactions are rare, and if I want a new schema then I typically restart from scratch. SQLite is very good for this application. SQLAlchemy, I think, comes into its own in more transactional, multi-table databases where Alembic is used to manage migrations.

Ultimately, I suspect SQLAlchemy does not make for a whole book by itself, hence the briefness of this one despite much repeated material. Perhaps, “SQL for Python Programmers” would work better, covering SQL in general and SQLAlchemy as a special case.

Book review: Test-Driven Development with Python by Harry J.W. Percival

Test-Driven Development with Python by Harry J.W. Percival is a tutorial rather than a text book and it taught me as much about Django as testing. I should point out that I wilfully fail to do the “follow along with me” thing in tutorial style books.

Test-driven development (TDD) is a methodology that mandates writing the tests before the actual code that does stuff. The first tests are for the desired behaviour that will be presented to the user.
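As a minimal sketch of the cycle using Python’s built-in unittest (the add function is invented):

import unittest

# Step 1 ("red"): write this test before add() exists and watch it fail
class TestAdd(unittest.TestCase):
    def test_adds_two_numbers(self):
        self.assertEqual(add(2, 3), 5)

# Step 2 ("green"): write the minimal code that makes the test pass
def add(a, b):
    return a + b

if __name__ == "__main__":
    unittest.main()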

I was introduced to software testing very early in my tenure at ScraperWiki, now The Sensible Code Company. I was aware of its existence prior to this but didn’t really get the required impetus to get started; it didn’t help that I was mostly coding in Matlab, which didn’t have a great deal of support for testing at the time. The required impetus at ScraperWiki was pair programming.

Python is different to Matlab: it has an entirely acceptable testing framework built in. Test-driven Development made me look at this functionality again. So far I’ve been using the nose testing library, but there is now a note on its home page saying it will not be developed further. It turns out Python’s unittest has been developed considerably in Python 3, which reduces the need for 3rd party libraries to help in the testing process. Python now includes the mock library, which provides objects to act as “test doubles” prior to the implementation of the real thing. As an aside, I learnt there is a whole set of such test doubles, including mocks but also stubs, fakes and spies.
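A small sketch of a mock acting as a test double (the payment gateway is invented):

from unittest.mock import Mock

# A stand-in for a payment gateway that has not been implemented yet
gateway = Mock()
gateway.charge.return_value = {"status": "ok"}

result = gateway.charge(amount=10)
assert result["status"] == "ok"
gateway.charge.assert_called_once_with(amount=10)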

Test-driven Development is structured as a tutorial that builds a simple list management web application which stores the lists of multiple users and allows them to add new items. The workflow follows the TDD scheme: write failing tests first, then develop until they pass. The first tests are functional tests of the whole application, made using the Selenium webdriver, which automates a web browser and allows testing of dynamic, JavaScript pages as well as simple static pages. Beneath these functional tests lie unit tests, which test isolated pieces of logic, and integrated tests, which test logic against data sources and other external systems. Integration tests test against 3rd party services.
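A functional test in this style looks roughly like the following sketch (the address and page title are placeholders, not the book’s actual code):

from selenium import webdriver

browser = webdriver.Firefox()  # drives a real browser
try:
    browser.get("http://localhost:8000")  # placeholder address of the app
    assert "To-Do" in browser.title       # placeholder expectation
finally:
    browser.quit()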

The solution is worked through using the Django web framework for Python. I’ve not used it before – I use the rather simpler Flask library. I can see that Django contains more functionality but it is at the cost of more complexity. In places it wasn’t clear whether the book was talking about general testing functionality or some feature of the Django testing functionality. Django includes a range of fancy features alien to the seasoned Flask user. These include its own ORM, user administration systems, and classes to represent web forms.

Test-driven Development has good coverage in terms of the end goal of producing a web application. So not only do we learn about testing elements of the Python web application but also something of testing in JavaScript. (This seems to involve a preferred testing framework for every library). It goes on to talk about some elements of devops, configuring servers using the Fabric library, and also continuous integration testing using Jenkins. These are all described in sufficient detail that I feel I could setup the minimal system to try them out.

Devops still seems to be something of a dark art, with a profusion of libraries and systems (Chef, Puppet, Ansible, Juju, Salt, etc., etc.) and no clear, stable frontrunner.

An appendix introduces “behaviour-driven development”, which uses a test framework that allows the tests to be expressed in a domain-specific language with (manual) links to the functional tests beneath.

In terms of what I will do differently having read this book: I’m keen to try out some JavaScript testing, since my normal development activities involve data analysis and processing in Python but increasingly blingy web interfaces for visualisation and presentation. At the moment these frontends are slightly scary systems which I fear to adjust, since they are without tests.

With the proviso above, that I don’t actually follow along, I like the tutorial style. Documentation has its place but ultimately it is good to be guided in what you should do rather than all the things you could possibly do. Test-driven Development introduces the tools and vocabulary you need to work in a test-driven style with the thread of the list management web application tying everything together. Whether it instils in me the strict discipline of always writing tests first remains to be seen.

Book review: The Seven Pillars of Statistical Wisdom by Stephen M. Stigler

The Seven Pillars of Statistical Wisdom by Stephen M. Stigler is a brief history of what the author describes as the key pillars of statistics. This is his own selection rather than some consensus of statistical opinion. That said, to my relatively untrained eye the quoted pillars are reasonable. They are as follows:

1 – Aggregation. The use of the arithmetic average, or mean, is not self-evidently a good thing. It was during the 17th century, when people were taking magnetic measurements in order to navigate, that ideas around the mean started to take hold. Before this time it was not obvious which value one should take when discussing a set of measurements purportedly of the same thing. One might take the mid-point of the range of values, or apply some subjective process based on one’s personal knowledge of the measurer. During the 17th century researchers came to the conclusion that the arithmetic mean was best.

2 – Information. Once you’ve discovered the mean, how good is it as a measure of the underlying phenomenon as you increase the size of the aggregation? It seems obvious that the measure improves as the number of measurements increases, but how quickly? The non-trivial answer is that the error shrinks as the square root of N, the number of measurements. Sadly this means that if you double the number of measurements you make, you only improve your confidence in the mean by a factor of a little over 1.4 (the square root of 2). Mixed in here are ideas about the standard deviation, a formulation now routinely quoted with the mean. It was originally introduced by De Moivre in 1738, for the binomial distribution, and then generalised by Laplace in 1810 as the Central Limit Theorem.
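The square root scaling is easy to check numerically; a quick sketch with numpy:

import numpy as np

rng = np.random.default_rng(0)
for n in (100, 400, 1600):
    # spread of the sample mean over many repeated experiments of size n
    means = rng.normal(0.0, 1.0, size=(10000, n)).mean(axis=1)
    print(n, round(means.std(), 4))  # roughly 0.1, 0.05, 0.025: halves as n quadruples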

3 – Likelihood. This relates to estimating the confidence that an observed difference is real, and not due to chance. The earliest work, by John Arbuthnot, concerned the sex ratios of births recorded in England and whether they could have arisen by chance rather than through a “real” difference in the numbers of boys and girls born.

4 – Intercomparison. Frequently we wish to compare sets of measurements to see if one thing is significantly different from another; the Student t-test is an example of such a thing. It is named for William Gosset, who took a sabbatical from his job at Guinness to work in Karl Pearson’s lab at UCL. Guinness did not want an employee’s name to appear on a scientific paper (thus revealing their interest), so Gosset wrote under the rather unimaginative pseudonym “Student”.

5 – Regression. The chapter starts with Charles Darwin and his disregard for higher mathematics. He professed a faith in measurement and “The Rule of Three”. This is the algebraic identity a/b = c/d, which states that if you know any three of a, b, c and d you can calculate the fourth. This is true in a perfect world, but in practice we would acquire multiple sets of our three selected values and use regression to obtain a “best fit” for the fourth value. Also in this chapter is Galton’s work on regression to the mean, in particular how parents with extreme heights had children whose heights were on average closer to the mean. This is highly relevant to the study of evolution and the inheritance of characteristics.
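A sketch of the “best fit” version of the Rule of Three: given many noisy (d, c) pairs, the ratio a/b is the least-squares slope of a line through the origin (all numbers invented):

import numpy as np

rng = np.random.default_rng(1)
true_ratio = 2.5                                 # the a/b we want to recover
d = rng.uniform(1.0, 10.0, 100)
c = true_ratio * d + rng.normal(0.0, 0.5, 100)   # c/d is noisy in practice

ratio_hat = (d @ c) / (d @ d)  # least-squares slope through the origin
print(ratio_hat)               # close to 2.5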

6 – Design. The penultimate pillar is design. In the statistical sense this means the design of an experiment in terms of the number of trials and how those trials are organised. The chapter starts with a discussion of calculating odds for the French lottery, founded in 1757 and providing up to 4% of the French budget in 1811. It then moves on to R.A. Fisher’s work at the Rothamsted Research Centre on randomisation in agricultural trials. My experience of experimental design is that statisticians always want you to do more trials than you can afford, or have time for!

7 – Residual. Plotting the residual left when you have made your best model and subtracted it from your data is a time-honoured technique. Systematic patterns in the residuals can indicate that your model is wrong, that there are new, as yet undiscovered, phenomena to be found. I was impressed to discover in this chapter that Frank Weldon cast 12 dice some 26,306 times, recording 315,672 individual rolls, to try to determine whether they were biased. Data collection can be an obsessive activity, and this story from the early 20th century is not uncommon.

Seven Pillars is oddly pitched: it is rather technical for a general science audience, and it is an entertainment rather than a technical text. The individual chapters would have fitted quite neatly into The Values of Precision, which I have reviewed previously.