May 2022 archive

A way of working: data science

I am about to take on a couple of data science students from Lancaster University for summer projects. From past experience, I always spend some time at the beginning of such projects explaining how I work, with the expectation that they will at least take some notice, if not repeat my methodology exactly. This methodology evolves slowly over time as I learn new things and my favoured technologies change.

Typically I develop on a Windows laptop but I use the git-bash prompt as my shell for typing in commands – this is a Linux-like terminal which I adopted after working with developers who mainly used Linux, and also because I was familiar with the Unix-style commandline from before my time on Linux. You can do a lot from the commandline in data science – Data Science at the Command Line by Jeroen Janssens is an excellent introduction.

I use Docker containers a bit to spin up local versions of services which are difficult to run on Windows (things like Airflow and LinkedIn DataHub). Some people develop entirely inside Docker containers to reduce dependency issues and make deployment of code easier.

I work pretty much entirely in Python for data processing and analysis, although I generate CSV files which I load into Tableau for visualisation. I tend not to try complex processing in Tableau since I find the GUI inconvenient and confusing for such work. I use the Anaconda distribution of Python, originally because I liked that it came packaged with a load of useful libraries for data science and it handled virtual environments and the installation of trickier packages better than plain Python. It may be worth revisiting this decision. I have recently shifted my code to Python 3.9.

For a piece of work I will usually set up a Python project which can be “installed”. This blog post explains a standard structure for Python projects. I aim to use Python virtual environments on a per project basis but sometimes I fail. Typically, I will write Python modules that provide functions but also have a simple command line interface which takes two or three positional parameters. You can see this in action in the git repo here which I share as a template for myself and others!

To date I have picked up commandline arguments using sys.argv. I should probably use one of the libraries to make these commandline interfaces better; there is a blog post here which compares the built-in argparse library with click and docopt. I think I might check out click for future projects.
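As a rough illustration of the pattern described above, a module that provides a function plus a thin command line interface with two positional parameters might look like the sketch below. It uses the built-in argparse library rather than sys.argv. The names `count_values`, `build_parser` and `main`, and the CSV-counting task itself, are hypothetical and are not taken from the template repo.

```python
import argparse
import csv


def count_values(input_path: str, column: str) -> dict:
    """Count how often each value appears in one column of a CSV file."""
    counts: dict = {}
    with open(input_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            value = row[column]
            counts[value] = counts.get(value, 0) + 1
    return counts


def build_parser() -> argparse.ArgumentParser:
    """Describe the two positional parameters of the (hypothetical) script."""
    parser = argparse.ArgumentParser(description="Count values in a CSV column")
    parser.add_argument("input_path", help="CSV file to read")
    parser.add_argument("column", help="name of the column to count")
    return parser


def main(argv=None) -> int:
    """Parse arguments (sys.argv by default) and print value,count lines."""
    args = build_parser().parse_args(argv)
    counts = count_values(args.input_path, args.column)
    for value, count in sorted(counts.items()):
        print(f"{value},{count}")
    return 0


# In a real module this call would sit under the usual guard:
# if __name__ == "__main__":
#     raise SystemExit(main())
```

Keeping the work in `count_values` and the argument handling in `main` means the function can be imported and tested directly, while argparse supplies `--help` text and error messages for free.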

As well as running commandline scripts I use tests to develop analysis. As well as being good software development practice, test runners make a convenient way to run arbitrary functions in a code base. I prefer to use the built-in unittest library but I’ve started using pytest for a recent project. I wrote a blog post about writing tests; since I wrote it I have learned about test mocks and pytest’s fixture functionality.
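To make the idea of using a test runner to drive analysis code concrete, here is a minimal sketch using the built-in unittest library. The function `mean_by_group` and the sample rows are invented for illustration; they stand in for whatever analysis function is under development.

```python
import unittest


def mean_by_group(rows):
    """Return the mean of 'value' for each 'group' in a list of dicts."""
    totals = {}
    for row in rows:
        total, count = totals.get(row["group"], (0.0, 0))
        totals[row["group"]] = (total + row["value"], count + 1)
    return {group: total / count for group, (total, count) in totals.items()}


class TestMeanByGroup(unittest.TestCase):
    def test_mean_by_group(self):
        # A small hand-made dataset whose answer is easy to check by eye
        rows = [
            {"group": "a", "value": 1.0},
            {"group": "a", "value": 3.0},
            {"group": "b", "value": 10.0},
        ]
        self.assertEqual(mean_by_group(rows), {"a": 2.0, "b": 10.0})
```

Saved in a module, this can be run with `python -m unittest`, and pytest will also collect `unittest.TestCase` classes, so the same file works with either runner.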

I have a library of general utilities for interacting with databases, setting up logging and writing dictionaries, which I wrote because I found I was doing these things repeatedly; making my own library allowed me to forget some of the boilerplate code required to do these things. The key utilities are included with the repo attached to this blog.

I’ve been using Visual Studio Code as my editor for some time now. I prefer not to use full-blown IDEs because I find they present more functionality than I can cope with. I think this is a result of coding in Java using Eclipse and C# .NET in Visual Studio. In any case Visual Studio Code starts as a nice enough code editor but has been sneaking in more IDE functionality via extensions.

The extensions I use heavily in Visual Studio Code are Python and Pylance – Pylance is the Python language server which provides type-hinting support. I wrote about type-hinting in Python here. I also use Rainbow CSV for when I am editing or viewing CSV files.
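For readers who have not met type hints, the short sketch below shows the kind of annotations a language server can make use of. The functions `parse_temperature` and `mean_temperature` are made up for this example; the syntax shown works on Python 3.9.

```python
from typing import Optional


def parse_temperature(text: str) -> Optional[float]:
    """Convert a string such as '21.5' to a float, or None if it is not a number."""
    try:
        return float(text)
    except ValueError:
        return None


def mean_temperature(readings: list[str]) -> float:
    """Average the readings that parse as numbers, skipping the rest."""
    parsed = (parse_temperature(reading) for reading in readings)
    values = [value for value in parsed if value is not None]
    if not values:
        raise ValueError("no valid readings")
    return sum(values) / len(values)
```

The hints are not enforced when the code runs, but with them in place a type checker would be expected to flag a call such as `mean_temperature(21.5)` in the editor before the code is ever executed.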

I could use Visual Studio Code for accessing git, my preferred source control system, but instead I use GitKraken which has a very nice GUI. Since I am usually working by myself my git usage is very simple: I typically have one branch onto which I make many small commits. I have recently started working with a team where I am using feature-based branches which get merged by pull requests – this was a bit of a culture shock.

As a result of working with other people on a new project I have started using some technologies which I will just mention here. I run the black formatter, as well as pylint and flake8. Black just reformats my code files when I save them and can largely be ignored. Flake8 is fairly easy to satisfy, although I spent a lot of time addressing line length issues. Pylint generates quite a few warnings which I attend to but sometimes ignore.

I have also started using Makefiles and Azure DevOps pipelines for running common tasks on my code (tests, cleanup, setting up infrastructure, linting).

Outside technology, I have a very long-established method of working using a monthly Word document as a notebook; I describe it here. I tend to prefix file names with ISO 8601 format dates (2022-05-22). This means that if I created a Tableau workbook or an Excel worksheet I can link it easily to what I was writing in my notebook and the status of the appropriate git repo at that point in time.
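The date prefix is easy to generate in code. The sketch below is one way to do it; the helper name `dated_filename` and the hyphen used to join the date to the rest of the name are my assumptions for illustration rather than a fixed convention.

```python
from datetime import date
from typing import Optional


def dated_filename(name: str, on: Optional[date] = None) -> str:
    """Prefix a file name with an ISO 8601 date (today's date by default)."""
    day = on if on is not None else date.today()
    # date.isoformat() gives YYYY-MM-DD, so names sort chronologically as text
    return f"{day.isoformat()}-{name}"
```

Because the year comes first and the fields are zero padded, sorting a folder by name also sorts it by date.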

I’ve incorporated all the code-related elements mentioned above in this ways-of-working-data-science git repository.

Book review: Pale Rider – The Spanish Flu of 1918 by Laura Spinney

Pale Rider: The Spanish Flu and How it Changed the World by Laura Spinney is obviously very topical at the moment. It was published in June 2017, which makes its relevance more striking than if it had been published in the last two years.

The book starts with an overall chronology of the 1918 flu pandemic before returning to specific themes, generally through the medium of personal accounts or individual incidents. It is worth highlighting that the "Spanish" label is highly misleading: the 1918 flu pandemic arose in the American Midwest, in Northern France on the battlefields of the First World War or, a remote possibility, in China. Spinney discusses the link with viruses found in wildlife and livestock.

Initial estimates of the death toll of the 1918 flu pandemic were around 25 million but these have recently been revised upwards to as many as 100 million. Furthermore, the 1918 flu pandemic largely took place over September to December 1918, with smaller waves in the spring of 1918 and in the following spring, and with some variation by geography as to exactly when the worst effects were felt. So the 1918 flu pandemic was a shorter, more devastating pandemic than the 2020 covid pandemic (which has killed around 3 million of a much larger population). This was against the backdrop of the First World War, which killed more people in Europe than the pandemic did, although Europe was the exception: in all other continents more were killed by the pandemic.

The context for the 1918 flu pandemic was different too: the 19th century had been one of epidemics driven by industrialisation and the associated urbanisation. Amongst those were flu pandemics in 1830 and 1890. The 1890 "Russian" flu pandemic was the first to be measured as a pandemic. The 1918 pandemic was at a time when the germ theory of disease was being developed, and the value of hygiene was understood. However, viral diseases were not well understood and it was not until the 1930s that the mechanism of transmission for flu was discovered, with the first flu vaccines coming in 1936. It was not until the 1950s that it was confirmed as a viral disease. The symptoms of this flu pandemic were quite different from those of the covid pandemic, with a mahogany colouration forming on the cheekbones that spread progressively until death, teeth and hair falling out, and delirium (leading to suicide).

The health measures taken to address the 1918 pandemic were not that different from those used recently, with sanitary cordons and quarantine used extensively. Religious ceremonies were exempt from restrictions in Spain, leading to more cases. Closing schools was argued over, with those in favour of keeping them open seeing schools as better for the monitoring of outbreaks, communication of health information, and offering better sanitary conditions, and food, to children. Starvation was a problem, with supply chains affected from start to finish.

It is interesting to compare the responses of Australia and New Zealand in the 1918 pandemic and the covid pandemic: Australia isolated in 1918, as it did in the covid pandemic, whereas New Zealand isolated in the covid pandemic but did not in 1918. The disproportionate impacts of the 1918 pandemic were also in evidence, with recent Italian immigrants to the US, India and remote Native American communities in Alaska very badly affected, with mortality rates of up to 40%.

The pandemic had arguable impacts on world affairs: Woodrow Wilson had a serious stroke, probably as a result of a bout of flu, and was not present to limit the war reparations against Germany. The independence movement in India grew. The flu impacted people in their twenties and thirties quite heavily, leaving behind a generation of orphans – their treatment was handled with new legislation by France and England. There was a post-pandemic (and war) fertility boom.

Despite the enormous death toll, even compared to the First World War, the 1918 pandemic appeared to have little impact on art and literature, although scholars will look for signs of post-viral fatigue in paintings. Spinney argues this is because insufficient time has passed, noting that there are approaching 80,000 books on the First World War but only 400 on the 1918 pandemic – but this number is growing rapidly. It has made me wonder about the lost siblings in my grandparents’ generation who were never spoken of – similarly the absence of stories from fighting age men of the Second World War. Essentially these stories were too painful to handle at a human, personal level and the culture in the UK at least would not have been to speak about them. So it is left to historians and the passage of time for the stories to come to light.

A second factor, proposed by psychologists, is that pandemics lack a good story line with a clear beginning and end and a selection of heroes – unlike the First World War.

Pale Rider is very readable, though it is difficult to use the word "enjoy" regarding a book which tells of the deaths of 100 million people. I was struck by how relevant the 1918 flu pandemic was to our current situation, with the disparate impacts depending on country and social conditions, the debates over school closures, the dedication of medical staff, the measures to address the pandemic and the debates over compliance with public health measures. The covid pandemic is different – it has played out over a longer period, it has a far lower death toll, our medical knowledge is much improved, our world is much more connected but nevertheless Pale Rider feels very prescient.