Ian Hopkinson


A way of working: data science

I am about to take on a couple of data science students from Lancaster University for summer projects. From past experience, I always spend some time at the beginning of such projects explaining how I work, with the expectation that they will at least take some notice, if not repeat my methodology exactly. This methodology evolves slowly over time as I learn new things and my favoured technologies change.

Typically I develop on a Windows laptop but I use the git-bash prompt as my shell for typing in commands – this is a Linux-like terminal which I adopted after working with developers who mainly used Linux, and also because I was familiar with the Unix-style commandline from the time before Linux. You can do a lot from the commandline in data science – Data Science at the Command Line by Jeroen Janssens is an excellent introduction.

I use Docker containers a bit to spin up local versions of services which are difficult to run on Windows (things like Airflow and LinkedIn DataHub). Some people develop entirely inside Docker containers to reduce dependency issues and make deployment of code easier.

I work pretty much entirely in Python for data processing and analysis although I generate CSV files which I load to Tableau for visualisation. I tend not to try complex processing in Tableau since I find the GUI inconvenient and confusing for such work. I use the Anaconda distribution of Python, originally because I liked that it came packaged with a load of useful libraries for data science and it handled virtual environments and installation of more tricky packages better than plain Python. It may be worth revisiting this decision. I have recently shifted my code to Python 3.9.

For a piece of work I will usually set up a Python project which can be “installed”. This blog post explains a standard structure for Python projects. I aim to use Python virtual environments on a per project basis but sometimes I fail. Typically, I will write Python modules that provide functions but also have a simple command line interface which takes two or three positional parameters. You can see this in action in the git repo here which I share as a template for myself and others!
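As a minimal sketch of the kind of module described above (functions plus a simple command line interface with two positional parameters), something like the following would do. The function name `summarise` and its parameters are hypothetical and are not taken from the template repo.

```python
import sys


def summarise(input_path: str, column: str) -> str:
    """Hypothetical analysis function; here it only describes the requested work."""
    return f"Summarising column '{column}' from {input_path}"


def main(argv: list[str]) -> int:
    # argv[0] is the script name, so two positional parameters means len(argv) == 3
    if len(argv) != 3:
        print(f"Usage: {argv[0]} <input_path> <column>")
        return 1
    print(summarise(argv[1], argv[2]))
    return 0


if __name__ == "__main__":
    # Wrap this call in sys.exit(...) to pass the return code back to the shell
    main(sys.argv)
```

Keeping the work in `summarise` and the argument handling in `main` means the same function can be imported from other modules or called from a test.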

To date I have picked up commandline arguments using sys.argv, but I should probably use one of the libraries to make these commandline interfaces better. There is a blog post here which compares the built-in argparse library with click and docopt. I think I might check out click for future projects.
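For comparison with raw sys.argv handling, here is a rough sketch of two positional parameters using the built-in argparse library. The argument names and the optional `--limit` flag are invented for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with two positional parameters and one optional flag."""
    parser = argparse.ArgumentParser(description="Summarise one column of a CSV file")
    parser.add_argument("input_path", help="path to the input CSV file")
    parser.add_argument("column", help="name of the column to summarise")
    parser.add_argument("--limit", type=int, default=10, help="maximum rows to read")
    return parser


# An explicit list is parsed here so the sketch runs anywhere;
# calling parse_args() with no arguments reads sys.argv[1:] instead.
args = build_parser().parse_args(["sales.csv", "price", "--limit", "5"])
print(args.input_path, args.column, args.limit)
```

The main gain over sys.argv is that argparse generates the usage message and `--help` text, and converts `--limit` to an integer without extra code.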

As well as running commandline scripts I use tests to develop analysis. As well as being good software development practice, test runners make a convenient way to run arbitrary functions in a code base. I prefer to use the unittest built-in library but I’ve started using pytest for a recent project. I wrote a blog post about writing tests; since I wrote it I have learned about test mocks and pytest’s fixture functionality.
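To illustrate using a test as a way of running a piece of analysis, here is a small sketch using unittest. The function `mean_by_group` and the sample rows are made up for the example.

```python
import unittest


def mean_by_group(rows):
    """Hypothetical analysis step: mean of 'value' for each 'group'."""
    totals = {}
    for row in rows:
        total, count = totals.get(row["group"], (0.0, 0))
        totals[row["group"]] = (total + row["value"], count + 1)
    return {group: total / count for group, (total, count) in totals.items()}


class TestMeanByGroup(unittest.TestCase):
    def test_two_groups(self):
        rows = [
            {"group": "a", "value": 1.0},
            {"group": "a", "value": 3.0},
            {"group": "b", "value": 10.0},
        ]
        self.assertEqual(mean_by_group(rows), {"a": 2.0, "b": 10.0})
```

Saved in a file whose name starts with `test_`, this can be run with `python -m unittest` or with pytest, which also collects unittest-style test cases.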

I have a library of general utilities for interacting with databases, setting up logging and writing dictionaries which I wrote because I found I was doing these things repeatedly, and making my own library allowed me to forget some of the boilerplate code required to do these things. The key utilities are included with the repo attached to this blog.
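The sketch below shows the sort of boilerplate such utilities can hide, using only the standard library. It is not the code from the repo: the names `setup_logging` and `write_dictionaries` are assumptions, and I have assumed that "writing dictionaries" means writing a list of dictionaries to a CSV file.

```python
import csv
import logging


def setup_logging(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes timestamped messages to the console."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def write_dictionaries(path: str, rows: list[dict]) -> int:
    """Write a list of dictionaries to a CSV file, returning the number of rows written."""
    if not rows:
        return 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        # Column order is taken from the keys of the first dictionary
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

A CSV file written this way can be loaded straight into Tableau.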

I’ve been using Visual Studio Code as my editor for some time now. I prefer not to use full-blown IDEs because I find they present more functionality than I can cope with. I think this is a result of coding in Java using Eclipse and C# .NET in Visual Studio. In any case Visual Studio Code starts as a nice enough code editor but has been sneaking in more IDE functionality via extensions.

The extensions I use heavily in Visual Studio Code are Python and Pylance, the Python language server, which provides type-hinting support. I wrote about type-hinting in Python here. I also use Rainbow CSV for when I am editing or viewing CSV files.

I could use Visual Studio Code for accessing git, my preferred source control system, but instead I use GitKraken which has a very nice GUI. Since I am usually working by myself my git usage is very simple: I typically have one branch onto which I make many small commits. I have recently started working with a team where I am using feature-based branches which get merged by pull requests – this was a bit of a culture shock.

As a result of working with other people on a new project I have started using some technologies which I will just mention here. I run the black formatter, as well as pylint and flake8. Black just reformats my code files when I save them and can largely be ignored. Flake8 is fairly easy to satisfy although I spent a lot of time addressing line length issues. Pylint generates quite a few warnings which I attend to but sometimes ignore.

I have also started using Makefiles and Azure DevOps pipelines for running common tasks on my code (tests, cleanup, setting up infrastructure, linting).

Outside technology, I have a very long-established method of working using a monthly Word document as a notebook; I describe it here. I tend to prefix file names with ISO 8601 format dates (2022-05-22). This means that if I create a Tableau workbook or an Excel worksheet I can link it easily to what I was writing in my notebook and the status of the appropriate git repo at that point in time.
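The date prefix is easy to generate in Python if files are being created by a script. This is only a sketch: the helper name `dated_filename` and the hyphen separating the date from the rest of the name are assumptions.

```python
from datetime import date
from typing import Optional


def dated_filename(name: str, on: Optional[date] = None) -> str:
    """Prefix a file name with an ISO 8601 date (today's date by default)."""
    day = on if on is not None else date.today()
    # date.isoformat() returns the date as YYYY-MM-DD
    return f"{day.isoformat()}-{name}"


print(dated_filename("sales-analysis.twb", date(2022, 5, 22)))
# prints: 2022-05-22-sales-analysis.twb
```

A useful side effect of this format is that sorting file names alphabetically also sorts them by date.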

I’ve incorporated all the code related elements mentioned above in this ways-of-working-data-science git repository.

Book review: Pale Rider – The Spanish Flu of 1918 by Laura Spinney

Pale Rider: The Spanish Flu and How it Changed the World by Laura Spinney is obviously very topical at the moment. It was published in June 2017, which makes its relevance more striking than if it had been published in the last two years.

The book starts with an overall chronology of the 1918 flu pandemic before returning to specific themes, generally through the medium of personal accounts or individual incidents. It is worth highlighting that the "Spanish" label is highly misleading: the 1918 flu pandemic arose either in the American Midwest, in Northern France on the battlefields of the First World War or, a remote possibility, in China. Spinney discusses the link with viruses found in wildlife and livestock.

Initial estimates of the death toll of the 1918 flu pandemic were around 25 million but these have recently been revised upwards to as many as 100 million. Furthermore, the 1918 flu pandemic largely took place over September to December in 1918, with smaller waves in the spring of 1918 and in the following spring, and with some variations by geography as to exactly when the worst effects were felt. So the 1918 flu pandemic was a shorter, more devastating pandemic than the 2020 covid pandemic (which has killed around 3 million of a much larger population). This was against the backdrop of the First World War, which killed more people in Europe than the pandemic, although Europe was the exception, with more killed by the pandemic in all other continents.

The context for the 1918 flu pandemic was different too: the 19th century had been one of epidemics driven by industrialisation and the associated urbanisation. Amongst those were flu pandemics in 1830 and 1890. The 1890 "Russian" flu pandemic was the first to be measured as a pandemic. The 1918 pandemic was at a time when the germ theory of disease was being developed, and the value of hygiene was understood. However, viral diseases were not well understood and it was not until the 1930s that the mechanism of transmission for flu was discovered, with the first flu vaccines coming in 1936. It was not until the 1950s that it was confirmed as a viral disease. The symptoms of this flu pandemic were quite different from those of the covid pandemic, with a mahogany colouration forming on the cheekbones that spread progressively until death, teeth and hair falling out, and delirium (leading to suicide).

The health measures taken to address the 1918 pandemic were not that different from those used recently, with sanitary cordons and quarantine used extensively. Religious ceremonies were exempt from restrictions in Spain, leading to more cases. Closing schools was argued over, with those in favour of keeping them open seeing schools as better for the monitoring of outbreaks, communication of health information, and offering better sanitary conditions, and food, to children. Starvation was a problem, with supply chains affected from start to finish.

It is interesting to see the varying responses of Australia and New Zealand between the 1918 pandemic and the covid pandemic: Australia isolated itself in 1918, as it did in the covid pandemic, but in 1918 New Zealand did not. The disproportionate impacts of the 1918 pandemic were also in evidence, with the recent Italian immigrants to the US, India and remote native American communities in Alaska very badly affected, with mortality rates of up to 40%.

The pandemic had arguable impacts on world affairs: Woodrow Wilson had a serious stroke, probably as a result of a bout of flu, and was not present to limit the war reparations against Germany. The independence movement in India grew. The flu impacted people in their twenties and thirties quite heavily, leaving behind a generation of orphans – their treatment was handled with new legislation by France and England. There was a post-pandemic (and war) fertility boom.

Despite the enormous death toll, even compared to the First World War, the 1918 pandemic appeared to have little impact on art and literature, although scholars will look for signs of post-viral fatigue in paintings. Spinney argues this is because insufficient time has passed, noting that there are approaching 80,000 books on the First World War but only 400 on the 1918 pandemic – but this number is growing rapidly. It has made me wonder about the lost siblings in my grandparents’ generation who were never spoken of – similarly the absence of stories from fighting age men of the Second World War. Essentially these stories were too painful to handle at a human, personal level and the culture in the UK at least would not have been to speak about them. So it is left to historians and the passage of time for the stories to come to light.

A second factor, proposed by psychologists, is that pandemics lack a good story line with a clear beginning and end and a selection of heroes – unlike the First World War.

Pale Rider is very readable, although it is difficult to use the word "enjoy" regarding a book which tells of the deaths of 100 million people. I was struck by how relevant the 1918 flu pandemic was to our current situation, with the disparate impacts depending on country and social conditions, the debates over school closures, the dedication of medical staff, the measures to address the pandemic and the debates over the compliance with public health measures. The covid pandemic is different – it has played out over a longer period, it has a far lower death toll, our medical knowledge is much improved, our world is much more connected – but nevertheless Pale Rider feels very prescient.

Book review: Railways and The Raj by Christian Wolmar

Two interests combine with this book, Railways and The Raj by Christian Wolmar. I picked it up after a recommendation in Empireland by Sathnam Sanghera, which is about the British Empire from an Indian perspective but I’m also interested in railways. I have reviewed Wolmar’s Fire & Steam and The Subterranean Railway in the past. The Indian railway system has been sold as a benefit of colonialism, so I was interested to find out more.

The first railways in India were built as early as 1836, not long after those elsewhere, and for similar purposes: shifting heavy loads short distances at mines or similar. However, it wasn’t until the middle of the 19th century that railway building started in earnest. This followed two reports written by the Governor-General of India, Lord Dalhousie, in 1850 and 1853. In contrast to the chaotic growth of railways in Britain and elsewhere, Dalhousie’s reports, formulated a little after the first rush of railway building, presented a rational and coherent plan for the development of Indian railways.

The start to railway building was slow, with opposition from the East India Company in the first instance. Furthermore, physical conditions in India were challenging, particularly the monsoon season which played havoc with railway bridges over rivers, while the railway embankments disturbed the irrigation and drainage in surrounding areas. There were also serious mountain ranges to address.

The Indian railways were built very much for the benefit of the British: most of the rail companies were run from Britain, the levels of return on investment (made from Britain) were guaranteed by the Indian taxpayer, most of the equipment (including rails and often sleepers) was sourced from Britain and the economic benefits of the freight transported by the railways were largely in Britain. Not only this, under the Raj, the senior positions in managing and running the railways were held by British people or Eurasians, and this extended to the train staff with drivers predominately British or Eurasian. The British travelling on the railways did so in luxurious first and second class carriages whereas the great majority of Indians travelled in a fairly grim third class.

Class, religious and gender differences were built into the fabric of the railway with various facilities provided separately for Muslim and Hindu passengers, and various castes. I struggle to decide how much this was a deliberate "divide and rule" policy of the British (which was later to have terrible consequences during Partition) or whether it was the right thing to do to respect local sensibilities (although it is fair to say "respecting local sensibilities" was not greatly in evidence during Britain’s colonial period).

There was some development of railways for famine relief – a recurring issue in India where millions died through famine in parts of the country. Beyond about 50 miles oxen, the main alternative for transporting food, consume more food than they can carry. The Victorian view was that the railway would carry food to be sold at the market rate from areas of surplus to those suffering famine, which did not greatly help the many poor unable to afford food.

There were lines built for military purposes, particularly in the north west in the direction of Afghanistan from where it was feared a Russian threat would come. More generally, as the railways developed the Indian Rebellion of 1857 was still fresh in the mind of the British and it was felt the railway could help move troops around to quell future rebellions – many early stations were built like fortresses. The railways were important during the two world wars but suffered in these periods from overuse and under-investment.

In a book with a number of shocks for white British sensibilities, I think I found the part on Partition most shocking, most probably because it is not something I had thought about before: I knew India had gained independence after the Second World War and that Pakistan and Bangladesh were part of that story. I had not absorbed that it meant the displacement of between 10 and 20 million people, and the deaths of up to 2 million. 20 million people is a third the population of the United Kingdom and 2 million people is the population of Liverpool, Manchester and Birmingham combined.

After Independence and Partition, the successful running of the railways was seen as an important symbol of the success of Independence. Despite the rather hasty British exit, and the lack of home-grown talent and supply chains the post-Independence Indian Railway was quickly much improved.

One recurring theme of the book is the enormous scale of Indian Railways: it currently employs 1.3 million people – globally ranking alongside various Chinese state bodies, McDonald’s, Walmart and the NHS. In the early days the Indian Railways set up company towns in part to service white British employees but also for Indian employees because the railway works were often in otherwise isolated areas. Even now Indian Railways owns huge amounts of property in which its employees live, and also hospitals and schools. It remains central to transport in India, where the capacity of the airline routes is limited and the road network is relatively under-developed.

I enjoyed this book as a story of the development of the railway in India, but also as a sketch of Indian history from the middle of the 19th century. To answer my original question, the railway did benefit India ultimately, after Independence, but under colonial rule it was largely a benefit to Britain.

Book review: Software Design Decoded by Marian Petre and André van der Hoek

Software Design Decoded: 66 Ways Experts Think by Marian Petre and André van der Hoek is my next read.

I picked it up as a recommendation from The Programmer’s Brain by Felienne Hermans. It is an odd little book, something like A6 format with 66 pages containing a short paragraph or two on the behaviours of experts in software design. Each page is dedicated to a single thought. There are sketches scattered liberally through the book by Yen Quach, who is credited in the author biographies.

Although it does not have a contents page or index, Software Design Decoded is divided into "chapters":

  • Experts keep it simple
  • Experts collaborate
  • Experts borrow
  • Experts break rules
  • Experts sketch
  • Experts work with uncertainty
  • Experts are not afraid
  • Experts iterate
  • Experts test
  • Experts reflect
  • Experts keep going

I found this book reassuring as much as anything, and it also gave me some things to think about. Reassuring because it turns out I share habits with experts in software design, which must be a start to being an expert! I write quite a lot of software (for data analysis and data builds) but design tends to come as an afterthought.

I think the thing I already do is to build something even if it isn’t the final form, and I was interested in the comment about avoiding over-generalisation. The element I am missing here is to learn from this initial form and build something better (potentially discarding what I’ve already done). I also do a fair bit of testing, although in this book testing is wider than just software unit tests or even integration tests: it is about testing preconceptions and testing with the user.

I also liked the comment on focusing on the needs of the key stakeholders, where the key stakeholders are the end users. This is a recurring theme – that the end users are the key focus, and them using the product/software is when the job is done.

Always learning gets a recommendation as well as not being afraid to use things in manners other than that intended.

I was interested to note the comments on experts forever sketching since it is something I scarcely do, although sometimes I write sequences of tricky bits of code with the odd arrow. I remember learning how to draw flow charts in the late seventies but rarely use the skill (certainly not with all the proper symbols). Software Design Decoded is slightly contradictory on this: in one place experts sketch abstractly as an aid to thought with the sketches meaningless beyond the moment, and in another the sketches are kept for reference later and hence clear and well-labelled.

Notation also gets a couple of mentions. I take this as a formalised system for naming things – something popular with physicists where the right notation is the difference between a page of formulae and a single line. I’m not really aware of using this in my own practice. Despite repeated attempts at object-oriented design I still tend to be quite "procedural".

I’m still in the "learning" phase of collaboration: for the first time in a while I’m working on code with other people (and it is a bit of a shock for all concerned). I still can’t abide meetings, but then the experts can’t abide some of them either (the ones with no direction).

I found this a bit of a "feel good" book, I share at least some of the habits of software design experts! I probably wouldn’t buy it for a personal read but if you have a coffee table in your software company this book would fit right in.

Book review: Ask a historian by Greg Jenner

Ask a historian by Greg Jenner is a bit of a change of tack for me. It is a list of 50 questions to a historian, Greg Jenner. Each answer is conversational in style, a couple of thousand words at most, pitched at a level that my fairly bright 10 year old would understand, although the content is such that I would be judicious in just sharing it with him. Jenner works on the TV series Horrible Histories which, amongst other things, puts historical incidents to modern pop tunes. It is highly educational and a firm favourite for all ages in our household!

Fifty questions is more than I can review individually, so I will simply outline the style of the questioning and highlight some of my favourites. They are divided into 12 thematic chapters with 4 or 5 questions in each chapter.

Chapter 1 – Fact or Fiction

2 – Is it true they put a dead pope on trial? Yes, it is true, a subsequent pope dug him up in order to do this! The papacy was a fairly wild institution, particularly in the 9th century AD, with a total of 24 popes in the period 896-904, in contrast to a total of 5 in my 50 year life. The 9th century popes did not die of natural causes, their successors helped them along the way.

3 – Atlantis proves aliens are real? – There are questions that make Jenner angry (not at the questioner), and this is one of them. Jenner’s concern on this is two-fold. The first is the implication that non-Europeans couldn’t possibly have done all of these magnificent things – it must have been aliens – which is rather insulting. Secondly, the alien conspiracy theories often have their roots in Nazism.

Chapter 2 – Origins and Firsts

6 – When was the first Monday? No historian likes to be pinned down on a "first" but the origins of the days of the week go back a long way. There is some evidence that the Babylonians used a seven day cycle, which fits neatly into the lunar month, but the seven day week was definitely in place by 2,500 years ago with the Jewish religion celebrating a Sabbath every seven days. There were other options: the ancient Egyptians kept a ten day week, while the Etruscans and early Romans followed an eight day week (labelled with letters A to H).

8 – When did birthdays start being celebrated? It is comforting to realise that we’ve been celebrating our birthdays for at least 2500 years. A birthday party invitation was found at Vindolanda, a Roman fort on Hadrian’s Wall.

Chapter 4 – Food

15 – How old is curry? I found it interesting that the heat we most associate with curry, produced by chillies, is the result of an import from South America. It is also a bit chastening that "curry" is largely an invention of the British, a bastardisation of a very diverse Indian cuisine.

Chapter 5 – Historiography

19 – Who names historical periods? This turns out to be a surprisingly difficult question, historians don’t necessarily agree on the extents of a period (like the Long Eighteenth Century), periods do not neatly delineate time – they overlap, and vary across the world. Periods like "Victorian" are ridiculously large and encompass massive changes in social and economic conditions. Finally, the inhabitants of a period may be unhappy with where they have been placed – the Tudors would not have liked being called Tudors.

Chapter 6 – Animals & Nature

23 – When did we start keeping hamsters as pets? All I can say on this question is that hamsters are creatures full of rage.

Chapter 11 – Language & Communications

45 – Where do names for places in other languages come from? I liked this question, in large part because I remember travelling out of Pisa on a bus wondering why I’d never heard of the obviously large city of Firenze which I kept seeing on signs (it is the city I know as Florence). The names locals give places are endonyms and those that foreigners provide are exonyms. In the days of rapid communication, essentially since the beginning of the 19th century, there has been a tendency for exonyms and endonyms to be one and the same, give or take a bit of pronunciation. Bécs is the Hungarian name for Vienna, known as Wien by the Austrians. Vienna was at the border of the Magyar empire, and basically they called it "gateway".

Chapter 12 – History in Pop Culture

49 – Why do we care so much about the Tudors? I liked this question because it hints at something I have seen elsewhere about Newton, and it occurs regarding Anne Boleyn’s purported 3rd nipple in an earlier question in this book. These stories were promoted by supporters or opponents in the years after a dynasty or person had died because they supported a preferred narrative and their influence persists for centuries.

The book finishes with a rather nicely crafted Recommended Reading section, and perhaps this is the point of the book – not as an end in itself but an introduction to a range of books for a more in-depth view. Ask a historian would be an excellent holiday read, although I must admit I prefer something more substantial on a single subject.