July 2022 archive

Book review: Data Mesh by Zhamak Dehghani

This book, Data Mesh: Delivering Data-Driven Value at Scale by Zhamak Dehghani, essentially covers what I have been working on for the last six months or so. It is therefore highly relevant, but I perhaps have to be slightly cautious in what I write because of commercial confidentiality.

The data mesh is a new design for handling data within an organisation. It has been developed over the last three or four years, with Dehghani at the Thoughtworks consultancy at its core. Given its recency there are no data mesh products on the market, so one is left to build one's own from the components available.

To a large degree the data mesh is a conceptual and organisational shift rather than a technical one: all the technical component parts for a data mesh are already available, apart from the programmatic glue needed to hold the whole thing together.

Data Mesh the book is divided into five parts: the first describes what a data mesh is in fairly abstract terms, the second explains why one might need a data mesh, the third and fourth cover how to design the architecture of the data mesh itself and of the data products that make it up, and the final part, “How to get started”, is about how to make it happen in your organisation.

Dehghani talks in terms of companies having established systems for operational data (data required to serve customers and keep the business running, such as billing information and the state of bank accounts); the data mesh is directed at analytical data, which is derived from the operational data. She uses a fictional company, Daff, Inc., which sounds an awful lot like Spotify, to illustrate these points. Analytical data is used to drive machine learning recommender systems, for example, and to give a better understanding of the business, customers and operations.

The legacy data systems Data Mesh describes are data warehouses and data lakes where data is managed by a central team. The core issue with this arrangement is one of scalability: as the number of data sets grows, the size of the central team grows and the responsiveness of the system drops.

The data mesh is a distributed alternative to this centralised system. Dehghani defines the data mesh in terms of four principles, listed in order of importance:

  1. Domain Ownership – this says that analytical data is owned by the domains that generate it rather than by a centralised data team;
  2. Data as a product – analytical data is owned as a product, with the associated management, discoverability, quality standards and so forth around it. Data products are self-contained entities in their own right – in theory you can stand up the infrastructure to deliver a single data product all by itself;
  3. Self-serve data platform – a self-serve data platform is introduced which makes the process of domain ownership of data products easier, delivering the self-contained infrastructure and services that the data product defines;
  4. Federated computational governance – this is the idea that policies such as access control, data retention, encryption requirements, and actions such as the “right to be forgotten” are determined centrally by a governance board but are stored, and executed, in machine-readable form by the data products themselves (a small sketch of what that might look like follows this list).
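
By way of illustration (this is my own sketch rather than an example from the book), a machine-readable policy carried by a data product might look something like the following; the field names and values are purely hypothetical.

    # Illustrative only: a policy bundle defined centrally by the governance
    # board but stored, and executed, locally by each data product.
    policy = {
        "access": {"allowed_roles": ["analyst", "data-scientist"]},
        "retention": {"max_age_days": 365},
        "encryption": {"at_rest": True, "in_transit": True},
        "right_to_be_forgotten": {"supported": True, "sla_days": 30},
    }

    def can_read(role: str) -> bool:
        # Access control evaluated by the data product itself,
        # against the centrally agreed policy.
        return role in policy["access"]["allowed_roles"]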

For me the core idea is that of a swarm of self-contained data products which are all independent but which, by virtue of simple behaviours and some mesh-spanning services (such as a data catalogue), provide a whole that is greater than the sum of its parts. A parallel is drawn here with domain-driven design and microservices, on which the data mesh is modelled.

I found the parts on designing the data mesh platform and data products most interesting since this is the point I am at in my work. Dehghani breaks the data mesh down into three “planes”: the infrastructure utility plane, the data product experience plane, and the mesh experience plane (this is where the data catalogue lives).

We spent some time worrying over whether it was appropriate to include data processing functionality in our data mesh – Dehghani makes it clear that this functionality is in scope, arguing that the benefit of the data product orientation is that only a small number of data pipelines are managed together rather than hundreds or possibly thousands in a centralised scheme.

I have been spending my time writing what Dehghani describes as the "sidecar": common code that sits inside the data product to provide standard functionality. In terms of useful new ideas, I have been worrying about the versioning of data schemas and attributes; Dehghani proposes that "bitemporality" is what is required here (see Martin Fowler’s blog post here for an explanation). Essentially bitemporality means recording the time at which schemas and attributes were changed, as well as the time at which data was provided, alongside the time at which it was processed. This way one can always recreate a processing step simply by checking which set of metadata and data were in play at the time (bar data being deleted by a data retention policy).
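
A minimal sketch of the idea in Python, assuming nothing about Dehghani’s own implementation (the class and field names here are my own invention): each schema version carries both the time from which it was valid in the domain and the time at which it was recorded, so a past processing run can be replayed against the metadata that was actually in play.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class SchemaVersion:
        name: str              # illustrative name, e.g. "orders_v3"
        valid_from: datetime   # when this version took effect in the domain
        recorded_at: datetime  # when the data product learned about it

    def schema_as_of(history: list[SchemaVersion],
                     valid_time: datetime,
                     processing_time: datetime) -> SchemaVersion | None:
        # Consider only versions already known at processing_time, then pick
        # the latest one that was valid at valid_time.
        known = [v for v in history if v.recorded_at <= processing_time]
        in_force = [v for v in known if v.valid_from <= valid_time]
        return max(in_force, key=lambda v: v.valid_from, default=None)

Replaying last month’s run with last month’s processing time then returns the schema that was actually in force, even if newer versions have since been recorded.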

Data Mesh also encouraged me to decouple my data catalogue from my data processing, so that a data product can act in a self-contained way without depending on the data catalogue which serves the whole mesh and allows data to be discovered and understood.

Overall, Data Mesh was a good read for me, in large part because of its relevance to my current work, but it is also well written and presented. The lack of mention of specific technologies is rather refreshing and means the book will not go out of date within the next year or so. The first companies are still only a short distance into their data mesh journeys, so no doubt a book written in five years’ time will be a different one, but I am trying to solve a problem now!

Book review: The Art of More by Michael Brooks

The Art of More by Michael Brooks is a history of mathematics written by someone whose mathematical ability is quite close to mine – that’s to say we did pretty well with maths at school but when we went to university we reached a level where we stopped understanding what we were doing and started just manipulating symbols according to a recipe.

The book proceeds chronologically, starting with the origins of counting some 20,000 years ago and finishing with information theory in the mid-20th century, with chapters covering arithmetic, geometry, algebra, calculus, logarithms, imaginary numbers, statistics and information theory.

It is probably chastening to modern mathematicians and scientists that much of the early work in maths on developing the number system, including zero and negative numbers, was driven by accounting and banking. Furthermore, much of the early innovation came from China, India and the Middle East with Western Europe only picking up the ideas of zero and negative numbers in around the 13th century.

Alongside the development of the number system, the ancient Greeks and others were developing geometry; the Greeks seemed to go off numbers when they discovered irrational numbers – those which cannot be expressed exactly as a ratio of integers! Geometry is essential for construction, surveying, navigation and mapmaking – sailors have often been competent mathematicians through necessity. Geometry also plays a part in the introduction of accurate perspective in drawings and paintings.

Complementing geometry is algebra, developed in the Arabic world. Our modern algebraic notation did not come into being until the 16th century, with the introduction of the equals sign and what we would understand as equations. Prior to this, problems were expressed either geometrically or rather verbosely.

Leading on from algebra was calculus – the maths of change. It started sometime around the beginning of the 17th century with Kepler calculating the volumes of wine barrels whilst he was preparing for his wedding. There was further work on infinitesimals through the century before the work of Newton and Leibniz, who are seen as the inventors of calculus. I was struck here by how the key characters in the development of calculus (Newton, Leibniz, Fermat, Descartes and the Bernoullis) all sounded like deeply unpleasant men. Is this the result of the distance of history and the activities of various proponents for and against in the intervening centuries? Or were they really just deeply unpleasant men?

Doing a lot of calculation started to become a regular occurrence for sailors, as well as for people such as Kepler and Newton working on the orbits of various celestial bodies. John Napier’s invention of logarithms, and his tables of logarithms published in 1614, greatly simplified calculations by converting multiplication and division into the addition and subtraction of values looked up in his tables. The effort to create the tables was massive: it took Napier 20 years to prepare his first set, containing millions of values. Following Napier’s publication, logarithms reached their modern form (including natural logarithms) by 1630, and mechanical calculating devices like the slide rule were quickly invented. I grew up in a house with slide rules, although by the time I was old enough to appreciate them electronic calculators had taken over. Napier was also an early promoter of the modern decimal system. Logarithms also link to exponential growth, highly relevant as we still wait for the COVID pandemic to subside.
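
As a quick worked example of the trick (my numbers, not the book’s): to multiply two numbers you look up their logarithms, add them, and look the result back up.

    \log_{10}(347 \times 29) = \log_{10} 347 + \log_{10} 29 \approx 2.5403 + 1.4624 = 4.0027
    347 \times 29 = 10^{4.0027} \approx 10063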

Historically, the next area of maths is the invention of imaginary numbers; if you don’t know what these are then I’m not going to be able to explain them in the space of a paragraph! There is a link here with natural logarithms through Euler’s identity, which somewhat ridiculously manages to link e, pi and i in one really short equation. I was not previously familiar with Charles Steinmetz, who introduced complex numbers into the analysis of electrical circuits responding to alternating currents – although it is a very elegant way of handling the problem and a method I used a lot at university. Largely when we talk about complex numbers we are discussing the addition of i, the square root of -1, to our calculations. But there are additionally quaternions, invented by William Hamilton, which add three imaginary units, i, j and k, to the real numbers; the limit is the octonions, which add seven imaginary units to the real numbers. I am curious as to why we cannot have more than seven flavours of imaginary unit.
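
For reference, the identity in question is

    e^{i\pi} + 1 = 0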

Statistics is my area of mathematics; I’m a member of the Royal Statistical Society. I think the thing I learned from this chapter was that the word "statistics" has its origins in German, meaning "facts about the state". I quite liked Brooks’ description of p-values, which seemed particularly clear to me. Brooks highlights some of the sordid eugenicist history of statistics, as well as the more enlightening work of Florence Nightingale and others.

The book finishes with a chapter on information theory, largely based on the work of Claude Shannon but with roots in the work of Leibniz and George Boole. Boole invented his Boolean logic in the mid-19th century in an attempt to understand the mind, but his work on "binary" logic was neglected for 70 or so years until it was revived by Shannon and other pioneers of early computing.

This is a fairly informal history of mathematics; I found it very readable, but it includes a number of equations which might put off the completely non-mathematical.