Designing Data-Intensive Applications by Martin Kleppmann does pretty much what it says in the title. The book provides a lot of detail on how various types of databases and database functionality work, and how these can be plumbed together to build applications. It is reminiscent of Seven Databases in Seven Weeks by Eric Redmond and Jim R. Wilson, in the sense that it provides a broad overview of a range of different data systems which are specialised for different applications. It is authoritative and well-written. Seven Databases is more concerned with the specifics of particular NoSQL databases, whilst Designing Data-Intensive Applications is concerned with data applications as a whole rather than just the underlying databases.
The book is divided into three broad sections covering the foundations of data systems, distributed data and derived data. Each chapter starts with a cartoon map of the territory, which I thought would be a bit gimmicky, but it serves as a nice summary of what the chapter covers, particularly in terms of the software available.
The section on the foundations of data systems talks about reliability, scalability and maintainability before going on to discuss types of database (relational, graph and so forth) and some of the low-level implementation of data storage, such as hash indexes and B-trees.
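To give a flavour of what the book means by a hash index, here is a toy sketch of my own (it is not code from the book): an append-only log on disk, with an in-memory hash table mapping each key to the byte offset of its most recent record.

```python
import os

class LogStore:
    """Toy append-only key-value store with an in-memory hash index."""

    def __init__(self, path):
        self.path = path
        self.index = {}                 # key -> byte offset of its most recent record
        open(self.path, "ab").close()   # make sure the log file exists

    def set(self, key, value):
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)      # records are only ever appended to the end
            offset = f.tell()
            f.write(f"{key},{value}\n".encode("utf-8"))
        self.index[key] = offset        # the hash index points at the newest record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)              # jump straight to the record, no scanning
            line = f.readline().decode("utf-8").rstrip("\n")
            return line.partition(",")[2]

store = LogStore("toy_db.log")
store.set("42", "hello")
store.set("42", "goodbye")              # supersedes the earlier record for key "42"
print(store.get("42"))                  # -> goodbye
```

Real storage engines add compaction of the log, crash recovery of the index and so on, which is exactly the sort of detail the book goes into.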
Scalability is, in part, about a system continuing to return responses in a timely fashion as load grows: Amazon have observed sales drop by 1% for every 100ms of delay, and others have reported a 16% drop in customer satisfaction for a 1 second slowdown. The old academic in me twitches at repeating these statistics without citing the reference, although Designing Data-Intensive Applications itself is heavily referenced.
There is some interesting historical detail, including the IMS database, which IBM built for the Apollo space program in the late 1960s and which is still available today, and the CODASYL model for graph-like databases from a little later. It's interesting to see how some of these older models have been revisited recently in light of the advent of fast, large memory in place of slow disk or even tape drives.
I was introduced to databases rather late in my career; they are not really a core part of my scientific computing background. Learning the distinction between OLAP (analytics) and OLTP (transactions) databases was useful. Briefly, transactional databases are optimised to work on single rows and provide fairly strong guarantees of transactional integrity. The access pattern for analytics databases is different: analytical workflows typically take the contents of an entire column and carry out aggregations and calculations over it. Transactions matter less for such databases, but consistency is still important; a query may take a long time to run, but it should return results as if it ran against the database at a single point in time. These workloads are better served by so-called column stores such as Vertica.
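To make the distinction concrete, here is a minimal sketch of my own (not an example from the book) contrasting the two layouts:

```python
# Row-oriented layout: each record is kept together, which suits OLTP,
# where a transaction typically reads or updates one row at a time.
rows = [
    {"order_id": 1, "customer": "alice", "amount": 25.0},
    {"order_id": 2, "customer": "bob",   "amount": 40.0},
    {"order_id": 3, "customer": "alice", "amount": 15.0},
]
order_2 = next(r for r in rows if r["order_id"] == 2)   # single-row lookup

# Column-oriented layout: each column is stored contiguously, which suits OLAP,
# where an analytical query scans one column across a huge number of rows.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["alice", "bob", "alice"],
    "amount":   [25.0, 40.0, 15.0],
}
total_sales = sum(columns["amount"])                     # whole-column aggregate
print(order_2, total_sales)
```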
The section on distributed data systems covers replication, partitioning, transactions and consensus. The problem with distributed systems is that you can never be sure whether something has failed permanently or is just being slow to respond, and it's difficult to know what order things happened in. This reminds me a bit of teaching special relativity to physics undergraduates long ago.
It is hard even to rely on timekeeping on servers. I found this a bit surprising: when we put our minds to measuring time we can be incredibly accurate. GPS time signals are accurate to well under a microsecond, yet servers well synchronised using NTP (the Network Time Protocol) achieve something like 100 milliseconds – a factor of thousands poorer – and even that only if everything is configured correctly. This matters because we cannot rely on timestamps to provide a unique order for events across multiple servers, nor can we even rely on NTP-synced timestamps on a single machine to be always increasing!
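One small, concrete consequence of that last point (my own illustration rather than an example from the book): even on a single machine, elapsed time should be measured with a monotonic clock rather than the wall clock, because the wall clock can be stepped backwards by an NTP adjustment.

```python
import time

# The wall clock can be stepped backwards or forwards by NTP, so subtracting
# two readings of time.time() can give a distorted, even negative, duration.
start_wall = time.time()

# A monotonic clock never goes backwards, which makes it the right tool for
# measuring durations, though it says nothing about ordering events across
# different machines, which is the harder problem the book is concerned with.
start_mono = time.monotonic()

_ = sum(range(1_000_000))        # stand-in for the operation being timed

elapsed_wall = time.time() - start_wall
elapsed_mono = time.monotonic() - start_mono
print(elapsed_wall, elapsed_mono)
```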
The two big themes in terms of databases are transactions and consensus. These are the concepts that provide the best assurance of the integrity of operations and their success across distributed systems. I use the word “assurance” rather than “guarantee” deliberately, because reading Designing Data-Intensive Applications it becomes clear that perfection is hard to achieve and there are always trade-offs. The book also highlights the problems with the language used to describe these features: terms such as “consistency” are used in different ways in different contexts.
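As a small illustration of what a transactional guarantee buys you at the single-node end of the spectrum, here is a sketch of my own using Python's built-in sqlite3 module (not an example from the book): the two halves of a transfer either both take effect or neither does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:   # commits on success, rolls back if an exception escapes the block
        conn.execute("UPDATE accounts SET balance = balance - 60 WHERE name = 'alice'")
        raise RuntimeError("crash before the matching credit is applied")
        conn.execute("UPDATE accounts SET balance = balance + 60 WHERE name = 'bob'")
except RuntimeError:
    pass

# The debit was rolled back along with the rest of the failed transaction.
print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# -> [('alice', 100), ('bob', 0)]
```

The hard part, which the book dwells on at length, is providing this sort of behaviour when the data is spread across many machines.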
The derived data section starts with praise for the Unix way of piping data between simple command line tools (Data Science at the Command Line covers this area in much more detail). It then goes on to discuss the MapReduce ecosystem and the differences between batch and stream processing. This feels like a section I will be returning to.
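To give a flavour of the batch-processing model, here is a minimal single-machine sketch of my own of the MapReduce idea, using word counting as the standard toy example: the map phase emits key-value pairs, the shuffle groups them by key, and the reduce phase aggregates each group.

```python
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map: turn each input record into a stream of (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key (a real framework does this across machines).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate the values collected for each key.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)
# -> {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}
```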
The book finishes with some speculation as to the future of the field. Two thoughts stuck with me. The first is the idea of federated databases: systems which use a common query language to interface with multiple different datastores. The second is that of unbundling functionality, so that, for example, data may be stored in a standard SQL database for unique-ID-based queries and in Elasticsearch for full-text search queries – in some ways these are simply different facets of the same idea.
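Here is a sketch of my own of what that unbundling can look like at the application level (the SearchIndex class below is a toy in-memory stand-in for something like Elasticsearch, not its real API): every write goes to the system of record and is also forwarded to a secondary store specialised for full-text search.

```python
import sqlite3
from collections import defaultdict

class SearchIndex:
    """Toy in-memory inverted index standing in for a full-text search service."""
    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of document ids

    def index(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        return sorted(self.postings[term.lower()])

# System of record: a SQL table keyed by unique ID.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
search = SearchIndex()

def store_article(doc_id, body):
    # Write to the primary store and forward the same data to the search index.
    with db:
        db.execute("INSERT INTO articles VALUES (?, ?)", (doc_id, body))
    search.index(doc_id, body)

store_article(1, "Replication and partitioning in distributed databases")
store_article(2, "Stream processing with partitioned logs")

# ID-based lookup goes to SQL; full-text lookup goes to the search index.
print(db.execute("SELECT body FROM articles WHERE id = ?", (1,)).fetchone()[0])
print(search.search("partitioned"))   # -> [2]
```

Keeping such derived stores in step with the system of record is exactly the kind of problem the derived data section addresses.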
Designing Data-Intensive Applications is a big book with no padding; it is packed with detail, including many references, yet remains readable. Of the fair number of technology books I have read, this is definitely one of the better ones.