Debunking Data-Intensive Applications

February 17, 2021 · 9 min read

“Our motto is still alive and to the point: Pessimism of the intellect, optimism of the will.”

Antonio Gramsci, adapted from Romain Rolland

There is something fascinating about solving a seemingly impossible problem. This is perhaps what The Queen’s Gambit and data conflict resolution have in common. Watching Martin Kleppmann’s talk on the latter at the 2016 GOTO conference, I realized that this is what’s really interesting about designing distributed data systems: It’s a seemingly impossible attempt to decide what’s real amid conflicting accounts—and yet, somehow, we manage to do it anyway. This talk, many recommendations, and the ever-growing importance of distributed data in backend engineering all prompted me to read the second section of Kleppmann’s book, Designing Data-Intensive Applications, where he discusses the core concepts and tradeoffs involved in designing distributed data systems, from replication and partitioning to transactions and consensus algorithms. I walked away from an initial read of this section of the book with a lot of my previous ideas about distributed systems debunked—from the meaning of concurrency, to the nuances of popular concepts such as ACID and CAP, to the reliability of clocks.

[Image: A map of the sea of distributed data, where each island represents a chapter in the book. Source: Martin Kleppmann’s GitHub repo of DDIA literature references.]

Concurrent doesn’t mean simultaneous

The reason this challenge of reconciling conflicting accounts arises in distributed systems to begin with is that we have more than one node. In distributed systems using single-leader replication, this isn’t an issue: Only one node can be the leader, and only that leader node can accept writes, copying them over to the remaining read-only nodes. Of course, replication lag and even data loss can still occur, causing different nodes to have different versions of the data, but a single source of truth for writes still exists. A whole other beast reveals itself when we consider multi-leader and leaderless replication, where more than one node—sometimes even all nodes!—can accept writes. The problem then becomes not just ensuring that all nodes have the same data, but agreeing on what that data should be in the first place. This is because a write processed by Node A may conflict with another write processed by Node B; in this situation, how do we decide who’s “right”?

Designing a distributed system with this type of replication then becomes about managing concurrency, where concurrency here doesn’t necessarily mean simultaneity. We don’t care whether or not these writes happened at the exact same time; what causes them to conflict is that they happened independently of each other and didn’t know about each other. Node A was acting on its own, unaware of the actions of Node B, and vice versa. Kleppmann goes over a range of strategies that can be used to detect and handle such concurrent writes, but the key takeaways for me here were that:

  • this idea of concurrent writes is fundamental to conflict resolution in distributed data systems, and

  • there isn’t necessarily a “right” way to resolve these conflicts to end up with an “accurate” state of the data; rather, each method has its tradeoffs, and a given distributed system will provide guarantees based on the tradeoffs of the method it uses (think, for example, of collaborative text editors that don’t allow users to edit offline to avoid having to manage this conflict).
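To make that first point concrete, here is a minimal sketch, in Python, of one of the strategies Kleppmann covers: comparing version vectors to detect that two writes are concurrent. The node names and the compare() helper are mine, not the book’s; the idea is simply that neither write’s version vector dominates the other’s, so the system knows the writes happened independently and must hand the conflict to some resolution strategy.

```python
# A toy version-vector check; the node names and this compare() helper are
# illustrative, not from the book.

def compare(vv_a: dict, vv_b: dict) -> str:
    """Compare two version vectors (mapping of node name -> write counter)."""
    nodes = set(vv_a) | set(vv_b)
    a_ahead = any(vv_a.get(n, 0) > vv_b.get(n, 0) for n in nodes)
    b_ahead = any(vv_b.get(n, 0) > vv_a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"          # neither write knew about the other
    if a_ahead:
        return "a happened after b"
    if b_ahead:
        return "b happened after a"
    return "equal"

# Node A accepts a write having seen its own 2nd write and B's 1st;
# Node B accepts a write having seen its own 2nd write and A's 1st.
write_a = {"A": 2, "B": 1}
write_b = {"A": 1, "B": 2}
print(compare(write_a, write_b))  # -> "concurrent": conflict resolution needed
```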

The letter “C” doesn’t belong in “ACID”

These guarantees can differ, but one way to meet certain safety guarantees is by using a transaction, which groups multiple reads and writes into one logical operation that is executed at once. As Kleppmann puts it, “either the entire transaction succeeds (commit) or it fails (abort, rollback). If it fails, the application can safely retry…because it doesn’t need to worry about partial failure.” How can a transaction do that? By executing these reads and writes as one atomic unit, isolated from other units, and ensuring that the result of that unit is consistent and durable—in other words, by implementing ACID.
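As a rough illustration of that commit-or-rollback behavior, here is a sketch using Python’s built-in sqlite3 module; the accounts table, the transfer() function, and the simulated crash are invented for this example, not taken from the book.

```python
# A sketch of atomic commit/rollback using Python's built-in sqlite3.
# The accounts table, transfer(), and the simulated crash are invented here.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(amount: int, fail_midway: bool = False) -> None:
    with conn:  # one transaction: commits on success, rolls back on any error
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = 'alice'",
            (amount,))
        if fail_midway:
            raise RuntimeError("simulated crash between the two writes")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = 'bob'",
            (amount,))

try:
    transfer(50, fail_midway=True)
except RuntimeError:
    pass  # the whole transfer was rolled back; no partial write survives

print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# -> [('alice', 100), ('bob', 0)]: unchanged, so the application can safely retry
```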

In popular software engineering culture, ACID is usually viewed as a formula for preserving data integrity, in opposition to BASE (Basically Available, Soft state, and Eventual consistency) systems. However, Kleppmann argues that we should take ACID with a grain of salt as there is no clear-cut, universal definition for it. (He even goes so far as to say that it has now become “mostly a marketing term.”) This holds true for all four guarantees, but Kleppmann zeroes in specifically on consistency, which he claims shouldn’t be in the ACID abbreviation to begin with.

Consistency is an ambiguous term, but it’s used in the ACID sense specifically to refer to the validity of application data according to certain invariants. For example, if you have a hospital management system that guarantees that every patient will always be assigned to one doctor, all writes in all system transactions must meet that guarantee. This is not the responsibility of the database management system, but of the application itself. Sure, certain database-level constraints can help; in the previous example, we could add a patients.doctor_id column as a NOT NULL foreign key in a relational database. But ultimately, unlike atomicity, isolation, and durability, defining and ensuring consistency according to certain invariants is a property of applications, not database transactions.
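To make that division of labor concrete, here is a hedged sketch in Python with sqlite3. The schema mirrors the hospital example above, while the cap of 20 patients per doctor is a rule I made up to show an invariant the schema alone can’t express and the application must enforce.

```python
# The NOT NULL foreign key lives in the schema, but a cross-row invariant
# (here, a made-up cap of 20 patients per doctor) must be checked by the
# application, inside a transaction.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.execute("CREATE TABLE doctors (id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE patients (
        id INTEGER PRIMARY KEY,
        doctor_id INTEGER NOT NULL REFERENCES doctors(id)  -- database-level help
    )
""")

MAX_PATIENTS_PER_DOCTOR = 20  # an invariant the schema alone can't express

def assign_patient(patient_id: int, doctor_id: int) -> None:
    with conn:  # the check and the write commit (or fail) together
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM patients WHERE doctor_id = ?", (doctor_id,)
        ).fetchone()
        if count >= MAX_PATIENTS_PER_DOCTOR:
            raise ValueError("doctor is at capacity")  # app-defined consistency
        conn.execute(
            "INSERT INTO patients (id, doctor_id) VALUES (?, ?)",
            (patient_id, doctor_id))
```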

The CAP theorem isn’t useful anymore

Another abbreviation that often comes up when discussing data integrity in distributed systems is the CAP theorem, which stands for Consistency, Availability, and Partition Tolerance. The theorem posits that a system that requires strong consistency will be less tolerant of network faults than one that doesn’t. Strong consistency is used here in the sense of linearizability: the system behaves as if there were only a single, up-to-date copy of the data, so reads never return stale values. Like ACID, CAP is quite popular. In fact, before I read this section of the book, my understanding was that distributed data == the CAP theorem; I thought it was a helpful heuristic through which to conceptualize tradeoffs in distributed data systems.

Kleppmann acknowledges that when the theorem was introduced by Eric Brewer in 2000, it helped provide a framework through which to think about a known database tradeoff, but he argues that it’s no longer useful now. This is because CAP is often presented as a “pick 2 out of 3” type of situation, but network partitions are a kind of fault, not a design choice: they will happen whether database designers like it or not, and the only real choice is what the system gives up (consistency or availability) while a partition lasts. The theorem also adopts a particular definition of availability that deviates from how the industry usually uses the word. In addition to the confusion these two points create, CAP has a narrow scope; it only considers one type of consistency (strong consistency) and one type of fault (network partitions). The reality of distributed systems today is much more complex.

Clocks aren’t reliable

Here’s another theorem: Since time is a social construct, and since our software systems model our sociality, these software systems rely on time as well. Systems need to know time in two senses: durations (for example, how long ago a request was sent) and points (the timestamp of a given system event). While time can be a fickle friend on one machine, things start to become even more complex when we consider a distributed system communicating over a network. That communication takes time, and since network delays are variable and unpredictable, we can’t always tell how long it will take for a message to go from one machine to another, making it difficult to order distributed system events chronologically.

Network latency isn’t the only problem. Each machine has its own slightly inaccurate, drift-prone hardware clock, meaning machine clocks in the same distributed system can go out of sync. (And this isn’t even taking into consideration nuances like leap seconds.) They can be synchronized using something like the Network Time Protocol (NTP), where each machine adjusts its clock based on the time reported by a group of servers, which in turn take their time from a more accurate source such as a GPS receiver. However, even NTP time can be inaccurate, and the synchronization is itself subject to network delays, can be blocked by misconfigured firewalls, and can fail outright if the machine’s clock has drifted too far from NTP time.
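To see why that synchronization is at the mercy of the network, here is a rough sketch of the offset estimate at the heart of NTP-style synchronization. It is heavily simplified (the real protocol filters many samples from several servers), and the timestamps below are made up.

```python
# A heavily simplified, hypothetical version of the offset estimate behind
# NTP-style synchronization; the timestamps are made up.
# t0: client clock when the request left; t1: server clock when it arrived;
# t2: server clock when the reply left;  t3: client clock when it arrived.

def estimate_offset(t0: float, t1: float, t2: float, t3: float) -> float:
    """How far the client's clock appears to lag behind the server's."""
    return ((t1 - t0) + (t2 - t3)) / 2  # assumes both legs take equally long

offset = estimate_offset(t0=100.000, t1=100.240, t2=100.241, t3=100.005)
print(round(offset, 3))  # 0.238 -> the client thinks it is about 238 ms behind
# If the request and reply legs take different amounts of time, that symmetry
# assumption breaks, and the estimate can be off by up to half the round trip.
```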

Time turns out to be even more of a social construct than we previously thought. This doesn’t mean it can’t be useful, though. A degree of clock synchronization can be achieved for hard real-time systems that depend on it. For other types of systems, it can be worth treating a clock reading as a range of time with a confidence interval rather than as an exact point, or using logical instead of physical clocks. Instead of measuring the current time or durations, logical clocks determine the ordering of events by incrementing a counter every time an event happens.
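Here is a minimal sketch of that last idea, a Lamport-style logical clock; the class and method names are my own. Each node ticks a counter for its own events and, on receiving a message, jumps its counter past the sender’s timestamp, so an event that causally follows another always gets a larger number.

```python
# A minimal Lamport-style logical clock; class and method names are my own.

class LamportClock:
    def __init__(self) -> None:
        self.counter = 0

    def local_event(self) -> int:
        """Tick the counter for an event that happens on this node."""
        self.counter += 1
        return self.counter

    def send(self) -> int:
        """Tick and return the timestamp to attach to an outgoing message."""
        self.counter += 1
        return self.counter

    def receive(self, message_timestamp: int) -> int:
        """Jump past the sender's timestamp so causal order is preserved."""
        self.counter = max(self.counter, message_timestamp) + 1
        return self.counter

# Two nodes exchange a message: the receive is ordered after the send,
# regardless of what either machine's physical clock says.
node_a, node_b = LamportClock(), LamportClock()
t_send = node_a.send()              # 1
t_receive = node_b.receive(t_send)  # 2 > 1
print(t_send, t_receive)
```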

Conclusion

There is a lot of confusion surrounding distributed systems. Ambiguous terms are overloaded with multiple meanings, historical heuristics are no longer useful, and marketing buzzwords abound. But even after removing all the fluff, distributed systems are themselves intrinsically complex to reason about and design. As single-node problems multiply across multiple machines and new problem domains open their doors, the core challenge of a distributed system presents itself: How does it decide what’s real when each of its machines offers a different version of reality? As distributed system enthusiasts, we can help it do so by debunking our erroneous ideas and seeking to understand these systems as they are. As we do so, we develop a healthy pessimism; we understand the nuances of potential issues and the tradeoffs of the methods we can use to solve them. We come to appreciate that each system model can only offer certain guarantees under certain conditions. We realize the discrepancies between theoretical algorithms, system models, and practical realities. And eventually, this pessimism gives us the optimism—and the fascination—to solve a seemingly impossible problem.

Thanks to Ahmed Abdelwahab for his comments on the first draft of this post, and to Ehab Ashraf for the idea to write it.
