Your company is growing steadily and so is your data. Everyday, you reap the benefits of data driven decisions and enjoy all the insights you get from that beefy database. Despite all the exciting things ahead, you can’t help but wonder how long this monolithic storage — this faithful Swiss knife — will continue to support your business needs.
Maybe you already started to notice increased latency on certain workloads, or even faced downtime due to this single point of failure. You also know that you cannot store an infinite amount of data on a single, physical database, but you don’t intend to stop collecting it.
On a beautiful day, you then decide that it’s time to shop for a new, shiny Distributed Database System (DDBS), and you come up with this list of requirements:
- Highly available
- Globally distributed
- Very low latency for read and write queries
- Strongly consistent
- Elastically scalable storage
I may have a bad news for you.
The bad news is that you will never find a distributed database that can successfully optimize for all these properties simultaneously. On the other hand, by understanding the limitations of distributed systems as well as the specificities of your own data workloads, you will surely find a tradeoff that suits your needs.
Let’s start by looking at inherent limitations of distributed systems, with the help of 2 theorems.
In 1998, Eric Brewer first published its CAP principle as follows:
Any networked shared-data system can have at most two of three desirable properties:
- consistency (C) equivalent to having a single up-to-date copy of the data;
- high availability (A) of that data (for updates); and
- tolerance to network partitions (P).
If you are like me, reading this theorem probably didn’t enlighten you that much. To have a better understanding of its implications on the design of distributed systems, let’s first rephrase the 3 properties for mere mortals.
- Consistency: Clients always see the up-to-date version of a data point
- Availability: Node failures do not prevent other surviving nodes from serving requests
- Partition tolerance: The system continues to operate despite network partitions
Now let’s illustrate this with an example.
Here we have a simplistic DDBS comprised of 4 nodes. A user can send queries to the system via some interface that delegates the call to any database node. We also see that this system is facing a network partition — this temporary failure prevents nodes from each side of the partition to communicate with each other, so data synchronization between Node A and the others is not possible. On the other hand, Node A is still accessible from the interface, so a user could still query and update data from it. Here is where it becomes (a little bit) more interesting:
- While data synchronization between some nodes is not possible, if you want your system to be strongly consistent, you cannot allow writing to one side of the partition and querying from the other side, because you can end up with different versions of the same data point. This means you need to render unavailable one side of the partition and send read and write queries to the other side of the partition only. In that case, to be strongly consistent, your system forfeited high availability.
- While data synchronization between some nodes is not possible, if you want your system to be highly available, thus don’t have to drop requests, you need to allow reading and writing to both sides of the partition. This can lead to inconsistent responses, depending what side of the partition handles the requests. To be highly available, your system forfeited strong consistency.
Now, the “2 out of 3” thing is misleading for various reasons (and Dr. Brewer acknowledged that in his follow up paper CAP Twelve Years Later: How the “Rules” Have Changed). If your distributed system continues to operate during a network partition, you have the “P” property by default (hooray). When designing your system, you definitely want it to work under such circumstances, so your choice is truly between “C” and “A” only.
Another useful thing to note is the asymmetry of this Consistency vs Availability tradeoff. The strong consistency property is something that you can guarantee, in the sense that you can design your system to always be strongly consistent. Adversely, your system can never guarantee high availability. This property is typically measured in uptime in “number of nines”, and no system can guarantee 100% uptime.
Finally, the CAP theorem only deals with distributed systems facing network partitions, and does not say anything about tradeoffs in a normal regime. Considering the popularity of cloud providers such as AWS, GCP and Azure, and that network partitions are less frequent now that we stopped hosting stuff in our grandmother’s basement, tradeoffs during normal regime should be the main driver of the design of our systems.
This brings me to the PACELC theorem, which is an extension of CAP that also states necessary tradeoffs when there are no network partitions.
Interestingly enough, it seems like the PACELC theorem went a bit unnoticed, despite being more practical than CAP and giving more general insights on the limitations of distributed systems.
We can summarize it that way: Under a network Partition, the distributed database system needs to choose between Availability or Consistency (based on CAP). Else (under normal condition), it needs to find a tradeoff between Latency and Consistency.
In the paper Consistency Tradeoffs in Modern Distributed Database System Design, its author Daniel J. Abadi explains it that way:
As soon as a DDBS replicates data, a tradeoff between consistency and latency arises. This occurs because there are only three alternatives for implementing data replication: the system sends data updates to all replicas at the same time, to an agreed-upon master node first, or to a single (arbitrary) node first. The system can implement each of these cases in various ways; however, each implementation comes with a consistency/latency tradeoff.
An example might help:
- John changes his username from
Betty(Kung Pow reference, you are welcome) by sending a write request to the DDBS, and Node A handles it.
- Reminder: if you want your system to be strongly consistent, it means that all subsequent requests that users make to retrieve John’s username should return
Betty, even if Node B, C or D handle them.
- For this to be true, a possibility is to only let Node A serve the subsequent read requests until Node B, C and D have been fully synchronized. This will incur additional latency to users that are physically far away from Node A.
- Another possibility is to wait for Node B, C and D to be fully synchronized with Node A before letting anyone query John’s username. In that case, additional latency is also incurred.
- A mix of the 2 previous options is also possible: a quorum protocol (see Paxos or Raft) could ensure that a certain number of replicas have been updated before allowing to read John’s username. In that case, latency is inherent to the protocol, and the tradeoff of synchronous vs asynchronous replica updates still holds.
Of course, if instead you want your system to be as low latency as possible, it’s very intuitive to see that you cannot afford to wait for all replicas to be synchronously updated before allowing subsequent read requests. You can certainly improve your write throughput by not waiting for a complete synchronization, but you also lose your guarantee of data consistency. If Node A takes care of updating John’s username but your system does not wait for other nodes to be up-to-date before allowing reads to this record, some users quickly asking for his username might get
Master Pain, while others get
In short, the main takeaway from this theorem is that under normal conditions, your DDBS can be both highly available and strongly consistent, but aiming for more consistency necessarily results in increased latency. This is where a good understanding of your business requirements, as well as the nature of your data workloads, will help you choose the right latency/consistency tradeoff. That being said, let’s dive a bit more deeper into this consistency property.
Consistency and Isolation levels
The latency/consistency tradeoff is not only limited to distributed database systems. In fact, a lot of transactional databases like PostgreSQL and MySQL are also designed with this tradeoff at their core, and they provide configurable isolation levels that directly impact this tradeoff.
Here is a trick question to introduce the different isolation levels. Let’s say you have the following
| id | username |
| 1 | Betty |
| 2 | McSwinster |
And you run this transaction that selects all usernames, then waits for 1 minute and re-selects them before returning the results:
SELECT username FROM users;
WAITFOR DELAY '00:01:00'
SELECT username FROM users;
But during the wait period,
- A concurrent transaction renames
- Another one starts to rename
BigShot, but does not have time to commit before the initial query returns, and
- Finally, another one adds a new row:
At last, I ask you my one million dollar question: what result can we expect from that
SELECT request up there? The answer is… it depends on the isolation level you set to your database!!1!
Basically, the isolation level determines to what extent a transaction is visible to the outside world. Stronger isolation means stronger data integrity, but also higher latency. Weaker isolation means better throughput, but you are vulnerable to anomalies and inconsistencies. More specifically dirty reads (can query uncommitted data), non-repeatable reads (rows getting queried change during the transaction), and phantom reads (while records are being read by a transaction, new rows can be added or removed concurrently that affect the result).
This is equivalent to “no isolation”. Read uncommitted means that reading or writing to a table does not lock the records you are interacting with. A concurrent write request could overwrite the rows you are currently trying to update, and you can read “dirty” rows that have been updated but not yet committed. You can guess that not supporting isolated transactions at all means concurrent queries don’t have to wait for locks, thus optimizing latency. On the flip side, this isolation level is prone to dirty reads, non-repeatable reads and phantom reads, which makes a lot more to handle at the application level. Regarding the trick question above, a system configured with this isolation level would have returned
[Sally, BigShot, LucieLajoie].
Read committed provides more data consistency than read uncommitted, by locking rows during updates. This prevents dirty reads by only allowing reads on records that have been committed. On the other hand, read requests do not lock the data items involved, so such system is prone to non-repeatable reads and phantom reads. In our scenario, a database with this isolation level would have returned
[Sally, McSwinster, LucieLajoie]. This is the default isolation level for PostgreSQL and Oracle Database, for instance.
Locks are obtained for both read and write requests. This is a more consistent upgrade over read committed that does not allow to update records that are being read by a concurrent transaction. This isolation level is only prone to phantom reads, because records can still be added while your select query is in progress. In our scenario, a database configured that way would have returned
[Betty, McSwinster, LucieLajoie]. This is the default isolation level for MySQL, for instance.
Records are locked during read and write queries, as well as access structures, like indexes. This prevents records from being added while a concurrent read query targeting matching records is in progress (like our addition of
LucieLajoie while the read has not yet returned). A database with that isolation level is then free from dirty reads, non-repeatable reads and phantom reads, but keeps on granting locks like there is no tomorrow, which negatively affects latency. In our scenario, this isolation level would have yielded the following result set:
[Betty, McSwinster]. This is the default isolation level for VoltDB, for instance.
A lot of use cases could justify using each of these 4 isolation levels. Once again, the best option for you directly depends on your business area, service level agreement and workloads.
Consolidated Learning: Replication with PostgreSQL
Now that we have a better understanding of how Consistency, Availability and Latency are interrelated, let’s take a look at a simple case where one wants to go from a monolithic PostgreSQL database to a distributed one using native replication. Despite being one of the simplest DDBS example I could choose, it is a very common one, and it illustrates well the inability to optimize all properties simultaneously, which holds true for all distributed systems.
Let’s say you have a single PostgreSQL database and its failure would render your whole application unavailable. This thought would surely prevent you from sleeping, but luckily you know that PostgreSQL supports synchronizing write transactions to other databases using native streaming replication. In short, all write requests go to a database called the primary, and those transactions are replayed to other read-only replicas by streaming the write-ahead logging records.
The benefits of such systems are manifold:
- Better read performance by load balancing reads to more nodes
- Increased availability by allowing replicas to take over the primary in case of failure
- Reduced latency on reads by placing replicas closer to your customers
- Increased durability by storing copies of your data on more disks
On the other hand — and I hope this starts to become more and more obvious — you don’t get these benefits without sacrificing something else! Depending on how you choose to replicate data from the primary to the replicas, you will either increase write latency, or forfeit strong consistency:
- Synchronous replication: write transaction is considered complete only when it has been copied to all replicas. This will definitely increase your write latency, as your system becomes as slow as your slowest node. Note that PostgreSQL supports 2 types of synchronous replications, each of them having its latency/consistency tradeoffs.
- Asynchronous replication: you write transaction completes as soon as the primary DB considers it done, without any guarantee about the state of this data item on the replicas. This will render your system eventually consistent, as you cannot guarantee that all replicas will return the latest version of the data.
If you are instead looking for low write latency or increased storage capacity and still want to stick with PostgreSQL, there is no native solution to that problem, but you might want to check available PostgreSQL extensions like Citus. Again, you’ll have to pick your poison because other tradeoffs are in sight.
All that being said, what is the best configuration, you ask me? Well, this is entirely dependant on your specific use case, so you tell me! On the bright side, you now know what to expect from such distributed systems and won’t be taken by surprise when optimizing for one property negatively impacts some others.
Life is short and there are lots of other subjects I would have liked to discuss here. I first wanted to shed some light on the Consistency vs Availability vs Latency tradeoffs at the heart of distributed systems, as I felt this was not common knowledge for a lot of software engineers. On another note, it seems like these tradeoffs get completely set aside by database marketers. Be it Aurora, Redshift, DynamoDB, Spanner, or FaunaDB , their documentation rarely states the necessary tradeoffs implied in their design. For example, to figure out that using Multi-AZ deployments (synchronous replication of data across availability zones) in Amazon RDS will negatively impact your write throughput, you really need to be actively looking for one single sentence in their documentation.
Once that is well understood, and considering you want to further your knowledge on the topic of DDBS, what should you be looking for next?
- OLTP vs OLAP workloads
- Data sharding
- Query optimization
- Quorum protocols for distributed transactions
- Column-oriented databases
- Specialized databases for timeseries, graphs, search indexes, etc.
- ThE OnLy LiMiT iS yOuR iMaGiNaTiOn