The Saga Of Distributed Transactions
“Distributed transactions are icebergs: they can be hard to see, and they can sink your ship.” - Graham Lea
Introduction:
Transactions are fundamental to any software system that needs to keep data consistent while it is being concurrently read and modified by multiple processes. We are all well versed in the ACID semantics of database transactions, which guarantee that each thread or process sees the data as if it had exclusive access while interacting with it. ACID-compliant transactional models are a near-perfect solution for enforcing data and business invariants when a single monolithic application directly accesses a single transactional data source, such as a relational database. They are far less suitable for a distributed application that interacts with multiple transactional data sources. In that scenario, the de facto standard is XA, which uses two-phase commit to ensure that all participants in a transaction either commit or roll back together. This is a suboptimal approach that breaks down even when only a small number of participants are involved. It limits system throughput and severely restricts our ability to scale out the transactional resources. It relies on a centralized transaction coordinator, which introduces a single point of failure. And as concurrency and the number of resources grow, latency becomes unacceptable, because the whole transaction is bounded by the slowest and least reliable resource in the system.
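To make the blocking behavior concrete, here is a minimal sketch of a two-phase commit coordinator, assuming a hypothetical `Participant` interface (real XA resources expose a similar prepare/commit/rollback contract through their drivers). None of the names below come from an actual XA library.

```java
import java.util.List;

// Hypothetical participant in a two-phase commit. A "yes" vote in
// prepare() obliges the participant to hold its locks until it hears
// the final outcome from the coordinator.
interface Participant {
    boolean prepare();   // phase 1: vote yes/no and hold locks
    void commit();       // phase 2: apply the prepared changes
    void rollback();     // phase 2: discard the prepared changes
}

class TwoPhaseCommitCoordinator {
    // The coordinator is a single point of failure: if it crashes after
    // collecting votes but before broadcasting the outcome, every
    // prepared participant blocks, locks held, until it recovers.
    boolean execute(List<Participant> participants) {
        boolean allPrepared = participants.stream().allMatch(Participant::prepare);
        if (allPrepared) {
            participants.forEach(Participant::commit);
        } else {
            participants.forEach(Participant::rollback);
        }
        return allPrepared;
    }
}
```

The whole transaction moves at the pace of the slowest `prepare()`, which is exactly why throughput collapses as participants are added.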
Most modern software systems are increasingly architected as large collections of distributed microservices, each with its own data sources. Furthermore, these data sources are no longer restricted to conventional SQL relational databases. They could be NoSQL data stores like MongoDB or Amazon DynamoDB, file storage services like Amazon S3, message brokers like Kafka or RabbitMQ, or application caches like Redis. A typical business transaction will span several, or all, of these resources to complete. Many of these resources may not provide strong ACID-compliant transactional guarantees at all. Moreover, many of them must themselves be distributed and geographically dispersed to allow for scalability and fault tolerance. All of these reasons rule out distributed transaction protocols like XA for maintaining strict data consistency, and none of the large cloud providers offers support for two-phase commit.
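To see why this hurts, consider a checkout operation that touches three of these resources in one business transaction. The interfaces below are hypothetical stand-ins for real drivers (a JDBC connection, a Kafka producer, a Redis client); nothing in this sketch is a real API.

```java
// Hypothetical clients standing in for real drivers.
interface OrderStore { void insert(String orderId); }
interface EventBus   { void publish(String topic, String payload); }
interface Cache      { void put(String key, String value); }

class CheckoutService {
    private final OrderStore db;
    private final EventBus bus;
    private final Cache cache;

    CheckoutService(OrderStore db, EventBus bus, Cache cache) {
        this.db = db; this.bus = bus; this.cache = cache;
    }

    // No transaction spans these three writes: a crash between any two
    // of them leaves an order that was stored but never announced, or
    // announced but never cached, and no single store can detect or
    // repair that on its own.
    void placeOrder(String orderId) {
        db.insert(orderId);               // write 1: durable immediately
        bus.publish("orders", orderId);   // write 2: may never happen
        cache.put("last-order", orderId); // write 3: may never happen
    }
}
```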
So, how do we make sure that we still maintain data consistency and enforce business invariants in such a highly distributed, loosely coupled system? How do we account for the perils of the asynchronous, unreliable network that underpins it? Who is responsible for orchestrating a business update across multiple resources? What is the best approach for building a fault-tolerant (non-blocking) consensus protocol that flags a transaction as successful? In this blog post, I will discuss one approach that may solve the problem of doing transactions correctly in a highly distributed, highly scalable system without many of the drawbacks mentioned earlier. Enter the Saga pattern, originally published in 1987 by Garcia-Molina and Salem as a solution for long-lived transactions on relational databases, which applies equally well to distributed microservices.
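To preview the idea before the details: a saga replaces one distributed transaction with a sequence of local transactions, each paired with a compensating action that semantically undoes it. The sketch below is illustrative only; `SagaStep` and `Saga` are names invented here, not types from any saga framework.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// One saga step: a local transaction plus the compensation that
// semantically undoes it (e.g., debit an account / credit it back).
record SagaStep(Runnable transaction, Runnable compensation) {}

class Saga {
    // Run each local transaction in order. If one fails, undo the
    // steps that completed by running their compensations in reverse.
    void execute(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.transaction().run();
                completed.push(step);    // most recent step on top
            } catch (RuntimeException e) {
                completed.forEach(s -> s.compensation().run());
                throw e;
            }
        }
    }
}
```

Unlike two-phase commit, no step ever holds locks while waiting for a global outcome; the price is that other transactions may observe intermediate states before a compensation runs.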