In this article, learn about the distributed saga pattern, how it helps ensure correctness & consistency in microservices, and how you can use AWS Step Functions as a Saga Execution Coordinator.

This is an aspirational transcript of a talk I gave on ‘Sagas with Step Functions’ for an AWS meetup.

📬 Get updates straight to your inbox.

Subscribe to my newsletter so you don't miss new content.

The Problem

Microservices has taken over software. While it has its benefits, the approach brings some added complexity.

The Death Star architecture

Imagine that we’re building a travel booking platform using the microservices approach.

A single high-level action involves calling low-level actions across different microservices

In the above diagram, a high-level business action (‘Booking a trip’) involves making several low-level actions to distinct microservices.

To handle client requests, we create a single-purpose edge service (a Backend for Frontend) that provides the logic of composing calls to all the downstream services. At its core, the Travel Agent Service is an orchestration layer that exposes coarse grained APIs by composing finer grained functionality provided by different microservices.

Transactions are an essential part of software applications. Without them, it would be impossible to maintain data consistency. However, in microservices we no longer have a single source of truth. State is spread across distinct services each with its own data store.

If all of the service calls completed successfully, great! But in the real world, failures occur on a regular basis. How do we handle partial executions - when a subset of requests in the high-level action failed?

With the microservices approach, you can’t just book a flight, a car, and a hotel in a single ACID transaction. To do it consistently, you would be required to create a distributed transaction.

Partial execution: the Book Flight request failed while the other service calls succeeded

If the flight reservation failed, would you like to keep the hotel and car? Probably not.

In this situation, we would need to implement some ad-hoc concurrency control logic in the edge Travel Agent service to handle any potential failures and try to recover from it in some way (Do we cancel the other bookings? What if the flight booking retried and succeeded later on? How do we know what state we are in?)

Without some kind of concurrency control mechanism, we risk having inconsistent data in our application - which is especially bad in distributed systems.

Understandably, this control logic can get very complicated.

Dealing with partial failures and asynchrony is hard… And that’s where distributed sagas come in.

Distributed Sagas

In distributed systems, business transactions spanning multiple services require a mechanism to ensure data consistency across services. The Distributed Saga pattern is a pattern for managing failures, where each action has a compensating action for rollback. Distributed Sagas help ensure consistency and correctness across microservices.

The term saga was first used in a 1987 research paper by Hector Garcia-Molina and Kenneth Salem. It’s introduced as an conceptual alternative for long lived database transactions.

More recently, sagas have been applied to help solve consistency problems in modern distributed systems - as documented in the 2015 distributed sagas paper.

A Saga represents a high-level business process (such as booking a trip) that consists of several low-level Requests that each update data within a single service. Each Request has a Compensating Request that is executed when the Request fails or the saga is aborted.

For example, our travel booking platform’s ‘Book a Trip’ saga consists of the following Requests: BookHotel, BookCar, and BookFlight.

A Request has a corresponding Compensating Request that semantically undoes the Request. CancelHotel undoes BookHotel, CancelFlight undoes BookFlight and so on.

Note that certain actions are not undo-able in the conventional sense.

An email that was sent to the wrong recipient cannot be un-sent. However, we can semantically undo the action by sending another email that says ‘Sorry, please ignore the previous email.’ >

Compensating Requests semantically undoes a Request by restoring the application’s state to the original state of equilibrium before the Request was made.

Distributed Saga guarantee

Amazingly, a distributed saga guarantees one of the following two outcomes:

  1. Either all Requests in the Saga are succesfully completed, or

  2. A subset of Requests and their Compensating Requests are executed.

The catch is for distributed sagas to work, both Requests and Compensating Requests need to obey certain characteristics:

  • Requests and Compensating Requests must be idempotent, because the same message may be delivered more than once. However many times the same idempotent request is sent, the resulting outcome must be the same. An example of an idempotent operation is an UPDATE operation. An example of an operation that is NOT idempotent is a CREATE operation that generates a new id every time.

  • Compensating Requests must be commutative, because messages can arrive in order. In the context of a distributed saga, it’s possible that a Compensating Request arrives before its corresponding Request. If a BookHotel completes after CancelHotel, we should still arrive at a cancelled hotel booking (not re-create the booking!)

  • Requests can abort, which triggers a Compensating Request. Compensating Requests CANNOT abort, they have to execute to completion no matter what.

Sagas as state machines

How do we define a saga? As a state machine.

State machines have been a central concept in programming for a very long time, and are ideal for coordinating many small components with fast, predictable performance.

State machines consist of different states that each perform a specific task. The state machine passes data between components, and decides the next step in the operation of the application.

Distributed Saga implementation approaches

There are a couple of different ways to implement a Saga transaction, but the two most popular are:

  • Event-driven choreography: When there is no central coordination, each service produces and listen to other service’s events and decides if an action should be taken or not.

  • Command/Orchestration: When a coordinator service is responsible for centralizing the saga’s decision making and sequencing business logic.

In this guide, we’ll look at the latter. With the orchestration approach, we define a new Saga Execution Coordinator service whose sole responsibility is to manage a workflow and invoke downstream services when it needs to.

Saga Execution Coordinator

The Saga Execution Coordinator is an orchestration service that:

  • Stores & interprets a Saga’s state machine
  • Executes the Requests of a Saga by talking to other services
  • Handles failure recovery by executing Compensating Requests

Building such a service is a non-trivial exercise, but fortunately we can repurpose an existing service to accomplish our goal.

AWS Step Functions

AWS Step Functions is a workflow management service that can be used as an Saga Execution Coordinator to oversee the execution of our distributed sagas.

AWS Step Functions is a web service that enables you to coordinate the components of distributed applications and microservices using visual workflows. You build applications from individual components that each perform a discrete function, or task, allowing you to scale and change applications quickly.

Step Functions provides a graphical console to visualize the components of your application as a series of steps. It automatically triggers and tracks each step, and retries when there are errors, so your application executes in order and as expected, every time. Step Functions logs the state of each step, so when things do go wrong, you can diagnose and debug problems quickly.

The way Step Functions works is as follows:

It uses its own AWS States Language domain-specific language to define state machines:

There are seven state types you can use to create your state machine. Task can be used as Requests and Compensating Requests, while Choice and Parallel is used for control flow.

By definining state machines with the AWS States Language and Step Functions, your can coordinate components into workflows that match your business requirements, add retry logic, error handling, and more:

AWS also gives you a web interface you can use to debug and track the execution of your Step Functions state machines from the AWS console:

Hands-On with AWS Lambda and Step Functions

In this section, we’ll build a distributed saga for the travel booking example in the beginning of this guide.

A single high-level action involves calling low-level actions across different microservices

You can clone and run the example application on Github.

git clone git@github.com:yosriady/serverless-sagas.git
cd serverless-sagas
serverless deploy

Note that the above example requires you to have the Serverless framework and AWS credentials set up on your machine.

After deploying, you will get a series of AWS Lambda Function ARNs that you can use as the Requests and Compensating Requests in your saga.

Go to the AWS Step Functions console and create a new step function with the states.json file included in the example application. We create the following state machine using the AWS States Language:

{
  "Comment": "A distributed saga example.",
  "StartAt": "BookTrip",
  "States": {
    "BookTrip": {
      "Type": "Parallel",
      "Next": "Trip Booking Successful",
      "Branches": [
         {
          "StartAt": "BookHotel",
          "States": {
             "BookHotel": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:{YOUR_AWS_REGION}:{YOUR_AWS_ACCOUNT_ID}:function:serverless-sagas-dev-bookHotel",
                "ResultPath": "$.BookHotelResult",
                "End": true
              }
          }
        },
        {
          "StartAt": "BookFlight",
          "States": {
            "BookFlight": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:{YOUR_AWS_REGION}:{YOUR_AWS_ACCOUNT_ID}:function:serverless-sagas-dev-bookFlight",
                "ResultPath": "$.BookFlightResult",
                "End": true
              }
          }
        },
        {
          "StartAt": "BookCar",
          "States": {
             "BookCar": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:{YOUR_AWS_REGION}:{YOUR_AWS_ACCOUNT_ID}:function:serverless-sagas-dev-bookCar",
                "ResultPath": "$.BookCarResult",
                "End": true
          	}
          }
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.BookTripError",
          "Next": "Trip Booking Failed"
        }
      ]
    },
    "Trip Booking Failed": {
      "Type": "Pass",
      "Next": "CancelTrip"
    },
    "CancelTrip": {
      "Type": "Parallel",
      "Next": "Trip Booking Cancelled",
      "Branches": [
        {
          "StartAt": "CancelHotel",
          "States": {
            "CancelHotel": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:{YOUR_AWS_REGION}:{YOUR_AWS_ACCOUNT_ID}:function:serverless-sagas-dev-cancelHotel",
              "Catch": [
                {
                  "ErrorEquals": ["States.ALL"],
                  "ResultPath": "$.CancelHotelError",
                  "Next": "CancelHotel"
                }
              ],
              "ResultPath": "$.CancelHotelResult",
              "End": true
            }
          }
        },
        {
          "StartAt": "CancelFlight",
          "States": {
            "CancelFlight": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:{YOUR_AWS_REGION}:{YOUR_AWS_ACCOUNT_ID}:function:serverless-sagas-dev-cancelFlight",
              "Catch": [
                {
                  "ErrorEquals": ["States.ALL"],
                  "ResultPath": "$.CancelFlightError",
                  "Next": "CancelFlight"
                }
              ],
              "ResultPath": "$.CancelFlightResult",
              "End": true
            }
          }
        },
        {
          "StartAt": "CancelCar",
          "States": {
            "CancelCar": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:{YOUR_AWS_REGION}:{YOUR_AWS_ACCOUNT_ID}:function:serverless-sagas-dev-cancelCar",
              "Catch": [
                {
                  "ErrorEquals": ["States.ALL"],
                  "ResultPath": "$.CancelCarError",
                  "Next": "CancelCar"
                }
              ],
              "ResultPath": "$.CancelCarResult",
              "End": true
            }
          }
        }
      ]
  	},
    "Trip Booking Successful": {
      "Type": "Succeed"
    },
    "Trip Booking Cancelled": {
      "Type": "Fail",
      "Cause": "Trip cancelled due to error.",
      "Error": "TripCancelledError"
    }
  }
}

In the above state machine:

  • We define a starting root state "StartAt": "BookTrip"
  • In the States section, we define a list of states in our state machine, such as BookHotel and CancelHotel Task states as well as the BookTrip Parallel state for executing things in parallel.
  • Each state can have a Catch clause which lets you route execution to other states if a certain type of error is returned. In our example, we have a catch configured for BookTrip so that if any of the branches BookHotel, BookFlight, and BookCar errors out the compensating requests CancelHotel, CancelFlight, and CancelCar is executed.
  • We also use the Catch to make our compensating requests retry itself if it failed to execute.

Remember to update any "Resource": "arn:aws:lambda:{YOUR_AWS_REGION}:{YOUR_AWS_ACCOUNT_ID}:function:serverless-sagas-dev-cancelCar" to use your deployed AWS Lambda function ARNs from the previous step.

Once created, you will then be able to visualize and execute your state machine on the AWS console:

Give the AWS States Language and Step Functions a try!

AWS Step Functions Summary

AWS Step Functions makes it easy to coordinate the components of distributed applications and microservices using visual workflows. The service makes it simple to manage distributed sagas that communicate with multiple components.

AWS Step Functions manages the operations and underlying infrastructure for you to help ensure your application is available at any scale.

In closing

We learned about what Distributed Sagas are (a pattern for handling failure in microservices) and how it solves the problem of distributed transactions. It helps ensure correctness and consistency in microservices.

We also learned about Saga Execution Coordinators, and how you can get started with defining state machines using the AWS States Language and AWS Step Functions.

The example application is on Github. Just hit deploy!

Additional reading