Step Functions: Orchestrating Multi-Step Workflows and the Saga Pattern

Kai · 6 min read

Counting clicks in the earlier articles was a single step: receive the event, increment a number, done. But some processes have many ordered steps, branch on intermediate results, and need error handling at each step. Moderating a link before letting it go live is one example: scan whether the URL is safe, then depending on the result either activate or reject it, and if the scan step hits a transient error, retry. Cramming that whole chain into one Lambda makes the code messy and the flow hard to see. Step Functions exists to pull that orchestration out.

Goal

Build a link-moderation state machine with Step Functions: a safety-scan step (Lambda), a step that branches on the result, and two state-update steps that call DynamoDB directly without Lambda. Add Retry and Catch for error handling, run it for real to watch the flow move through the states, then cover the saga pattern for undoing earlier steps when a later one fails. Step Functions Standard bills per state transition, so a workflow of a few steps is essentially free at test scale.

Step Functions orchestrates, it doesn't process

A state machine is a diagram of steps (states) and how to move between them, written in the Amazon States Language (ASL) as JSON. Each state does one thing: call a Lambda, call an AWS service directly, branch on a condition, wait, or finish. Step Functions keeps state between steps, retries automatically per the rules you declare, and records the history of every execution. The heavy lifting still lives in Lambda or a service; Step Functions handles orchestrating them in the right order and dealing with errors.

Standard or Express

Step Functions has two workflow types, and choosing wrong either costs money or loses a guarantee. The docs draw the line clearly: Standard Workflows "support long-running executions (up to one year) with exactly-once execution semantics, making them suitable for non-idempotent actions like payment processing, and are billed per state transition." Express Workflows "are designed for high-volume, short-duration workloads (up to five minutes) with at-least-once execution semantics... billed based on execution count, duration, and memory."

In short: Standard is for long-running processes that need an exactly-once guarantee and history to inspect (approvals, payments); Express is for very high-frequency, short, idempotent flows. Link moderation doesn't need tens of thousands of runs per second, but it does need to be inspectable per execution and to guarantee it won't activate something twice, so we pick Standard.

The moderation workflow

The process has four main states. CheckSafety calls a Lambda that scans the URL and returns safe: true/false. IsSafe is a Choice that branches on that value. Activate and Reject update the link's state in DynamoDB. A fifth state, MarkError, is where CheckSafety lands if it still fails after its retries. The notable part: Activate and Reject do not go through Lambda but call dynamodb:updateItem directly as a service integration, because the job is a single write that needs no code.

        ┌─────────────┐
        │ CheckSafety │  Task -> Lambda scans URL  (Retry 3x, Catch -> MarkError)
        └──────┬──────┘
               ▼
          ┌─────────┐   safe == true
          │ IsSafe  │───────────────▶  ┌──────────┐  dynamodb:updateItem
          │(Choice) │                  │ Activate │  status = "active"
          └────┬────┘                  └──────────┘
               │ default (not safe)
               ▼
          ┌──────────┐  dynamodb:updateItem
          │  Reject  │  status = "rejected"
          └──────────┘
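
The article doesn't show the scan Lambda itself, so here is a minimal sketch of what CheckSafety might run. The input fields linkId and url, the BLOCKLIST contents, and the scheme check are all assumptions for illustration; only the safe field in the result comes from the workflow above.

```python
from urllib.parse import urlparse

# Hypothetical blocklist; the article only says that 'malware' is on it.
BLOCKLIST = ("malware",)


def handler(event, context=None):
    """Return the scan verdict in the shape the IsSafe Choice state reads."""
    url = event["url"]          # assumed input field
    link_id = event["linkId"]   # assumed input field

    parsed = urlparse(url)
    has_valid_scheme = parsed.scheme in ("http", "https") and bool(parsed.netloc)
    has_blocked_word = any(word in url.lower() for word in BLOCKLIST)

    # Pass linkId through: by default a Task's result replaces the state
    # input, so later states only see what this function returns.
    return {
        "linkId": link_id,
        "url": url,
        "safe": has_valid_scheme and not has_blocked_word,
    }
```

Returning linkId alongside safe is a design choice in this sketch: it lets Activate and Reject find the item to update without extra path configuration on the Task.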

The CheckSafety step declares Retry and Catch for error handling:

"CheckSafety": {
  "Type": "Task",
  "Resource": "${ModerateCheckArn}",
  "Retry": [
    {
      "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException", "States.TaskFailed"],
      "IntervalSeconds": 1, "MaxAttempts": 3, "BackoffRate": 2
    }
  ],
  "Catch": [ { "ErrorEquals": ["States.ALL"], "Next": "MarkError" } ],
  "Next": "IsSafe"
}

Retry retries a transient error up to three times with increasing backoff, so the scan Lambda runs at most four times in total. If it's still failing after the attempts run out, Catch catches every remaining error and moves to the MarkError state instead of letting the execution fail and leave the link stuck in pending. Written inside a single Lambda, this error handling would be a tangle of try/catch and loops; here it's declarative.
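
To make the backoff concrete: the first retry waits IntervalSeconds, and each later retry multiplies the previous wait by BackoffRate. Here is a small sketch of that arithmetic for the values in the CheckSafety definition. The function name retry_delays is made up for illustration, and the calculation ignores how long the Lambda itself takes to run.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Seconds waited before each retry attempt, in order."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


delays = retry_delays(interval_seconds=1, max_attempts=3, backoff_rate=2)
print(delays)       # [1, 2, 4]
print(sum(delays))  # 7
```

So a scan that keeps failing spends about seven seconds waiting before Catch takes over.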

The branch step is a Choice that reads the safe field in the data:

"IsSafe": {
  "Type": "Choice",
  "Choices": [ { "Variable": "$.safe", "BooleanEquals": true, "Next": "Activate" } ],
  "Default": "Reject"
}
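
The Activate and Reject states aren't shown in the article. Here is a sketch of what Activate could look like, written as a Python dict so it can be dumped to ASL JSON. The $.linkId input field is an assumption; the key layout (PK of LINK#id, SK of META) follows the cleanup commands later in the article. Reject would be the same state with "rejected" as the value.

```python
import json

# Sketch of the Activate state. "${TableName}" is filled in by SAM's
# DefinitionSubstitutions; "$.linkId" is an assumed field in the state input.
activate_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::dynamodb:updateItem",
    "Parameters": {
        "TableName": "${TableName}",
        "Key": {
            # Build the partition key from the input with an intrinsic function.
            "PK": {"S.$": "States.Format('LINK#{}', $.linkId)"},
            "SK": {"S": "META"},
        },
        # "status" is a DynamoDB reserved word, so it is aliased as #s.
        "UpdateExpression": "SET #s = :s",
        "ExpressionAttributeNames": {"#s": "status"},
        "ExpressionAttributeValues": {":s": {"S": "active"}},
    },
    "End": True,
}

print(json.dumps({"Activate": activate_state}, indent=2))
```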

The state machine is declared in SAM with AWS::Serverless::StateMachine, with the ARN values and table name substituted into the ASL via DefinitionSubstitutions:

LinkModerationStateMachine:
  Type: AWS::Serverless::StateMachine
  Properties:
    Type: STANDARD
    DefinitionUri: statemachine/moderation.asl.json
    DefinitionSubstitutions:
      ModerateCheckArn: !GetAtt ModerateCheckFunction.Arn
      TableName: !Ref Table
    Policies:
      - LambdaInvokePolicy: { FunctionName: !Ref ModerateCheckFunction }
      - DynamoDBCrudPolicy: { TableName: !Ref Table }

Running it for real

Create two links in pending state, one pointing at a normal URL, one containing a word from the blocklist. Run the workflow for each, then read the state. Both executions return SUCCEEDED, and the link state reflects the branch taken:

=== execution safe link ===  SUCCEEDED
=== execution link containing 'malware' ===  SUCCEEDED

$ status in DynamoDB after moderation:
  safe001: active
  bad001 : rejected
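
A run like the one above could be driven by a short script. This is a sketch, not the script the article used: the function name run_moderation and the input shape (linkId, url) are assumptions. The client is passed in as a parameter so the function can be exercised without AWS credentials.

```python
import json
import time


def run_moderation(sfn, state_machine_arn, link_id, url, poll_seconds=1.0):
    """Start one execution, wait until it leaves RUNNING, return the final status."""
    started = sfn.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps({"linkId": link_id, "url": url}),  # assumed input shape
    )
    while True:
        described = sfn.describe_execution(executionArn=started["executionArn"])
        if described["status"] != "RUNNING":
            return described["status"]  # e.g. SUCCEEDED, FAILED, TIMED_OUT, ABORTED
        time.sleep(poll_seconds)
```

With boto3 installed, sfn would be boto3.client("stepfunctions") and state_machine_arn the ARN of the deployed LinkModerationStateMachine.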

The same workflow, two different outcomes depending on the input data. To see which states the flow actually moved through, get the history of the run for the unsafe link:

$ aws stepfunctions get-execution-history --execution-arn "$ARN" \
    --query "events[?stateEnteredEventDetails!=null].stateEnteredEventDetails.name" --output text
CheckSafety   IsSafe   Reject

The history shows the execution entered CheckSafety, passed through IsSafe, then branched to Reject, exactly as the diagram says. Every run leaves a trail like this, so when a real process goes wrong, you can see which step it stopped at and with what data — something that's very hard to get if the whole chain lives in one function.
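
The same filter as the --query expression can be written in Python once the events are in hand, for example from a boto3 get_execution_history call. The sample events below are trimmed and hand-written to show the shape, not copied from a real run.

```python
def states_entered(events):
    """Names of the states an execution entered, in order."""
    return [
        event["stateEnteredEventDetails"]["name"]
        for event in events
        if "stateEnteredEventDetails" in event
    ]


# Hand-written events shaped like get_execution_history output.
sample_events = [
    {"type": "ExecutionStarted"},
    {"type": "TaskStateEntered", "stateEnteredEventDetails": {"name": "CheckSafety"}},
    {"type": "TaskStateExited"},
    {"type": "ChoiceStateEntered", "stateEnteredEventDetails": {"name": "IsSafe"}},
    {"type": "TaskStateEntered", "stateEnteredEventDetails": {"name": "Reject"}},
]

print(states_entered(sample_events))  # ['CheckSafety', 'IsSafe', 'Reject']
```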

Saga: undoing a mid-process failure

The workflow above only reads and writes one item, so no undo is needed. But many real processes are a chain of steps that change multiple places, and if a step in the middle fails, the earlier steps need to be undone. That's the saga pattern: each step has a compensation step to undo it, and on failure the workflow runs the compensations in reverse order.

Take a custom-domain feature that the series won't build but that is worth picturing: reserve a domain, charge a fee, then confirm. If the charge step fails after the reservation, you have to release the reserved domain, or it's locked forever. In ASL, each step can Catch an error and move to a compensation state:

"ChargeCard": {
  "Type": "Task",
  "Resource": "${ChargeArn}",
  "Catch": [ { "ErrorEquals": ["States.ALL"], "Next": "ReleaseDomain" } ],
  "Next": "ConfirmDomain"
},
"ReleaseDomain": {
  "Type": "Task",
  "Resource": "${ReleaseArn}",
  "Next": "FailSaga"
}

Step Functions fits saga because keeping state and catching errors per step is exactly what it does out of the box. You don't have to write your own "how far did I get so I know what to undo" logic; it's in the structure of the state machine. A Standard workflow with exactly-once execution semantics is the right choice for this kind of process, because an undo that runs twice by accident just creates new errors. A compensation task that declares its own Retry can still be invoked more than once, so it's worth making the undo itself idempotent.
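
The reverse-order undo is easier to see in a plain simulation outside Step Functions. This sketch runs three steps from the custom-domain example and fails the last one on purpose. run_saga and the step functions are hypothetical; in the real workflow the bookkeeping lives in the Catch transitions of the state machine rather than in a list.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps; on failure undo completed ones in reverse."""
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Undo the most recent successful step first.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"undone:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log


def noop():
    pass


def fail_confirm():
    raise RuntimeError("confirmation failed")  # simulated failure


ok, log = run_saga([
    ("ReserveDomain", noop, noop),
    ("ChargeCard", noop, noop),
    ("ConfirmDomain", fail_confirm, noop),
])
print(ok)   # False
print(log)
# ['done:ReserveDomain', 'done:ChargeCard', 'failed:ConfirmDomain',
#  'undone:ChargeCard', 'undone:ReserveDomain']
```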

🧹 Cleanup

aws dynamodb delete-item --table-name url-shortener --key '{"PK":{"S":"LINK#safe001"},"SK":{"S":"META"}}'
aws dynamodb delete-item --table-name url-shortener --key '{"PK":{"S":"LINK#bad001"},"SK":{"S":"META"}}'

Keep the stack for what's next.

Wrap-up

Step Functions separates the orchestration of a multi-step process from the processing code. Our moderation workflow has one Lambda scan step, one Choice branch step, and two DynamoDB write steps called directly without Lambda, plus declarative Retry and Catch instead of hand-written ones. Each run leaves a readable state history. Standard fits when you need an exactly-once guarantee and the ability to inspect; Express fits very high-frequency flows. And the saga pattern uses the same per-step Catch mechanism to undo when a step in the middle fails.

The event-driven part is complete: API, data, auth, events, realtime, orchestration. The product works end to end, but operating it like production means being able to see inside when something breaks. Part V opens with observability: the next article wires in Lambda Powertools for structured logs and tracing, then reads an X-Ray service map to see where a request goes and where it's slow.