Stop Breaking Production: A Step Function Rollout Strategy That Works
How confident are you that your next change won’t blow up production?
In traditional server-based systems, gradual rollouts and canary deployments are standard practice. In serverless systems, however, we too often treat infrastructure as magic black boxes rather than critical, complex systems that deserve the same rigor. Even with extensive pre-production testing, this leaves us exposed to unexpected outages.
Gradual rollouts are not a “nice-to-have”. They are essential, even for serverless.
Why Gradual Rollouts Matter To Us
Runa embarked on its serverless journey nearly 3 years ago, adopting serverless technologies to power our next-generation global payout platform. The platform needs to be fast, but reliability is non-negotiable.
At Runa, this problem is not theoretical. We use Step Functions extensively across our platform, and they sit on the path of many of our most important production workflows. That scale is what made rollout safety impossible to ignore. Engineers ship changes several times a day. Every release introduces risk, and without guardrails, that risk quickly compounds. To maintain high deployment velocity without compromising availability, we needed a deployment mechanism that ensured:
- Smooth end-user experience during releases
- Real-time monitoring and automated rollback capabilities
AWS already gives Lambda users a familiar deployment model via CodeDeploy, with built-in traffic shifting and automated rollback. We wanted the equivalent operational model for Step Functions in the Terraform-driven environment we actually run. That is what led us to build Tweety.
Defining Requirements
Before designing the rollout mechanism, we aligned on a small set of baseline requirements. These were not aspirational goals. They were constraints driven by how we build and operate production systems:
- Terraform Compatibility: Our infrastructure is managed via Terraform. The rollout solution had to integrate seamlessly, leveraging the tools our engineers already use.
- Automation and Guardrails: Rollout and rollback processes had to be fully automated with built-in safeguards to prevent accidental missteps.
- Simplicity for Engineers: The underlying implementation could be complex, but interacting with it should not be. From an engineer’s perspective, deploying with gradual rollouts needed to feel no different from a standard release.
Planning For A Rollout
Before we could even start thinking about gradual releases, we needed two critical capabilities:
- The ability to version our changes
- The ability to balance traffic between those versions
A gradual rollout only works if we can precisely control which version of a Step Function is executed at any given time. AWS Step Functions already support versioning (publishing a version creates an immutable snapshot of the state machine definition). Using aliases, we can then control which version is invoked and gradually shift traffic between versions.
In AWS, invoking a specific version is achieved using qualified ARNs, which explicitly reference either a published version or an alias.
For example:
- Unqualified ARN – always runs the latest revision of the state machine:
  arn:aws:states:us-east-1:123456789012:stateMachine:OrderProcessor
- Version Qualified ARN – explicitly points to a published version:
  arn:aws:states:us-east-1:123456789012:stateMachine:OrderProcessor:3
- Alias Qualified ARN – points to an alias that can route traffic between versions:
  arn:aws:states:us-east-1:123456789012:stateMachine:OrderProcessor:prod
To keep a rollout stable, every caller in our infrastructure must reference a qualified ARN, never the unqualified one.
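The versioning and alias primitives described above can be sketched in Terraform roughly as follows. This is a minimal, standalone illustration rather than our production module: the variable names, the `prod` alias, and the static 90/10 split are assumptions made for the example. In a real gradual rollout the weights are moved step by step at run time, not hard-coded.

```hcl
# Illustrative sketch only; all names and values are hypothetical.
variable "sfn_role_arn" {
  type        = string
  description = "Execution role for the state machine (assumed to exist)."
}

variable "previous_version_arn" {
  type        = string
  description = "Version ARN currently serving traffic; must differ from the new version."
}

resource "aws_sfn_state_machine" "order_processor" {
  name     = "OrderProcessor"
  role_arn = var.sfn_role_arn

  # Publish an immutable, numbered version whenever the definition changes.
  publish = true

  definition = jsonencode({
    StartAt = "Done"
    States = {
      Done = { Type = "Succeed" }
    }
  })
}

resource "aws_sfn_alias" "prod" {
  name = "prod"

  # Most executions stay on the previous version...
  routing_configuration {
    state_machine_version_arn = var.previous_version_arn
    weight                    = 90
  }

  # ...while a small share is routed to the newly published version.
  routing_configuration {
    state_machine_version_arn = aws_sfn_state_machine.order_processor.state_machine_version_arn
    weight                    = 10
  }
}

# Callers should start executions with this alias-qualified ARN.
output "entry_point_arn" {
  value = aws_sfn_alias.prod.arn
}
```

Exposing only the alias ARN as an output is one way to nudge callers away from the unqualified ARN.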
Only Shift Traffic At The Entry Point
Because Step Functions execute downstream services as part of a single workflow, traffic shifting can only happen at the entry point, which is the alias of the top-level state machine being invoked. Any child resources, such as nested Step Functions or Lambdas, must be version pinned.
If we attempt to shift traffic for downstream resources independently, we risk creating version mismatches when the contract between versions changes. For example, a new state machine may invoke an older Lambda with an incompatible input, or an older state machine may call a newer Lambda expecting fields that are no longer present. These mismatches can lead to runtime errors or partial outages that are hard to detect during rollout.
The diagram below shows an example of an incorrect rollout, where traffic is shifted independently across state machines and downstream Lambdas:

Version Pin Downstream Dependencies
To ensure that changes to child resources are rolled out gradually and safely, we made version pinning mandatory for all downstream resources. This prevents changes from reaching production immediately and ensures that each Step Function version executes against a consistent set of dependencies.
Version pinning also plays a key role in driving the gradual rollout process as new versions of downstream resources are introduced and referenced by a new state machine version.
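As a rough sketch of what pinning a downstream Lambda can look like in Terraform, the standalone example below references the function's `qualified_arn` (which includes the version number) from the state machine definition. The function name, runtime, handler, and zip path are hypothetical placeholders, not values from our platform.

```hcl
# Illustrative sketch only; names, runtime and paths are hypothetical.
variable "lambda_role_arn" {
  type = string
}

variable "sfn_role_arn" {
  type = string
}

resource "aws_lambda_function" "validate_order" {
  function_name    = "validate-order"
  role             = var.lambda_role_arn
  runtime          = "python3.12"
  handler          = "handler.main"
  filename         = "build/validate_order.zip"
  source_code_hash = filebase64sha256("build/validate_order.zip")

  # Publish a numbered, immutable Lambda version on every change.
  publish = true
}

resource "aws_sfn_state_machine" "order_processor" {
  name     = "OrderProcessor"
  role_arn = var.sfn_role_arn
  publish  = true

  definition = jsonencode({
    StartAt = "ValidateOrder"
    States = {
      ValidateOrder = {
        Type     = "Task"
        Resource = "arn:aws:states:::lambda:invoke"
        Parameters = {
          # qualified_arn ends in the Lambda version number, so each
          # published state machine version keeps calling the Lambda
          # version it was created with.
          FunctionName = aws_lambda_function.validate_order.qualified_arn
          "Payload.$"  = "$"
        }
        End = true
      }
    }
  })
}
```

Because the pinned ARN is part of the definition, publishing a new Lambda version changes the definition and therefore produces a new state machine version, which is what lets a downstream change flow through the same gradual rollout.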

IAM Resources Are Not Versioned
This is easy to overlook during rollouts and has caused real production failures for us in the past.
While Step Functions can be versioned and aliased, IAM roles and policies have no built-in versioning. All versions of a Step Function share the same IAM configuration.
This has an important implication. Any IAM change must remain fully backward compatible with all Step Function versions that are still receiving traffic. Removing permissions, even if they are no longer required by the latest version, can cause older executions to fail with AccessDenied errors during a rollout.
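One way to keep an IAM change backward compatible is to add permissions for the new dependency while leaving the old ones in place until no traffic reaches the old version. The sketch below assumes a shared execution role and two pinned Lambda version ARNs; all variable and policy names are hypothetical.

```hcl
# Illustrative sketch only; variable and policy names are hypothetical.
variable "sfn_role_name" {
  type        = string
  description = "Name of the execution role shared by all state machine versions."
}

variable "old_lambda_version_arn" {
  type        = string
  description = "Lambda version pinned by the state machine version being phased out."
}

variable "new_lambda_version_arn" {
  type        = string
  description = "Lambda version pinned by the state machine version being rolled out."
}

data "aws_iam_policy_document" "invoke_pinned_lambdas" {
  statement {
    sid     = "InvokePinnedLambdaVersions"
    effect  = "Allow"
    actions = ["lambda:InvokeFunction"]

    resources = [
      # Keep the old ARN until no execution can still run the old version.
      var.old_lambda_version_arn,
      var.new_lambda_version_arn,
    ]
  }
}

resource "aws_iam_role_policy" "sfn_invoke" {
  name   = "invoke-pinned-lambdas"
  role   = var.sfn_role_name
  policy = data.aws_iam_policy_document.invoke_pinned_lambdas.json
}
```

Removing the old ARN then becomes a separate, later change, made once the rollout has completed and in-flight executions on the old version have finished.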
Once we were able to version workflows and shift traffic safely, we formalised three non-negotiable rules:
- Pin all downstream resources
- Shift traffic only at the entry point
- Keep IAM changes backward compatible
Violating any of these rules risks breaking production in the middle of a rollout.
Meet Tweety
To orchestrate the release process, we built Tweety, our in-house release manager implemented as a Step Function. The name is inspired by the Looney Tunes canary, reflecting Tweety’s role as an early warning system during deployments.
The idea behind Tweety is deliberately simple. If you have used Lambda canary deployments via CodeDeploy before, the model will feel familiar: publish an immutable version, shift traffic gradually, monitor health, and roll back automatically when signals degrade.
Tweety brings that operating model to Step Functions, packaged in a way that fits our Terraform workflows, our dependency-pinning rules, and the realities of rolling out orchestration logic safely.
It handles rollout between versions, monitors health checks, and automatically rolls back if alarms are triggered:

To kick off the gradual rollout process, we invoke Tweety whenever a Step Function version changes. The invocation is triggered from Terraform using the AWS CLI, with user-defined inputs such as traffic shift percentage, rollout interval, and monitoring thresholds:
resource "null_resource" "run_gradual_rollout" {
  # Re-run the provisioner whenever a new state machine version is published
  triggers = {
    sfn_version = var.version_arn
  }

  provisioner "local-exec" {
    # Only start a rollout when the new version ARN differs from the current one
    command = <<EOT
if [ "${var.version_arn}" != "${local.current_version_arn}" ]; then
  aws stepfunctions start-execution \
    --state-machine-arn "${var.gradual_rollout_stepfunction}" \
    --input '${jsonencode(local.rollout_payload)}'
fi
EOT
  }
}
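The contents of `local.rollout_payload` are not shown in the snippet above. Purely as an illustration of the kind of inputs mentioned (traffic shift percentage, rollout interval, monitoring thresholds), a payload might look something like the sketch below. Every key name and value here is a hypothetical assumption and should not be read as Tweety's actual input schema.

```hcl
# Hypothetical payload shape; key names are NOT Tweety's real schema.
variable "state_machine_arn" {
  type        = string
  description = "Unqualified ARN of the state machine being rolled out."
}

variable "version_arn" {
  type        = string
  description = "ARN of the newly published state machine version."
}

locals {
  rollout_payload = {
    state_machine_arn        = var.state_machine_arn
    alias_name               = "prod"
    new_version_arn          = var.version_arn
    traffic_shift_percentage = 10  # share of traffic moved per step
    rollout_interval_seconds = 300 # wait between steps
    alarm_names              = ["order-processor-executions-failed"]
  }
}
```

Keeping the payload in a `locals` block means `jsonencode` produces the JSON string passed to `--input`, so engineers only ever edit plain Terraform values.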
Once the rollout is initiated, Tweety runs two jobs in parallel:
- Traffic management: This job determines the next traffic weight and updates the Step Function alias accordingly.
- Health checks: In parallel, the health check job continuously evaluates system metrics against team-defined thresholds. If any degradation is detected, it stops the rollout, initiates a rollback, and alerts the team that the release has failed.
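As an example of a health signal such a job could evaluate, the sketch below defines a CloudWatch alarm on failed executions using the standard `AWS/States` namespace. The alarm name, threshold, and period are hypothetical, and the exact metric dimensions that apply to executions started through an alias should be checked against the Step Functions CloudWatch metrics documentation.

```hcl
# Illustrative sketch only; alarm name, threshold and period are hypothetical.
variable "state_machine_arn" {
  type        = string
  description = "ARN of the state machine to monitor."
}

resource "aws_cloudwatch_metric_alarm" "executions_failed" {
  alarm_name        = "order-processor-executions-failed"
  alarm_description = "Fires when any execution fails within a one-minute window."

  namespace   = "AWS/States"
  metric_name = "ExecutionsFailed"
  statistic   = "Sum"

  period              = 60
  evaluation_periods  = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 1

  # Quiet periods with no executions should not count as unhealthy.
  treat_missing_data = "notBreaching"

  dimensions = {
    StateMachineArn = var.state_machine_arn
  }
}
```

A threshold this strict suits low-volume workflows; for high-volume ones, a failure-rate expression is usually a better fit than an absolute count.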
Building our own gradual rollout process allowed us to fully customise it to our needs, from Slack notifications to support for non-standard deployment strategies.
DevX as a functional requirement
Ease of adoption was a first-class concern when designing the rollout solution. Gradual rollouts had to be simple to adopt and should not require engineers to understand the internal workings of the rollout engine. To achieve this, we wrapped the required resources in a Terraform module that hides the complexity and manages the gradual rollout process end to end.
From an engineer’s perspective, deploying with gradual rollouts feels no different from any other change, except that it is safer. Engineers provide a small set of inputs to a Terraform module, and the rest is handled automatically: versioning, alias shifts, monitoring, rollback, and Slack updates.
The visual execution graphs and built-in logging provided by Step Functions also help engineers better understand the deployment process and simplify troubleshooting when things go wrong.
Closing thoughts
Once the gradual rollout mechanism was in place, it was quickly adopted across our engineering teams. A key enabler was the set of custom Terraform modules we built, which made safe rollouts easy to apply and operate consistently.
By adding safety guardrails to Step Functions, we enabled engineers to move fast and deploy with confidence, even during peak season, without putting production stability at risk.
While AWS provides powerful primitives such as Step Function versions and aliases, building a safe rollout workflow still requires stitching those pieces together with operational guardrails. Tweety is our attempt to codify those best practices into a reusable tool.
We are now open-sourcing the Terraform module behind Tweety so other teams can adopt and adapt the same approach in their own AWS environments.
To get started, you can find our open-source module here: https://registry.terraform.io/modules/wegift/tweety-sfn/aws/latest