CloudFormation ROLLBACK_COMPLETE: Understanding and Recovering from Stack Failures
2026-04-26 · 9 min read
You run aws cloudformation create-stack and wait. The stack creation starts, resources begin provisioning, and then everything stops. The stack status changes to ROLLBACK_IN_PROGRESS and eventually settles on ROLLBACK_COMPLETE. Your stack exists but contains no resources. You cannot update it. You cannot retry the creation. It just sits there, a monument to something that went wrong.
This is one of the most frustrating states in all of AWS. A stack in ROLLBACK_COMPLETE cannot be updated — it can only be deleted and recreated. But before you delete it and try again blindly, you need to understand what went wrong. Otherwise, you will end up in the same state five minutes later.
Here is how to extract meaningful error information from a failed stack, understand the most common causes, and recover efficiently.
Extracting the Real Error Message
CloudFormation buries the actual error messages in the stack events. The stack status itself just says ROLLBACK_COMPLETE, which tells you nothing about the cause. The details are in the individual resource events.
# Get all events for the failed stack, most recent first
aws cloudformation describe-stack-events \
--stack-name my-failed-stack \
--query 'StackEvents[?ResourceStatus==`CREATE_FAILED`].{
Resource: LogicalResourceId,
Type: ResourceType,
Reason: ResourceStatusReason,
Timestamp: Timestamp
}' \
--output table
This filters the event stream to show only the resources that actually failed, along with the reason. The first CREATE_FAILED event is usually the root cause — subsequent failures are often cascading effects of the first failure.
For a more complete picture of the timeline:
# Get the full event timeline
aws cloudformation describe-stack-events \
--stack-name my-failed-stack \
--query 'StackEvents[*].{
Time: Timestamp,
Resource: LogicalResourceId,
Status: ResourceStatus,
Reason: ResourceStatusReason
}' \
--output table
Read the events from bottom (oldest) to top (newest) to understand the sequence: which resource failed first, what the error was, and how the rollback propagated.
Root Cause 1: IAM Insufficient Permissions
The most common cause of stack failures is the CloudFormation service role (or the IAM user/role running the command) lacking permissions to create the resources in the template.
The error message looks like:
API: ec2:CreateSecurityGroup You are not authorized to perform this operation.
Encoded authorization failure message: ...
Or more subtly:
Resource handler returned message: "Access denied for operation 'AWS::RDS::DBInstance'"
Diagnosing the permission issue:
If the error includes an encoded authorization failure message, decode it:
aws sts decode-authorization-message \
--encoded-message "LONG_ENCODED_STRING_HERE" \
--query 'DecodedMessage' \
--output text | python3 -m json.tool
The decoded message shows exactly which IAM action was denied, on which resource, and which policy conditions were evaluated — far more useful than the generic "not authorized" message. Note that running the decode itself requires the sts:DecodeAuthorizationMessage permission.
The fix: Add the required permissions to the CloudFormation service role. If you are using a dedicated CloudFormation service role (recommended), update its policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ec2:CreateSecurityGroup",
"ec2:DeleteSecurityGroup",
"ec2:AuthorizeSecurityGroupIngress",
"ec2:AuthorizeSecurityGroupEgress",
"ec2:RevokeSecurityGroupIngress",
"ec2:RevokeSecurityGroupEgress",
"ec2:DescribeSecurityGroups"
],
"Resource": "*",
"Condition": {
"StringEquals": {
"aws:RequestedRegion": "us-east-1"
}
}
}
]
}
Note that CloudFormation needs both create and delete permissions for every resource type. If it can create a resource but not delete it, the rollback itself will fail, putting you in the even worse ROLLBACK_FAILED state.
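If you do not yet have a dedicated service role, its trust policy must allow the CloudFormation service itself to assume it — a minimal sketch (the permissions policy you attach on top is up to you):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "cloudformation.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

Pass the role to stack operations with --role-arn so CloudFormation acts with the role's permissions rather than those of whoever runs the command.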
Root Cause 2: Resource Limit Exceeded
Every AWS account has service quotas. If your template tries to create a resource that would exceed a quota, the creation fails.
Common examples:
The maximum number of VPCs has been reached. (Service: AmazonEC2; Status Code: 400)
The maximum number of addresses has been reached. (Service: AmazonEC2; Status Code: 400)
Cannot create more than 200 security groups for vpc-0abc123
Check your current quotas:
# List all EC2 quotas with current usage
aws service-quotas list-service-quotas \
--service-code ec2 \
--query 'Quotas[?UsageMetric].{
Name: QuotaName,
Value: Value,
Adjustable: Adjustable
}' \
--output table
# Check a specific quota
aws service-quotas get-service-quota \
--service-code vpc \
--quota-code L-F678F1CE \
--query 'Quota.{Name: QuotaName, Value: Value}'
The fix: Either request a quota increase through the Service Quotas console or API, or redesign your template to use fewer resources.
# Request a quota increase
aws service-quotas request-service-quota-increase \
--service-code ec2 \
--quota-code L-0263D0A3 \
--desired-value 10
Root Cause 3: Invalid Property Values
Template validation catches syntax errors, but it cannot validate whether a property value is actually valid in your account and region. For example:
The subnet ID 'subnet-0abc123' does not exist (Service: AmazonEC2)
The key pair 'my-keypair' does not exist (Service: AmazonEC2)
The AMI ID 'ami-0abc123' does not exist
These errors happen when:
- You copy a template from one account or region to another without updating resource references
- A referenced resource (subnet, key pair, AMI) was deleted after the template was written
- You use hardcoded resource IDs instead of parameters or cross-stack references
The fix: Use parameters for all environment-specific values and validate them before stack creation:
# Validate the template syntax
aws cloudformation validate-template \
--template-body file://template.yaml
# Verify referenced resources exist
aws ec2 describe-subnets --subnet-ids subnet-0abc123
aws ec2 describe-key-pairs --key-names my-keypair
aws ec2 describe-images --image-ids ami-0abc123
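You can also push some of this validation into the template itself: AWS-specific parameter types are checked against your account and region when the stack call is submitted, before any resources are provisioned. A sketch with illustrative parameter names:

```yaml
Parameters:
  InstanceSubnet:
    Type: AWS::EC2::Subnet::Id        # must be an existing subnet in this account/region
    Description: Subnet to launch into
  SshKeyName:
    Type: AWS::EC2::KeyPair::KeyName  # must be an existing EC2 key pair
    Description: Key pair for SSH access
  BaseAmi:
    Type: AWS::EC2::Image::Id         # must be a valid AMI ID
    Description: AMI to launch from
```

With these types, an invalid ID fails the create-stack call up front with a clear validation error instead of failing mid-provisioning and triggering a rollback.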
Root Cause 4: Resource Already Exists
If your template specifies a physical resource name (like a DynamoDB table name or S3 bucket name) and that resource already exists, the creation fails:
my-table already exists in stack arn:aws:cloudformation:us-east-1:123456789:stack/other-stack/abc123
my-bucket already exists
This happens frequently when you delete a stack and recreate it, because some resources (like S3 buckets with data) are retained on stack deletion and are not cleaned up.
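Whether a resource survives stack deletion is controlled by its DeletionPolicy. A stateful resource is often declared like this (a sketch; the bucket name is a placeholder), which is one common reason it lingers after the stack is gone:

```yaml
Resources:
  DataBucket:
    Type: AWS::S3::Bucket
    # Retain keeps the bucket and its contents when the stack is deleted,
    # so a later create-stack using the same BucketName will collide
    DeletionPolicy: Retain
    Properties:
      BucketName: my-bucket
```

Retain is a sensible default for data stores, but it means cleanup of the orphaned resource becomes your job.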
Confirm which stack owns the resource — the "already exists" error usually names the owning stack, so you can list its resources and match the physical ID:
# List the resources of the named stack and match the physical ID
aws cloudformation list-stack-resources \
--stack-name other-stack \
--query 'StackResourceSummaries[?PhysicalResourceId==`my-table`]'
The fix: Either rename the resource in your template, import the existing resource into your stack, or delete the orphaned resource first. For S3 buckets, you must empty the bucket before deleting it:
# Empty and delete an orphaned S3 bucket
aws s3 rm s3://my-bucket --recursive
aws s3api delete-bucket --bucket my-bucket
Root Cause 5: Circular Dependencies
CloudFormation resolves resource creation order based on dependency references (Ref, Fn::GetAtt, DependsOn). If Resource A references Resource B and Resource B references Resource A, CloudFormation cannot determine which to create first:
Circular dependency between resources: [SecurityGroup, LaunchTemplate]
CloudFormation detects direct cycles like this as soon as you submit the template, before any resources are created. Subtler cycles — introduced indirectly through nested stacks or custom resources — can surface only at creation time.
The fix: Break the cycle by separating the resources. For the common Security Group circular dependency (Group A allows traffic from Group B, Group B allows traffic from Group A), use separate security group rule resources:
SecurityGroupA:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Group A
SecurityGroupB:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Group B
# Add the ingress rules as separate resources to break the cycle
IngressAFromB:
Type: AWS::EC2::SecurityGroupIngress
Properties:
GroupId: !Ref SecurityGroupA
SourceSecurityGroupId: !Ref SecurityGroupB
IpProtocol: tcp
FromPort: 443
ToPort: 443
IngressBFromA:
Type: AWS::EC2::SecurityGroupIngress
Properties:
GroupId: !Ref SecurityGroupB
SourceSecurityGroupId: !Ref SecurityGroupA
IpProtocol: tcp
FromPort: 443
ToPort: 443
Recovering from ROLLBACK_COMPLETE
A stack in ROLLBACK_COMPLETE cannot be updated. Your only option is to delete it and try again:
# Delete the failed stack
aws cloudformation delete-stack \
--stack-name my-failed-stack
# Wait for deletion to complete
aws cloudformation wait stack-delete-complete \
--stack-name my-failed-stack
Before recreating, fix the root cause identified in the stack events, then recreate with the corrected template. To avoid the stuck state in the future, note that create-stack also accepts --on-failure DELETE, which deletes a failed stack automatically instead of leaving it in ROLLBACK_COMPLETE.
Recovering from UPDATE_ROLLBACK_FAILED and DELETE_FAILED
Sometimes the rollback itself fails — perhaps CloudFormation created a resource but cannot delete it (the IAM role lacks delete permissions, or the resource has dependencies that prevent deletion). Depending on the operation, the stack enters ROLLBACK_FAILED, UPDATE_ROLLBACK_FAILED, or DELETE_FAILED.
For UPDATE_ROLLBACK_FAILED (a failed update whose rollback got stuck):
# Continue the rollback, skipping resources that cannot be rolled back
aws cloudformation continue-update-rollback \
--stack-name my-stuck-stack \
--resources-to-skip "MyS3Bucket" "MyDynamoTable"
This tells CloudFormation to skip the problematic resources and complete the rollback; clean up the skipped resources manually afterward. Note that continue-update-rollback applies only to failed updates — a stack stuck in ROLLBACK_FAILED after a failed creation can only be deleted, retaining undeletable resources if necessary.
For DELETE_FAILED:
# Force delete, retaining resources that cannot be deleted
aws cloudformation delete-stack \
--stack-name my-stuck-stack \
--retain-resources "MyS3Bucket" "MyDynamoTable"
Using --disable-rollback for Debugging
When you are debugging a template that keeps failing, the automatic rollback is counterproductive — it destroys the resources before you can inspect them. Use --disable-rollback to keep the resources in place after a failure:
aws cloudformation create-stack \
--stack-name my-debug-stack \
--template-body file://template.yaml \
--disable-rollback \
--parameters ParameterKey=Environment,ParameterValue=dev
If the stack fails, it enters CREATE_FAILED instead of ROLLBACK_COMPLETE. The successfully created resources remain, and you can inspect them to understand the failure. When you are done debugging, delete the stack manually.
Important: never use --disable-rollback in production pipelines. It is a debugging tool only.
Drift Detection: Why Updates Fail on Existing Stacks
Sometimes a stack that has been running fine for months suddenly fails on update. The cause is often drift — someone modified a stack-managed resource directly through the console or CLI, and the stack's view of reality no longer matches actual state.
# Initiate drift detection
aws cloudformation detect-stack-drift \
--stack-name my-stack
# Check drift detection status (may take a few minutes)
aws cloudformation describe-stack-drift-detection-status \
--stack-drift-detection-id "detection-id-here"
# View drifted resources
aws cloudformation describe-stack-resource-drifts \
--stack-name my-stack \
--stack-resource-drift-status-filters MODIFIED DELETED \
--query 'StackResourceDrifts[*].{
Resource: LogicalResourceId,
Status: StackResourceDriftStatus,
Differences: PropertyDifferences
}'
If drift is detected, you need to either update the template to match the current state or manually revert the drifted resources to match the template before running the stack update.
Prevention Best Practices
- Always run validate-template before create-stack — catch syntax errors early
- Use change sets for updates — review what CloudFormation plans to do before it does it:
aws cloudformation create-change-set \
--stack-name my-stack \
--template-body file://template.yaml \
--change-set-name my-changes
aws cloudformation describe-change-set \
--stack-name my-stack \
--change-set-name my-changes \
--query 'Changes[*].ResourceChange.{
Action: Action,
Resource: LogicalResourceId,
Replacement: Replacement
}'
- Avoid hardcoded resource names — let CloudFormation auto-generate names to prevent "already exists" errors
- Use stack policies to prevent accidental deletion of critical resources
- Test in a non-production account first — catch permission and quota issues before they affect production
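For the stack-policy point, a minimal policy might look like this (the logical resource ID is a placeholder) — it allows all update actions except replacing or deleting one critical resource:

```json
{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "Update:*",
      "Principal": "*",
      "Resource": "*"
    },
    {
      "Effect": "Deny",
      "Action": ["Update:Replace", "Update:Delete"],
      "Principal": "*",
      "Resource": "LogicalResourceId/ProductionDatabase"
    }
  ]
}
```

Apply it with aws cloudformation set-stack-policy --stack-name my-stack --stack-policy-body file://policy.json.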
When CloudFormation Problems Become Chronic
If your team regularly fights with CloudFormation failures, the issue is usually not CloudFormation itself — it is the template architecture, the permission model, or the deployment process. Poorly structured templates with hundreds of resources, insufficient service role permissions, and manual deployments without change sets are recipes for repeated failures.
We help teams restructure their CloudFormation templates into manageable, nested stacks with proper dependency management, implement CI/CD pipelines with automated change set review, and build service roles with least-privilege permissions. Reach out for a free AWS consultation and let us help you build a deployment process that works reliably.
Need help with your AWS infrastructure?
Book a free 30-minute consultation to discuss your challenges.