Serverless

Building Event-Driven Systems on AWS: EventBridge, SQS, and Step Functions in Production

2026-02-01 · 14 min read

Event-driven architecture is one of the most powerful patterns on AWS — and one of the most misunderstood. I have seen teams adopt it for the right reasons (loose coupling, independent scaling, pay-per-event pricing) and end up with distributed systems that are harder to operate than the monolith they replaced. I have also seen teams build event-driven platforms that process millions of events a day reliably, scale without intervention, and survive component failures gracefully.

The difference is almost never the technology choice. It is whether the team understood the fundamental tradeoffs before they started building.

This post is a practical guide to building event-driven systems on AWS that work in production. I will walk through a concrete order-processing workflow — the kind you would find at a mid-size e-commerce company — and show you the full lifecycle: choosing the right AWS messaging service, designing event schemas, wiring up routing rules, building Lambda consumers, configuring dead-letter queues, and tracing failures end-to-end with X-Ray.

Choosing the Right Service: EventBridge vs SQS vs SNS

This is the question I get asked most often, and the answer is almost never "just use EventBridge." Each service has a distinct role, and using the wrong one creates problems that are hard to fix after you have built on top of it.

Use Amazon EventBridge when:

  • You need content-based routing (route based on event payload fields, not just topic)
  • You want to connect AWS services to each other or to SaaS providers without custom glue code
  • You need an audit trail and event replay capability
  • You are building a system where multiple independent consumers need to react to the same event

Use Amazon SQS when:

  • You need guaranteed, at-least-once delivery to a single consumer
  • You want backpressure — the ability for the consumer to control its processing rate
  • You need FIFO ordering within a message group
  • You are queuing work for asynchronous processing (image resizing, email sending, report generation)

Use Amazon SNS when:

  • You need fan-out to multiple endpoints simultaneously (Lambda + SQS + HTTP endpoint)
  • You want simple publish/subscribe without routing logic
  • Delivery speed is more important than persistence

The pattern that works at scale:

In practice, EventBridge and SQS are complementary, not competing. The pattern I recommend for most production systems:

  1. EventBridge as the central event bus — services publish domain events here
  2. EventBridge rules route events to their consumers
  3. SQS queues buffer events before they reach Lambda consumers — this provides backpressure, retry buffering, and DLQ support
  4. Lambda consumes from the SQS queue, processing one batch at a time

This gives you the routing flexibility of EventBridge combined with the durability and consumer-rate control of SQS.

Service       Delivery                 Consumers  Routing           Replay         Retention
EventBridge   At-least-once            Multiple   Content-based     Yes (archive)  24h delivery retry
SQS Standard  At-least-once            Single     None              No             Up to 14 days
SQS FIFO      Exactly-once processing  Single     None              No             Up to 14 days
SNS           At-least-once            Multiple   Attribute filter  No             No persistence

Designing Event Schemas That Survive Change

Schema design is the decision that will either save you or haunt you six months into production. An event schema is a contract between producer and consumer — and unlike API contracts, changing an event schema has consequences for every consumer that has ever subscribed to it.

The principles I apply on every engagement:

Principle 1: Events are facts, not commands

order.placed is an event (a fact that happened). process-payment is a command (an instruction). Commands couple the producer to a specific consumer and a specific behaviour. Events decouple them — the producer does not know or care what the consumers will do.

Principle 2: Include enough context to avoid callbacks

Every time a consumer has to call another service to get data missing from the event, you have reintroduced coupling. Include the data the consumer needs: not just orderId, but customerId, totalAmount, currency, and the line items.
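
As a sketch, the difference between a thin and a fat event looks like this in TypeScript (the type names are illustrative, not from a published schema):

```typescript
// Thin event: every consumer must call back to the order service for details.
interface OrderPlacedThin {
  orderId: string;
}

// Fat event: carries the context consumers actually need.
interface OrderPlacedFat {
  version: string;
  orderId: string;
  customerId: string;
  totalAmount: number;
  currency: string;
  items: { productId: string; quantity: number; unitPrice: number }[];
  placedAt: string; // ISO-8601 timestamp
}

// A consumer of the fat event can compute what it needs locally,
// with no extra network call back to the producer.
function orderValue(event: OrderPlacedFat): number {
  return event.items.reduce((sum, i) => sum + i.quantity * i.unitPrice, 0);
}
```

The fat event is larger on the wire, but it keeps consumers independent — which is the entire point of the architecture.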

Principle 3: Version your schemas from day one

The EventBridge Schema Registry is built for this. Register your event schemas early, and your team gets auto-generated code bindings in Python, TypeScript, Java, and Go.

# Create a Schema Registry for your application
aws schemas create-registry \
  --registry-name "my-ecommerce-app" \
  --description "Event schemas for order processing system"

# Register the initial order.placed schema
aws schemas create-schema \
  --registry-name "my-ecommerce-app" \
  --schema-name "com.myapp.orders@OrderPlaced" \
  --type JSONSchemaDraft4 \
  --content '{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "title": "OrderPlaced",
    "type": "object",
    "properties": {
      "version": {"type": "string"},
      "orderId": {"type": "string"},
      "customerId": {"type": "string"},
      "totalAmount": {"type": "number"},
      "currency": {"type": "string"},
      "items": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "productId": {"type": "string"},
            "quantity": {"type": "integer"},
            "unitPrice": {"type": "number"}
          }
        }
      },
      "placedAt": {"type": "string", "format": "date-time"}
    },
    "required": ["version", "orderId", "customerId", "totalAmount", "items", "placedAt"]
  }'

Principle 4: Add a version field to every event

When you need to make a breaking change, publish both the old and new event versions simultaneously. Consumers migrate at their own pace. Only retire the old version once all consumers have migrated.
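
A consumer that supports two coexisting versions can normalise them into a single internal shape before processing. A sketch of that dispatch, with hypothetical v1/v2 payloads (the v2 shape is invented for illustration):

```typescript
// Two coexisting schema versions for the same domain event.
type OrderPlacedV1 = { version: '1.0'; orderId: string; totalAmount: number };
type OrderPlacedV2 = { version: '2.0'; orderId: string; total: { amount: number; currency: string } };
type OrderPlaced = OrderPlacedV1 | OrderPlacedV2;

// Normalise both versions into the one shape the business logic works with.
function normalise(event: OrderPlaced): { orderId: string; amount: number } {
  switch (event.version) {
    case '1.0':
      return { orderId: event.orderId, amount: event.totalAmount };
    case '2.0':
      return { orderId: event.orderId, amount: event.total.amount };
    default:
      throw new Error('Unknown event version');
  }
}
```

Once every consumer routes through a normaliser like this, retiring v1 is a one-line deletion instead of a coordinated migration.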

A complete event envelope looks like this:

{
  "source": "com.myapp.orders",
  "detail-type": "order.placed",
  "detail": {
    "version": "1.0",
    "orderId": "ord-20260201-4829",
    "customerId": "cust-8472",
    "totalAmount": 149.95,
    "currency": "EUR",
    "items": [
      {"productId": "prod-291", "quantity": 2, "unitPrice": 49.95},
      {"productId": "prod-847", "quantity": 1, "unitPrice": 50.05}
    ],
    "placedAt": "2026-02-01T14:32:00Z"
  }
}
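
A small helper keeps published events consistent with this envelope. A sketch (toPutEventsEntry is an illustrative helper; the resulting entries would be passed to EventBridge's PutEvents API, e.g. PutEventsCommand in the AWS SDK for JavaScript v3):

```typescript
interface OrderPlacedDetail {
  version: string;
  orderId: string;
  customerId: string;
  totalAmount: number;
  currency: string;
  items: { productId: string; quantity: number; unitPrice: number }[];
  placedAt: string;
}

// Build a PutEvents entry matching the envelope above. Centralising this
// means no producer can publish with a mistyped source or detail-type.
function toPutEventsEntry(detail: OrderPlacedDetail) {
  return {
    EventBusName: 'ecommerce-events',
    Source: 'com.myapp.orders',
    DetailType: 'order.placed',
    Detail: JSON.stringify(detail), // EventBridge expects the detail as a JSON string
  };
}
```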

Building the Event Bus and Routing Rules

With the schema defined, let us wire up the EventBridge infrastructure for the order-processing workflow.

Create a dedicated event bus:

The default event bus mixes AWS service events with your application events. Use a dedicated application event bus — it is cleaner, easier to archive, and easier to audit.

# Create the application event bus
aws events create-event-bus \
  --name "ecommerce-events" \
  --description "Application events for order processing"

# Enable event archiving for replay and debugging
aws events create-archive \
  --archive-name "ecommerce-events-archive" \
  --event-source-arn "arn:aws:events:eu-central-1:123456789012:event-bus/ecommerce-events" \
  --retention-days 30

Create routing rules to fan out to SQS queues:

# Rule: route order.placed events to the payment queue
aws events put-rule \
  --name "order-placed-to-payment" \
  --event-bus-name "ecommerce-events" \
  --event-pattern '{
    "source": ["com.myapp.orders"],
    "detail-type": ["order.placed"]
  }' \
  --state ENABLED

# Create the SQS queue for the payment service
aws sqs create-queue \
  --queue-name "payment-processing-queue" \
  --attributes '{
    "VisibilityTimeout": "300",
    "MessageRetentionPeriod": "86400",
    "ReceiveMessageWaitTimeSeconds": "20"
  }'

# Add the SQS queue as the rule target
aws events put-targets \
  --event-bus-name "ecommerce-events" \
  --rule "order-placed-to-payment" \
  --targets '[{
    "Id": "payment-sqs",
    "Arn": "arn:aws:sqs:eu-central-1:123456789012:payment-processing-queue"
  }]'

# Route the same event to the inventory queue
aws events put-rule \
  --name "order-placed-to-inventory" \
  --event-bus-name "ecommerce-events" \
  --event-pattern '{
    "source": ["com.myapp.orders"],
    "detail-type": ["order.placed"]
  }' \
  --state ENABLED

aws events put-targets \
  --event-bus-name "ecommerce-events" \
  --rule "order-placed-to-inventory" \
  --targets '[{
    "Id": "inventory-sqs",
    "Arn": "arn:aws:sqs:eu-central-1:123456789012:inventory-reservation-queue"
  }]'
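
One prerequisite the CLI calls above do not cover: EventBridge can only deliver to an SQS target if the queue's resource policy allows events.amazonaws.com to send messages, scoped to the rule. A sketch of that policy document as a pure function (buildQueuePolicy is an illustrative helper; apply the output as the queue's Policy attribute via aws sqs set-queue-attributes):

```typescript
// Build an SQS resource policy that lets EventBridge deliver to the queue,
// restricted to a single rule via aws:SourceArn.
function buildQueuePolicy(queueArn: string, ruleArn: string): string {
  return JSON.stringify({
    Version: '2012-10-17',
    Statement: [{
      Sid: 'AllowEventBridgeSendMessage',
      Effect: 'Allow',
      Principal: { Service: 'events.amazonaws.com' },
      Action: 'sqs:SendMessage',
      Resource: queueArn,
      Condition: { ArnEquals: { 'aws:SourceArn': ruleArn } },
    }],
  });
}
```

Without this policy, the rule appears healthy but every delivery fails — one of the quietest failure modes in the whole setup.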

Publish a test event to verify routing:

aws events put-events \
  --entries '[{
    "EventBusName": "ecommerce-events",
    "Source": "com.myapp.orders",
    "DetailType": "order.placed",
    "Detail": "{\"version\":\"1.0\",\"orderId\":\"ord-test-001\",\"customerId\":\"cust-8472\",\"totalAmount\":149.95,\"currency\":\"EUR\",\"items\":[],\"placedAt\":\"2026-02-01T14:32:00Z\"}"
  }]'

Configuring Dead-Letter Queues and Retry Patterns

This is where most event-driven systems fail in production. Happy-path testing works fine, but as soon as a consumer throws an exception, messages pile up in the DLQ — or worse, get silently dropped — and nobody notices for days.

Every Lambda event source mapping and every SQS queue needs a properly configured DLQ. There are no exceptions.

Creating the DLQ infrastructure:

# Create a DLQ for the payment processing queue
aws sqs create-queue \
  --queue-name "payment-processing-dlq" \
  --attributes '{
    "MessageRetentionPeriod": "1209600"
  }'

# Set the redrive policy on the main queue (3 retries before DLQ)
aws sqs set-queue-attributes \
  --queue-url "https://sqs.eu-central-1.amazonaws.com/123456789012/payment-processing-queue" \
  --attributes '{
    "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:eu-central-1:123456789012:payment-processing-dlq\",\"maxReceiveCount\":3}"
  }'

Lambda event source mapping with partial batch responses:

When Lambda is consuming from SQS and a single message causes an error, the default behaviour is to retry the entire batch. A single poisoned message forces every other message in its batch to be reprocessed until the redrive policy finally moves it to the DLQ. The fix is enabling partial batch responses with ReportBatchItemFailures. (BisectBatchOnFunctionError, which you will sometimes see recommended here, applies only to Kinesis and DynamoDB stream sources — not SQS.)

# Create the Lambda event source mapping with partial batch responses
aws lambda create-event-source-mapping \
  --function-name process-payment \
  --event-source-arn "arn:aws:sqs:eu-central-1:123456789012:payment-processing-queue" \
  --batch-size 10 \
  --maximum-batching-window-in-seconds 30 \
  --function-response-types ReportBatchItemFailures

No separate OnFailure destination is needed for SQS event sources — failed messages reach the DLQ through the queue's redrive policy configured above.

CloudWatch alarm on DLQ depth:

A DLQ that fills up in silence is worthless. Wire up an alarm so any DLQ message triggers immediate attention:

aws cloudwatch put-metric-alarm \
  --alarm-name "payment-dlq-messages" \
  --alarm-description "Messages in payment DLQ — investigation required" \
  --metric-name ApproximateNumberOfMessagesVisible \
  --namespace AWS/SQS \
  --dimensions Name=QueueName,Value=payment-processing-dlq \
  --statistic Maximum \
  --period 60 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 1 \
  --alarm-actions "arn:aws:sns:eu-central-1:123456789012:engineering-alerts" \
  --treat-missing-data notBreaching

Reporting partial batch failures from your Lambda handler:

Even with SQS's retry mechanism in place, your handler should distinguish transient errors (downstream API rate limits, database connection limits, temporary network issues) from permanent ones, and report failures per message so one bad record does not fail the whole batch:

import { SQSEvent, SQSBatchResponse, SQSBatchItemFailure } from 'aws-lambda';

export async function handler(event: SQSEvent): Promise<SQSBatchResponse> {
  const failures: SQSBatchItemFailure[] = [];

  for (const record of event.Records) {
    try {
      await processPayment(JSON.parse(record.body));
    } catch (error) {
      if (isTransientError(error)) {
        // Report as a failure — the message returns to the queue and is retried
        failures.push({ itemIdentifier: record.messageId });
      } else {
        // Permanent failures are also reported; the redrive policy moves them
        // to the DLQ after maxReceiveCount deliveries
        console.error('Permanent failure for message', record.messageId, error);
        failures.push({ itemIdentifier: record.messageId });
      }
    }
  }

  return { batchItemFailures: failures };
}

function isTransientError(error: unknown): boolean {
  if (error instanceof Error) {
    return error.message.includes('ThrottlingException') ||
           error.message.includes('ServiceUnavailable') ||
           error.message.includes('ETIMEDOUT');
  }
  return false;
}

The key is the batchItemFailures response format, paired with ReportBatchItemFailures on the event source mapping — it tells Lambda exactly which messages failed, so only those return to the queue. Without it, a thrown error retries the entire batch, and a successful return deletes even the messages that failed.
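
If you also want growing delays between redeliveries, a common approach is to compute a capped exponential delay from the message's ApproximateReceiveCount attribute and apply it with the SQS ChangeMessageVisibility API before reporting the item as failed. A sketch of the delay calculation (backoffSeconds and its defaults are illustrative choices, not values from the original system):

```typescript
// Capped exponential backoff derived from how many times SQS has delivered
// the message (its ApproximateReceiveCount attribute). The returned value
// would be applied as the message's new visibility timeout.
function backoffSeconds(receiveCount: number, baseSeconds = 30, capSeconds = 900): number {
  return Math.min(baseSeconds * 2 ** (receiveCount - 1), capSeconds);
}
```

In production you would usually add jitter to the result so retries from a failed batch do not land at the same instant.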

Orchestrating Multi-Step Workflows with Step Functions

EventBridge and SQS handle event routing and delivery. But when you need to coordinate a sequence of steps — charge the payment, reserve the inventory, dispatch the fulfilment notification — you need orchestration, not just choreography.

Direct Lambda-to-Lambda invocation for multi-step workflows is one of the most common architectural mistakes I see. It creates a tightly coupled chain where any step's failure can leave the system in an inconsistent state, and debugging requires reading through logs from multiple functions simultaneously.

Step Functions solves all of this.

When to use Standard vs Express Workflows:

Workflow Type  Duration         Execution Model  Pricing                Use Case
Standard       Up to 1 year     Exactly-once     Per state transition   Order fulfillment, long-running business processes
Express        Up to 5 minutes  At-least-once    Per duration + memory  High-volume, short-lived orchestration (API responses)

For the order processing workflow — which may wait hours for payment confirmation — use a Standard Workflow.

Create the order fulfilment state machine:

aws stepfunctions create-state-machine \
  --name "order-fulfillment" \
  --role-arn "arn:aws:iam::123456789012:role/stepfunctions-execution-role" \
  --type STANDARD \
  --definition '{
    "Comment": "Order fulfillment workflow",
    "StartAt": "ProcessPayment",
    "States": {
      "ProcessPayment": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:eu-central-1:123456789012:function:process-payment",
        "Retry": [{
          "ErrorEquals": ["Lambda.ServiceException", "Lambda.AWSLambdaException"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }],
        "Catch": [{
          "ErrorEquals": ["PaymentDeclined"],
          "Next": "HandlePaymentDeclined"
        }],
        "Next": "ReserveInventory"
      },
      "ReserveInventory": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:eu-central-1:123456789012:function:reserve-inventory",
        "Retry": [{
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 5,
          "MaxAttempts": 2,
          "BackoffRate": 1.5
        }],
        "Catch": [{
          "ErrorEquals": ["InsufficientInventory"],
          "Next": "HandleOutOfStock"
        }],
        "Next": "SendConfirmationEmail"
      },
      "SendConfirmationEmail": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:eu-central-1:123456789012:function:send-confirmation",
        "End": true
      },
      "HandlePaymentDeclined": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:eu-central-1:123456789012:function:notify-payment-failure",
        "End": true
      },
      "HandleOutOfStock": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:eu-central-1:123456789012:function:notify-out-of-stock",
        "End": true
      }
    }
  }'
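
The Catch clauses above match on error names. For a Node.js Lambda, the errorType that Step Functions sees comes from the thrown Error's name property, so the payment function must throw a custom error whose name is exactly "PaymentDeclined". A sketch (PaymentDeclinedError and chargeCard are illustrative names):

```typescript
// Custom error whose name matches the "PaymentDeclined" Catch clause
// in the state machine definition.
class PaymentDeclinedError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'PaymentDeclined'; // Step Functions matches Catch on this value
  }
}

// Simplified stand-in for the real payment call: throws the catchable
// business error when the charge is not authorised.
function chargeCard(authorised: boolean, orderId: string): string {
  if (!authorised) {
    throw new PaymentDeclinedError(`Card declined for ${orderId}`);
  }
  return `charge-${orderId}`;
}
```

Generic runtime failures keep their default names and fall through to the Retry clause instead, which is exactly the split you want: retry infrastructure errors, route business errors.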

Trigger the state machine from EventBridge:

# Route order.placed events to the Step Functions state machine
aws events put-rule \
  --name "order-placed-to-fulfillment" \
  --event-bus-name "ecommerce-events" \
  --event-pattern '{
    "source": ["com.myapp.orders"],
    "detail-type": ["order.placed"]
  }' \
  --state ENABLED

aws events put-targets \
  --event-bus-name "ecommerce-events" \
  --rule "order-placed-to-fulfillment" \
  --targets '[{
    "Id": "fulfillment-sfn",
    "Arn": "arn:aws:states:eu-central-1:123456789012:stateMachine:order-fulfillment",
    "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-sfn-role",
    "InputTransformer": {
      "InputPathsMap": {
        "orderId": "$.detail.orderId",
        "customerId": "$.detail.customerId",
        "totalAmount": "$.detail.totalAmount",
        "items": "$.detail.items"
      },
      "InputTemplate": "{\"orderId\":\"<orderId>\",\"customerId\":\"<customerId>\",\"totalAmount\":<totalAmount>,\"items\":<items>}"
    }
  }]'

End-to-End Observability with X-Ray

An event-driven system spanning three services plus EventBridge, SQS, and Step Functions is already complex enough that reading CloudWatch logs to debug a failed order is painful. X-Ray tracing transforms that experience — a single trace shows you every service the event touched, how long each step took, and exactly where and why it failed.

Enable X-Ray on all components:

# Enable X-Ray on each Lambda function
for function in process-payment reserve-inventory send-confirmation; do
  aws lambda update-function-configuration \
    --function-name "$function" \
    --tracing-config Mode=Active
done

# Enable X-Ray on the Step Functions state machine
aws stepfunctions update-state-machine \
  --state-machine-arn "arn:aws:states:eu-central-1:123456789012:stateMachine:order-fulfillment" \
  --tracing-configuration enabled=true

Create a Service Map query to monitor the order flow:

# Get the X-Ray service map for the last hour
aws xray get-service-graph \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --query 'Services[*].{Name:Name,Type:Type,ResponseTimeHistogram:ResponseTimeHistogram}'

Query X-Ray for traces with errors:

# Query for traces with errors in the last 3 hours
aws xray get-trace-summaries \
  --start-time $(date -u -d '3 hours ago' +%s) \
  --end-time $(date -u +%s) \
  --filter-expression 'fault = true OR error = true' \
  --query 'TraceSummaries[*].{Id:Id,Duration:Duration,HasError:HasError,HasFault:HasFault}'

Add correlation IDs to your events:

The final piece that ties everything together is propagating a correlation ID through every message, log entry, and trace. When a customer calls to report a failed order, you need to be able to pull up every log entry related to that order in seconds.

import { SQSEvent } from 'aws-lambda';
import { Tracer } from '@aws-lambda-powertools/tracer';
import { Logger } from '@aws-lambda-powertools/logger';

// Service name is picked up from the POWERTOOLS_SERVICE_NAME environment variable
const tracer = new Tracer();
const logger = new Logger();

export async function handler(event: SQSEvent) {
  for (const record of event.Records) {
    const payload = JSON.parse(record.body);
    // The correlation ID travels inside the event detail alongside orderId
    const { orderId, correlationId } = payload.detail;

    // Add orderId to all subsequent traces and logs
    tracer.putAnnotation('orderId', orderId);
    logger.appendKeys({ orderId, correlationId });

    logger.info('Processing payment', { totalAmount: payload.detail.totalAmount });

    // ... processing logic
  }
}

A Note on Idempotency

One property that every event consumer must have is idempotency — the ability to process the same event multiple times with the same result. In an at-least-once delivery system like SQS and EventBridge, duplicate delivery is not a bug; it is a documented feature.

For payment processing, this means checking whether a payment for a given orderId already exists before creating a new charge. For inventory reservation, it means reserving idempotently against the order ID, not blindly decrementing stock.

The simplest pattern: use a DynamoDB conditional write to ensure a given event is processed exactly once:

# Record successful processing with a conditional write
aws dynamodb put-item \
  --table-name "processed-events" \
  --item '{"eventId": {"S": "ord-20260201-4829"}, "processedAt": {"S": "2026-02-01T14:32:05Z"}}' \
  --condition-expression "attribute_not_exists(eventId)"

If the condition fails (the item already exists), the event has already been processed — skip it silently. If the condition succeeds, process and record in a single atomic operation.
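
The same claim-then-process flow, sketched with an in-memory map standing in for the DynamoDB table (ProcessedEvents is illustrative; a real implementation would issue the conditional PutItem shown above):

```typescript
// In-memory stand-in for the processed-events table. A real implementation
// would issue a DynamoDB PutItem with attribute_not_exists(eventId).
class ProcessedEvents {
  private seen = new Map<string, string>();

  // Returns true if this call claimed the event (first delivery),
  // false if it was already processed (duplicate delivery).
  claim(eventId: string, processedAt: string): boolean {
    if (this.seen.has(eventId)) return false; // condition failed: already processed
    this.seen.set(eventId, processedAt);
    return true;
  }
}

function handleEvent(store: ProcessedEvents, eventId: string): string {
  if (!store.claim(eventId, new Date().toISOString())) {
    return 'duplicate-skipped'; // skip silently, as described above
  }
  // ... perform the actual side effect exactly once ...
  return 'processed';
}
```

The one subtlety the in-memory version hides: if the side effect fails after the claim succeeds, you must delete the claim (or record a status) so the retry is not wrongly skipped.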

What Production Actually Looks Like

I built a system like this for an e-commerce client handling 40,000 orders per day. After six months in production, the failure rate for the fulfillment workflow is under 0.02% — and those failures are all legitimate business errors (payment declined, out-of-stock) that flow through the Step Functions error paths as designed. Zero messages have been lost. The DLQ alarm has fired four times, each time catching a transient integration issue with the payment gateway that was resolved within minutes.

The key decisions that made it reliable: SQS between EventBridge and Lambda (not direct Lambda triggers), ReportBatchItemFailures on every event source mapping, DLQ alarms wired to PagerDuty, X-Ray active across the full stack, and idempotency at every consumer.

If you are building something similar and want a second pair of eyes on your architecture before you commit to it, book a free consultation. I am happy to review your design and flag the failure modes that are hardest to discover in testing.

Need help with your AWS infrastructure?

Book a free 30-minute consultation to discuss your challenges.