ECS Task Stopped: Diagnosing Container Failures and Placement Errors
2026-04-19 · 10 min read
You deploy a new version of your ECS service and watch the deployment stall. Tasks start, run for a few seconds, and stop. New tasks launch to replace them, and they stop too. The service event log fills up with a rotating cast of errors:
service my-service was unable to place a task because no container instance
met all of its requirements. The closest matching container instance
doesn't have enough CPU units available.
CannotPullContainerError: ref pull has been retried 5 time(s):
failed to resolve reference 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:v2.3.1
Essential container in task exited. Exit code: 137
Each of those messages points to a completely different root cause. In this guide, I will walk through the most common reasons ECS tasks stop, show you exactly how to diagnose each one, and give you the CLI commands and configuration fixes to resolve them for good.
Getting the Full Picture: Describe the Stopped Task
Before you can fix anything, you need the stopped reason, exit code, and container-level details. ECS retains stopped task information for about one hour after the task stops, so act quickly.
# List recently stopped tasks for a service
aws ecs list-tasks \
--cluster production \
--service-name my-service \
--desired-status STOPPED \
--query 'taskArns[0:5]' \
--output text
Once you have a task ARN, describe it in detail:
aws ecs describe-tasks \
--cluster production \
--tasks arn:aws:ecs:us-east-1:123456789:task/production/a1b2c3d4 \
--query 'tasks[0].{
stopCode: stopCode,
stoppedReason: stoppedReason,
stoppedAt: stoppedAt,
startedAt: startedAt,
healthStatus: healthStatus,
containers: containers[*].{
name: name,
exitCode: exitCode,
reason: reason,
lastStatus: lastStatus,
networkBindings: networkBindings
},
cpu: cpu,
memory: memory
}'
The stopCode field is your first fork in the diagnostic tree. The three values you will encounter are:
- TaskFailedToStart — the task never reached the RUNNING state. Image pull errors, secrets resolution failures, or resource placement issues.
- EssentialContainerExited — a container marked as essential stopped, so ECS stopped the entire task.
- ServiceSchedulerInitiated — ECS itself stopped the task, usually because of a health check failure or a deployment replacement.
Root Cause 1: CannotPullContainerError
This is the most frustrating error because it stops your task before any application code runs. The image pull fails, and ECS retries several times before giving up.
ECR Permission Issues
The ECS task execution role needs permission to pull images from ECR. If you recently moved to a new ECR repository or changed the task execution role, the pull will fail.
# Check what execution role the task definition uses
aws ecs describe-task-definition \
--task-definition my-service:42 \
--query 'taskDefinition.executionRoleArn'
Verify that role has the required ECR permissions:
aws iam simulate-principal-policy \
--policy-source-arn arn:aws:iam::123456789:role/ecsTaskExecutionRole \
--action-names ecr:GetDownloadUrlForLayer ecr:BatchGetImage ecr:GetAuthorizationToken \
--output table
The minimum IAM policy for ECR image pulls:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage",
"ecr:BatchCheckLayerAvailability"
],
"Resource": "arn:aws:ecr:us-east-1:123456789:repository/my-app"
},
{
"Effect": "Allow",
"Action": "ecr:GetAuthorizationToken",
"Resource": "*"
}
]
}
Image Tag Does Not Exist
A surprisingly common cause: the image tag referenced in the task definition does not exist in the repository. This happens when CI/CD pipelines update the task definition before the image push completes, or when someone deletes an image tag from ECR.
# List available image tags in the repository
aws ecr describe-images \
--repository-name my-app \
--query 'imageDetails[*].{tags: imageTags, pushedAt: imagePushedAt}' \
--output table | head -20
Verify the specific tag exists:
aws ecr describe-images \
--repository-name my-app \
--image-ids imageTag=v2.3.1
If the command returns an error, the tag does not exist and that is your problem. Either push the correct image or update the task definition to reference an existing tag.
VPC Endpoint or NAT Gateway Missing
If your ECS tasks run in private subnets, they need a path to ECR. Without a NAT Gateway or VPC endpoints for ECR, the image pull will time out and fail. You need VPC endpoints for ecr.api, ecr.dkr, and s3 (ECR stores layers in S3). If your containers use the awslogs log driver, they also need a CloudWatch Logs interface endpoint (logs), or the task will fail even after the image pull succeeds.
# Check existing VPC endpoints
aws ec2 describe-vpc-endpoints \
--filters "Name=vpc-id,Values=vpc-0abc123" \
--query 'VpcEndpoints[*].{Service: ServiceName, State: State}' \
--output table
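If any of the three are missing, create them. A sketch of the endpoint creation commands, with hypothetical subnet, security group, and route table IDs (substitute the ones from your own VPC):

```shell
REGION=us-east-1
VPC=vpc-0abc123                           # same VPC as above
SUBNETS="subnet-0aaa111 subnet-0bbb222"   # hypothetical private subnet IDs
SG=sg-0ccc333                             # hypothetical security group

# Interface endpoints for the ECR API and the Docker registry endpoint
for svc in ecr.api ecr.dkr; do
  aws ec2 create-vpc-endpoint \
    --vpc-id "$VPC" \
    --vpc-endpoint-type Interface \
    --service-name "com.amazonaws.${REGION}.${svc}" \
    --subnet-ids $SUBNETS \
    --security-group-ids "$SG"
done

# Gateway endpoint for S3, since ECR serves image layers from S3
aws ec2 create-vpc-endpoint \
  --vpc-id "$VPC" \
  --vpc-endpoint-type Gateway \
  --service-name "com.amazonaws.${REGION}.s3" \
  --route-table-ids rtb-0ddd444           # hypothetical route table ID
```

Remember to enable private DNS on the interface endpoints so the standard ECR hostnames resolve to them.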
Root Cause 2: OutOfMemoryError and Exit Code 137
Exit code 137 means the container received SIGKILL (128 + 9 = 137). In ECS, this almost always means the container exceeded its memory limit and the kernel's OOM killer terminated it.
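You can reproduce the arithmetic locally: a process killed by a signal reports 128 plus the signal number as its exit status.

```shell
# A child shell killed with SIGKILL (signal 9) exits with status 128 + 9 = 137,
# the same code ECS reports for OOM-killed containers.
sh -c 'kill -9 $$' || status=$?
echo "exit status: $status"   # prints "exit status: 137"
```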
There is a subtle but critical distinction in ECS memory configuration that trips up many teams:
- memory (hard limit) — the container is killed if it exceeds this value.
- memoryReservation (soft limit) — used for placement decisions, but the container can burst above it.
If you set memory: 512 and your Java application's heap grows to 600MB, the container is killed instantly with exit code 137. No graceful shutdown. No error log. Just dead.
Diagnosing OOM Kills
# Check the container's reason field for OOM
aws ecs describe-tasks \
--cluster production \
--tasks arn:aws:ecs:us-east-1:123456789:task/production/a1b2c3d4 \
--query 'tasks[0].containers[?exitCode==`137`].{name: name, reason: reason}'
For Fargate tasks, check the Container Insights memory utilization metric:
aws cloudwatch get-metric-statistics \
--namespace ECS/ContainerInsights \
--metric-name MemoryUtilized \
--dimensions Name=ClusterName,Value=production Name=ServiceName,Value=my-service \
--start-time 2026-04-18T00:00:00Z \
--end-time 2026-04-19T00:00:00Z \
--period 300 \
--statistics Average Maximum \
--output table
If the Maximum is consistently hitting or exceeding the configured memory, you have confirmed the OOM kill.
Fixing OOM: Task Memory vs Container Memory
For Fargate, both the task-level CPU/memory and the container-level memory must be set correctly. The task-level memory is the total available to all containers, and each container's hard limit must not exceed it.
{
"family": "my-service",
"cpu": "1024",
"memory": "2048",
"containerDefinitions": [
{
"name": "my-app",
"memory": 1792,
"memoryReservation": 1024,
"essential": true
},
{
"name": "datadog-agent",
"memory": 256,
"memoryReservation": 128,
"essential": false
}
]
}
Notice the sidecar container (datadog-agent) is marked essential: false. If the sidecar crashes, the main task continues running. Reserve enough headroom in the task memory for both containers plus some buffer.
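For JVM workloads specifically, cap the heap relative to the container limit so the application hits its own OutOfMemoryError (with a stack trace in the logs) instead of a silent SIGKILL. A sketch of the relevant container definition fragment; it assumes Java 10 or later, where the JVM reads the container's memory limit:

```json
{
  "name": "my-app",
  "memory": 1792,
  "environment": [
    {
      "name": "JAVA_TOOL_OPTIONS",
      "value": "-XX:MaxRAMPercentage=75.0"
    }
  ]
}
```

With MaxRAMPercentage=75.0, the heap tops out at roughly 1344MB of the 1792MB hard limit, leaving headroom for metaspace, threads, and off-heap buffers.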
Root Cause 3: Essential Container Exit Codes
When a container marked essential: true exits, ECS stops the entire task. Understanding exit codes is the key to diagnosis.
Exit code 1 means your application threw an unhandled exception or explicitly exited with a failure. The root cause is in your application logs:
# Get the log configuration from the task definition
aws ecs describe-task-definition \
--task-definition my-service:42 \
--query 'taskDefinition.containerDefinitions[0].logConfiguration'
# Read the last 100 log events
aws logs get-log-events \
--log-group-name /ecs/my-service \
--log-stream-name ecs/my-app/a1b2c3d4 \
--limit 100 \
--start-from-head false \
--query 'events[*].message'
Common exit code 1 causes: missing environment variables (the task definition references a Secrets Manager secret that does not exist), failed database connection at startup, port binding conflict, or a bad configuration file.
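One quick way to rule out the missing-secret case is to list every secret the task definition references and check that each one resolves. A sketch, assuming the references are Secrets Manager ARNs (for SSM parameter references, substitute aws ssm get-parameter):

```shell
# List every valueFrom reference in the task definition's secrets blocks,
# then verify each one exists in Secrets Manager.
aws ecs describe-task-definition \
  --task-definition my-service:42 \
  --query 'taskDefinition.containerDefinitions[*].secrets[*].valueFrom' \
  --output text | tr '\t' '\n' | while read -r arn; do
    aws secretsmanager describe-secret \
      --secret-id "$arn" \
      --query 'Name' --output text >/dev/null 2>&1 \
      || echo "MISSING: $arn"
done
```

Any line printed as MISSING is a secret the execution role will fail to resolve at task start.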
Exit code 255 typically means the entrypoint or CMD exited with an out-of-range status (for example, exit -1 wraps to 255). Exit codes 126 and 127 mean the runtime found the command but could not execute it, or could not find it at all. Check your Dockerfile:
# Verify the entrypoint and command in the running task definition
aws ecs describe-task-definition \
--task-definition my-service:42 \
--query 'taskDefinition.containerDefinitions[0].{
entryPoint: entryPoint,
command: command,
image: image
}'
A common cause is a shell script entrypoint that is missing a shebang line or has Windows line endings (CRLF) from being edited on a Windows machine.
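You can detect and strip CRLF endings locally before building the image. A sketch using a throwaway file to stand in for your entrypoint script:

```shell
# Simulate an entrypoint script saved with Windows (CRLF) line endings
printf '#!/bin/sh\r\necho hello\r\n' > entrypoint.sh

# Detect the carriage returns and strip them in place
CR=$(printf '\r')
if grep -q "$CR" entrypoint.sh; then
  echo "CRLF endings detected, stripping"
  tr -d '\r' < entrypoint.sh > entrypoint.tmp && mv entrypoint.tmp entrypoint.sh
fi

sh entrypoint.sh   # now runs cleanly and prints "hello"
```

Running this check in CI before docker build catches the problem long before ECS does.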
Root Cause 4: Placement Constraint Failures
On the EC2 launch type, ECS must find a container instance with enough available CPU, memory, and ports to place the task. When no instance meets the requirements, you get placement failures.
# Check the available resources on container instances
aws ecs list-container-instances \
--cluster production \
--query 'containerInstanceArns' \
--output text | tr '\t' '\n' | while read arn; do
aws ecs describe-container-instances \
--cluster production \
--container-instances "$arn" \
--query 'containerInstances[0].{
instance: ec2InstanceId,
status: status,
cpu_remaining: remainingResources[?name==`CPU`].integerValue | [0],
memory_remaining: remainingResources[?name==`MEMORY`].integerValue | [0],
running_tasks: runningTasksCount
}'
done
If all instances show low remaining CPU or memory, you need to either scale out your cluster (add more instances) or reduce the resource reservations in your task definitions.
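Scaling out can be as simple as raising the desired capacity on the cluster's Auto Scaling group. A sketch, where ecs-production-asg is a hypothetical group name (if you use ECS capacity providers with managed scaling, ECS adjusts this for you):

```shell
# Check current capacity before changing anything
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names ecs-production-asg \
  --query 'AutoScalingGroups[0].{desired: DesiredCapacity, max: MaxSize}'

# Add capacity; make sure the new value does not exceed MaxSize
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name ecs-production-asg \
  --desired-capacity 6 \
  --honor-cooldown
```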
For Fargate tasks, placement failures are rare but can happen if you hit the Fargate task limit for your account or if you are requesting a CPU/memory combination that is not available:
# Check your Fargate on-demand task quota
aws service-quotas get-service-quota \
--service-code fargate \
--quota-code L-790F8B95 \
--query 'Quota.Value'
Root Cause 5: Health Check Failures Causing Task Cycling
This is the most insidious failure mode because the task starts successfully, the application runs, but ECS keeps killing and restarting it. The culprit is a health check that fails before the application is fully ready.
Pull the recent service events to see the cycling pattern:
aws ecs describe-services \
--cluster production \
--services my-service \
--query 'services[0].events[0:15].message'
You will see a repeating loop: task registered, target unhealthy, task draining, task stopped, new task started.
The fix is two-fold. First, increase the health check grace period:
aws ecs update-service \
--cluster production \
--service my-service \
--health-check-grace-period-seconds 180
Second, make sure your ALB health check settings are reasonable:
aws elbv2 modify-target-group \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789:targetgroup/my-tg/abc123 \
--health-check-interval-seconds 30 \
--healthy-threshold-count 2 \
--unhealthy-threshold-count 5 \
--health-check-timeout-seconds 10
An unhealthy threshold of 5 with a 30-second interval gives your application 150 seconds to start responding before ECS kills it.
Using ECS Exec for Live Debugging
When logs are not enough, ECS Exec lets you open an interactive shell session inside a running container. This is invaluable for checking network connectivity, file system state, or environment variables that the application sees at runtime.
# Enable ECS Exec on the service (requires task role with SSM permissions)
aws ecs update-service \
--cluster production \
--service my-service \
--enable-execute-command
# Exec into a running container
aws ecs execute-command \
--cluster production \
--task arn:aws:ecs:us-east-1:123456789:task/production/xyz789 \
--container my-app \
--command "/bin/sh" \
--interactive
From inside the container, you can test network connectivity to dependencies, check environment variables, and inspect the application's runtime state.
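Once inside, a few quick checks cover the usual suspects. The hostnames and ports here are hypothetical stand-ins for your own dependencies, and some minimal images may lack tools like nc:

```shell
# Verify the environment the application actually sees
env | sort | head -20

# DNS resolution and TCP reachability for a dependency
getent hosts my-db.internal.example.com
nc -zv -w 3 my-db.internal.example.com 5432

# Disk space and the process list
df -h /
ps aux
```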
Prevention: Stop These Errors Before They Happen
After debugging hundreds of ECS task failures, here is the prevention checklist we use with every client:
- Set up Container Insights — enable it on the cluster to get CPU and memory metrics at the task and container level. Create CloudWatch alarms when memory utilization exceeds 80%.
- Use deployment circuit breakers — ECS can automatically roll back a deployment that is failing instead of endlessly cycling tasks:
aws ecs update-service \
--cluster production \
--service my-service \
--deployment-configuration '{
"deploymentCircuitBreaker": {
"enable": true,
"rollback": true
},
"maximumPercent": 200,
"minimumHealthyPercent": 100
}'
- Pin image tags — never use latest in production. Use immutable tags tied to your CI build number or git commit SHA.
- Set health check grace periods — always set this to at least twice your application's startup time.
- Use ECR image scanning — enable scan-on-push to catch vulnerabilities before they reach ECS.
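The last two repository-level items can be enforced directly on ECR:

```shell
# Reject tag overwrites so a build number always maps to exactly one image
aws ecr put-image-tag-mutability \
  --repository-name my-app \
  --image-tag-mutability IMMUTABLE

# Scan every pushed image for known vulnerabilities
aws ecr put-image-scanning-configuration \
  --repository-name my-app \
  --image-scanning-configuration scanOnPush=true
```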
When to Bring in Help
ECS task failures are often symptoms of deeper architectural issues: undersized infrastructure, missing observability, or deployment pipelines that lack safety nets. If your team is spending more time fighting ECS than building features, it may be time for an expert review.
We help teams audit their ECS configurations, implement proper monitoring, and build deployment pipelines that fail safely. Reach out for a free AWS consultation and let us take a look at your setup.
Need help with your AWS infrastructure?
Book a free 30-minute consultation to discuss your challenges.