ECS Task Stopped: How to Read Exit Codes and CloudWatch Logs to Find the Root Cause
2026-05-18 · 9 min read
Your ECS service is failing. The tasks keep stopping and restarting. The service event log shows this:
service my-service has reached a steady state.
service my-service was unable to place a task because no container instance
met all of its requirements.
service my-service (instance i-0abc123) (port 8080) is unhealthy in
target-group my-tg. Stopping and restarting task.
Or maybe you see this in your deployment:
Essential container in task exited
These messages tell you something is wrong but not what. To find the root cause, you need to understand exit codes, read CloudWatch logs effectively, and know the right CLI commands to pull diagnostic information from the ECS API.
Step 1: Get the Task Details
When a task stops, ECS records a stoppedReason and the container exit codes. Start here:
# List recently stopped tasks
aws ecs list-tasks \
--cluster my-cluster \
--service-name my-service \
--desired-status STOPPED \
--query 'taskArns[0:5]' \
--output text
# Describe a stopped task to get the full details
aws ecs describe-tasks \
--cluster my-cluster \
--tasks arn:aws:ecs:us-east-1:123456789:task/my-cluster/abc123 \
--query 'tasks[0].{
stoppedReason: stoppedReason,
stopCode: stopCode,
containers: containers[*].{
name: name,
exitCode: exitCode,
reason: reason,
lastStatus: lastStatus
},
stoppedAt: stoppedAt,
startedAt: startedAt,
healthStatus: healthStatus
}'
This gives you the exit code for each container and the reason the task was stopped. The stopCode field tells you whether ECS stopped the task (ServiceSchedulerInitiated, TaskFailedToStart) or the container stopped itself (EssentialContainerExited).
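The two commands above can be chained into a single shot that dumps the stopped reason, stop code, and exit codes for the most recently stopped tasks — a sketch assuming the same cluster and service names as above:

```shell
# Describe the five most recently stopped tasks in one command by feeding
# the output of list-tasks directly into describe-tasks
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks $(aws ecs list-tasks \
      --cluster my-cluster \
      --service-name my-service \
      --desired-status STOPPED \
      --query 'taskArns[0:5]' \
      --output text) \
  --query 'tasks[*].{reason:stoppedReason, stopCode:stopCode, exitCodes:containers[*].exitCode}'
```

If all five tasks show the same exit code, you are looking at one systematic failure rather than five separate incidents.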
Step 2: Understand the Exit Code
The exit code is the most important piece of diagnostic information. Here is what each common code means:
Exit Code 0: Clean Shutdown
The container exited normally. This usually means:
- The container's main process completed (expected for batch/one-off tasks, unexpected for long-running services)
- The application received a SIGTERM and shut down gracefully
If this is a service task that should run continuously, an exit code of 0 means your application's main process is finishing instead of running indefinitely. Check your entrypoint script — is it starting the application in the background and then exiting? A common mistake:
# WRONG: starts the app in the background, then the script exits
CMD ["sh", "-c", "node server.js &"]
# CORRECT: runs the app in the foreground
CMD ["node", "server.js"]
Exit Code 1: Application Error
The most common exit code. The application threw an unhandled exception, a fatal error, or explicitly exited with code 1. This is where CloudWatch Logs become essential — the application should have logged the error before exiting.
Common causes:
- Missing environment variable — the application tried to read a required config value that was not set in the task definition
- Database connection failure — the application cannot reach RDS, ElastiCache, or another dependency at startup
- Port binding failure — the application tried to listen on a port that is already in use or not permitted
- Syntax error or missing dependency — a code deployment introduced a breaking change
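The missing-environment-variable case is worth failing fast on, so the CloudWatch log shows one clear FATAL line instead of a cryptic stack trace from deep inside the app. A minimal entrypoint sketch (a hypothetical pattern, not from the task definition above):

```shell
#!/bin/sh
# require_env: print a FATAL line and return non-zero if the named
# environment variable is empty or unset. Call it for each required
# variable at the top of the container entrypoint.
require_env() {
  eval "val=\${$1}"
  if [ -z "$val" ]; then
    echo "FATAL: required environment variable $1 is not set" >&2
    return 1
  fi
  return 0
}

# In a real entrypoint (DATABASE_URL is a hypothetical required variable):
#   require_env DATABASE_URL || exit 1
#   exec node server.js
```

An exit code of 1 paired with that FATAL line turns a 15-minute log hunt into a 15-second one.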
Exit Code 137: OOM Kill (SIGKILL)
Exit code 137 means the container was killed by a SIGKILL signal (128 + 9 = 137). In ECS, this almost always means the container exceeded its memory limit and was killed by the OOM (Out of Memory) killer.
How to confirm:
# Check if the task was stopped due to OOM
aws ecs describe-tasks \
--cluster my-cluster \
--tasks arn:aws:ecs:us-east-1:123456789:task/my-cluster/abc123 \
--query 'tasks[0].containers[*].{name:name, exitCode:exitCode, reason:reason}'
If the reason field says OutOfMemoryError: Container killed due to memory usage, you have your answer.
The fix: Increase the memory allocation in the task definition. For Fargate tasks, this also means selecting a larger task size. For EC2 launch type, increase the memory or memoryReservation in the container definition.
But before simply increasing memory, investigate why the container is using so much. Common culprits:
- Memory leaks — Node.js applications that never release event listeners, Java applications with growing heap, Python applications holding references to large data structures
- Unbounded caching — in-memory caches without eviction policies
- Large file processing — reading an entire file into memory instead of streaming
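If raising the limit is still the right call, the memory settings live in the container definition. A minimal fragment with illustrative values — memory is the hard limit at which the OOM killer fires, memoryReservation is the soft target used for placement on EC2 instances:

```json
{
  "containerDefinitions": [{
    "name": "my-app",
    "memory": 1024,
    "memoryReservation": 512
  }]
}
```

For Fargate, set cpu and memory at the task level instead, using one of the supported CPU/memory combinations.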
Exit Code 139: Segmentation Fault (SIGSEGV)
Exit code 139 (128 + 11) indicates a segmentation fault. The container's process tried to access memory it was not allowed to access. This is rare in managed languages (Node.js, Python, Java) but common in containers running C/C++ code or native libraries.
The fix: This is usually a bug in native code. Check if you recently updated a native dependency or changed the base Docker image.
Exit Code 143: Graceful Termination (SIGTERM)
Exit code 143 (128 + 15) means the container received a SIGTERM signal and exited. This is the expected behavior during:
- Deployments — ECS sends SIGTERM to old tasks when deploying new ones
- Scale-in events — ECS terminates excess tasks when scaling down
- Manual task stops — someone ran aws ecs stop-task
If you see exit code 143 during a deployment, this is normal. The stopTimeout setting in the task definition controls how long ECS waits after sending SIGTERM before sending SIGKILL. The default is 30 seconds.
If your application needs more time to drain connections and finish in-flight requests, increase the stopTimeout:
{
"containerDefinitions": [{
"name": "my-app",
"stopTimeout": 120
}]
}
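A longer stopTimeout only helps if the SIGTERM actually reaches your application — a shell entrypoint that launches the app as a child will swallow the signal unless it forwards it. A minimal wrapper sketch (an assumed pattern, not from the article's task definition):

```shell
#!/bin/sh
# run_with_term_forwarding: start the app as a background child, forward
# SIGTERM to it, and wait for it to exit. This gives the app the full
# stopTimeout window to drain connections before ECS escalates to SIGKILL.
run_with_term_forwarding() {
  "$@" &
  child=$!
  trap 'kill -TERM "$child" 2>/dev/null' TERM
  wait "$child"
}

# In a real image: run_with_term_forwarding node server.js
```

Better still, use exec in the entrypoint (exec node server.js) so the app becomes PID 1 and receives signals directly.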
No Exit Code (Task Failed to Start)
Sometimes describe-tasks shows no exit code at all, and the stoppedReason says something like:
CannotPullContainerError: Error response from daemon: pull access denied
for 123456789.dkr.ecr.us-east-1.amazonaws.com/my-image, repository does
not exist or may require 'docker login'
This means the task failed before any container started. Common causes:
- Image pull failure — the ECR repository does not exist, the image tag does not exist, or the task execution role does not have ecr:GetDownloadUrlForLayer permission
- Secrets Manager or Parameter Store failure — the task definition references a secret that does not exist or the task execution role cannot access it
- Resource exhaustion — no EC2 instances in the cluster have enough CPU or memory to place the task (EC2 launch type only)
# Check the stopped reason for tasks that failed to start
aws ecs describe-tasks \
--cluster my-cluster \
--tasks arn:aws:ecs:us-east-1:123456789:task/my-cluster/abc123 \
--query 'tasks[0].stoppedReason'
Step 3: Read CloudWatch Logs
For exit codes 1 and 137, the application logs are where you find the actual error. ECS tasks log to CloudWatch Logs using the awslogs log driver.
Find the log group:
The log group is defined in the task definition's logConfiguration:
aws ecs describe-task-definition \
--task-definition my-task \
--query 'taskDefinition.containerDefinitions[*].logConfiguration'
This returns something like:
{
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/my-service",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
}
Read the logs for a specific task:
The log stream name follows the pattern: {prefix}/{container-name}/{task-id}. For a task with ID abc123def456 in a container named my-app with prefix ecs, the log stream is ecs/my-app/abc123def456.
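Since the task ID is the last path segment of the task ARN, the stream name can be built with shell parameter expansion — a sketch using the example values above ("my-app" is the assumed container name):

```shell
# Build the awslogs stream name ({prefix}/{container-name}/{task-id})
# from a task ARN (hypothetical ARN from the examples above)
task_arn="arn:aws:ecs:us-east-1:123456789:task/my-cluster/abc123def456"
task_id="${task_arn##*/}"        # strip everything up to the last "/"
stream="ecs/my-app/${task_id}"
echo "$stream"                   # → ecs/my-app/abc123def456
```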
# Get the last 50 log events from a task
aws logs get-log-events \
--log-group-name /ecs/my-service \
--log-stream-name ecs/my-app/abc123def456 \
--limit 50 \
--no-start-from-head
Use CloudWatch Logs Insights for pattern analysis:
If tasks are failing intermittently, query across all log streams to find common error patterns:
fields @timestamp, @message
| filter @message like /(?i)(error|exception|fatal|panic|killed)/
| sort @timestamp desc
| limit 100
For OOM investigations:
fields @timestamp, @message
| filter @message like /(?i)(memory|heap|oom|allocation)/
| sort @timestamp desc
| limit 50
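These queries can also be run from the CLI rather than the console. A sketch, assuming the log group from earlier and GNU date for the timestamps:

```shell
# Start an Insights query over the last hour; start-query returns a
# queryId, and get-query-results may need to be re-run until its
# status field reports Complete
query_id=$(aws logs start-query \
  --log-group-name /ecs/my-service \
  --start-time "$(date -d '1 hour ago' +%s)" \
  --end-time "$(date +%s)" \
  --query-string 'fields @timestamp, @message | filter @message like /(?i)(error|exception|fatal)/ | sort @timestamp desc | limit 100' \
  --output text --query queryId)

aws logs get-query-results --query-id "$query_id"
```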
Step 4: Check Health Check Configuration
A frequently overlooked cause of ECS task failures is an overly aggressive health check. If the ALB health check marks a target as unhealthy, ECS stops and replaces the task. This creates a loop:
- Task starts
- Application initializes (takes 30 seconds)
- ALB health check fails (because the app is not ready yet)
- ECS stops the task
- ECS starts a new task
- Repeat
Check your target group health check settings:
aws elbv2 describe-target-groups \
--names my-target-group \
--query 'TargetGroups[0].{
HealthCheckPath: HealthCheckPath,
HealthCheckIntervalSeconds: HealthCheckIntervalSeconds,
HealthyThresholdCount: HealthyThresholdCount,
UnhealthyThresholdCount: UnhealthyThresholdCount,
HealthCheckTimeoutSeconds: HealthCheckTimeoutSeconds
}'
If your application takes 30 seconds to start but the health check expects a healthy response within 10 seconds and marks the target unhealthy after just 2 consecutive failures, the task will be killed before it can ever become healthy.
The fix: Increase the health check grace period in the ECS service definition, and set reasonable health check intervals:
# Update the service with a longer health check grace period
aws ecs update-service \
--cluster my-cluster \
--service my-service \
--health-check-grace-period-seconds 120
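The target-group side can be relaxed as well. A sketch with illustrative values — the ARN placeholder would come from the describe-target-groups output above, and the timeout must stay below the interval:

```shell
# Loosen the health check for a slow-starting app: longer interval,
# more consecutive failures tolerated before the target is marked unhealthy
aws elbv2 modify-target-group \
  --target-group-arn "$TG_ARN" \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 10 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 5
```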
Also consider adding a /health endpoint to your application that returns 200 only after all dependencies (database connections, cache connections) are established.
Step 5: Check Service Events
The ECS service event log provides a timeline of what ECS has been doing with your tasks:
aws ecs describe-services \
--cluster my-cluster \
--services my-service \
--query 'services[0].events[0:10]'
Look for repeating patterns:
- "has begun draining connections" followed by "registered targets" in a loop indicates a health check failure cycle
- "was unable to place a task" indicates resource constraints (CPU, memory, or port conflicts on EC2 instances)
- "has reached a steady state" followed quickly by "is unhealthy" indicates the application starts but then crashes
The Debugging Checklist
When an ECS task stops unexpectedly, work through this checklist:
- (1 min) Run describe-tasks to get the exit code and stopped reason
- (1 min) Interpret the exit code (0=clean exit, 1=app error, 137=OOM, 139=segfault, 143=SIGTERM)
- (5 min) Read CloudWatch Logs for the specific task to find error messages
- (2 min) Check the task definition for recent changes (environment variables, image tag, memory limits)
- (2 min) Check health check configuration and grace period
- (2 min) Review service events for placement or health check failure patterns
- (2 min) If no exit code, check for image pull errors or secrets access failures
In my experience, roughly 40% of ECS task failures are exit code 1 (application errors, usually missing config or failed dependencies), 25% are exit code 137 (OOM kills from undersized memory allocations), 20% are health check failures (aggressive timeouts during startup), and 15% are infrastructure issues (image pull failures, secrets access, resource placement). Start with the exit code, follow it to the logs, and you will find the root cause in under 15 minutes.
Need help with your AWS infrastructure?
Book a free 30-minute consultation to discuss your challenges.