ECR RepositoryNotFoundException and Image Pull Failures in ECS/EKS
2026-05-06 · 8 min read
Your ECS deployment just failed. The service event log shows the dreaded message:
CannotPullContainerError: Error response from daemon: pull access denied for
123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app, repository does not exist
or may require 'docker login'
Or perhaps your EKS pod is stuck in ImagePullBackOff:
Failed to pull image "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:v2.1.0":
rpc error: code = NotFound desc = failed to pull and unpack image: failed to
resolve reference: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:v2.1.0:
not found
These errors look simple — the image is not there. But in my experience consulting with AWS teams, the image almost always exists. The problem is usually permissions, authentication, networking, or a subtle naming mismatch. Let me walk you through the systematic approach to diagnosing and fixing ECR image pull failures.
Step 1: Verify the Repository Exists
Start with the obvious. Confirm the repository exists in the expected account and region:
aws ecr describe-repositories \
--repository-names my-app \
--region us-east-1 \
--query 'repositories[0].{Name:repositoryName,URI:repositoryUri,CreatedAt:createdAt}' \
--output table
If this returns RepositoryNotFoundException, either the repository name is wrong, you are targeting the wrong region, or you are authenticated to the wrong AWS account.
Check your current identity:
aws sts get-caller-identity \
--query '{Account:Account,Arn:Arn}' \
--output table
Common mistakes: the repository is named myapp but the task definition references my-app, or the repository is in eu-west-1 but the ECS cluster is in us-east-1.
Root Cause 1: ECR Authentication Token Expired
ECR authentication tokens are valid for only 12 hours. If your deployment pipeline caches the token or your node's kubelet token has expired, image pulls will fail with an access denied error that looks like a repository not found error.
For ECS on EC2, the ECS agent handles authentication automatically — but only if the EC2 instance role has the correct permissions. For EKS, you need to ensure the kubelet can refresh the ECR token.
Check if you can authenticate right now:
aws ecr get-login-password --region us-east-1 | docker login \
--username AWS \
--password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
If this fails, your IAM permissions are the problem. The role needs at minimum:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ecr:GetAuthorizationToken"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage"
],
"Resource": "arn:aws:ecr:us-east-1:123456789012:repository/my-app"
}
]
}
Note that ecr:GetAuthorizationToken must have Resource: "*" — it cannot be scoped to a specific repository.
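For EKS specifically, the kubelet authenticates with the node instance role (or the pod's IRSA role), so that role needs the same pull permissions. A quick check, assuming an eksctl-style role name you would substitute with your own:

```shell
# List policies on the EKS node instance role; the role name below is an
# example -- substitute the role attached to your node group.
aws iam list-attached-role-policies \
  --role-name eksctl-my-cluster-nodegroup-ng-1-NodeInstanceRole \
  --query 'AttachedPolicies[*].PolicyName' \
  --output table
# AmazonEC2ContainerRegistryReadOnly (or an equivalent custom policy)
# should appear in the output.
```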
Root Cause 2: Missing ECS Task Execution Role Permissions
This is the most common cause of image pull failures in ECS Fargate. The task execution role (not the task role) is what ECS uses to pull images and send logs. If you recently created a new task definition or changed the execution role, the ECR permissions may be missing.
Check the task execution role:
# Get the execution role from the task definition
aws ecs describe-task-definition \
--task-definition my-app \
--query 'taskDefinition.executionRoleArn' \
--output text
# Check what policies are attached
ROLE_NAME=$(aws ecs describe-task-definition \
--task-definition my-app \
--query 'taskDefinition.executionRoleArn' \
--output text | awk -F'/' '{print $NF}')
aws iam list-attached-role-policies \
--role-name $ROLE_NAME \
--query 'AttachedPolicies[*].PolicyName' \
--output table
The execution role should have the AWS-managed policy AmazonECSTaskExecutionRolePolicy attached, or equivalent custom permissions. If you are using a custom policy, verify it includes:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ecr:GetAuthorizationToken",
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": "*"
}
]
}
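If the managed policy turns out to be missing, attaching it is usually the fastest fix. This sketch assumes the default role name ecsTaskExecutionRole; substitute yours:

```shell
# Attach the AWS-managed execution role policy to the task execution role
aws iam attach-role-policy \
  --role-name ecsTaskExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
```

New tasks pick up the change on their next launch; no task definition update is needed.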
Root Cause 3: Image Tag Does Not Exist
You are pulling my-app:v2.1.0 but that tag was never pushed — or it was pushed and then overwritten or deleted by a lifecycle policy.
List the available tags:
aws ecr describe-images \
--repository-name my-app \
--query 'reverse(sort_by(imageDetails,&imagePushedAt))[:20].{Tags:imageTags,PushedAt:imagePushedAt,Size:imageSizeInBytes}' \
--output table
Check if a specific tag exists:
aws ecr batch-get-image \
--repository-name my-app \
--image-ids imageTag=v2.1.0 \
--query '{Found:images[0].imageId,Failures:failures}' \
--output json
If the tag is missing, check two things:
Lifecycle policies may have deleted the image:
aws ecr get-lifecycle-policy \
--repository-name my-app \
--query 'lifecyclePolicyText' \
--output text | python3 -m json.tool
A common misconfiguration is a lifecycle policy that deletes images older than N days without excluding tagged releases. This safer policy keeps the 10 most recent release-tagged images (tags prefixed with v) and removes only older ones and untagged images:
{
"rules": [
{
"rulePriority": 1,
"description": "Keep only last 10 tagged images",
"selection": {
"tagStatus": "tagged",
"tagPrefixList": ["v"],
"countType": "imageCountMoreThan",
"countNumber": 10
},
"action": {
"type": "expire"
}
},
{
"rulePriority": 2,
"description": "Remove untagged images after 1 day",
"selection": {
"tagStatus": "untagged",
"countType": "sinceImagePushed",
"countUnit": "days",
"countNumber": 1
},
"action": {
"type": "expire"
}
}
]
}
Tag mutability issues: if your repository allows tag mutability (the default), someone could have pushed a different image to the same tag, or the tag could have been moved. Consider enabling tag immutability for production repositories:
aws ecr put-image-tag-mutability \
--repository-name my-app \
--image-tag-mutability IMMUTABLE
Root Cause 4: Cross-Account ECR Access
When your ECS/EKS workloads run in Account B but the ECR repository is in Account A, you need a repository policy in Account A that allows Account B to pull images.
Check the repository policy:
aws ecr get-repository-policy \
--repository-name my-app \
--query 'policyText' \
--output text | python3 -m json.tool
If this returns RepositoryPolicyNotFoundException, no cross-account access has been configured. Save a policy like the following as ecr-policy.json and apply it:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowCrossAccountPull",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::222222222222:root"
},
"Action": [
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage"
]
}
]
}
aws ecr set-repository-policy \
--repository-name my-app \
--policy-text file://ecr-policy.json
The consuming account also needs IAM permissions — the repository policy alone is not sufficient. The task execution role in Account B needs ecr:GetAuthorizationToken (on *) and the image pull actions on the repository ARN in Account A.
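Concretely, the execution role in Account B might carry a statement like this (the account ID 123456789012 stands in for Account A, matching the earlier examples; adjust the region and repository name to yours):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ecr:GetAuthorizationToken"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ],
      "Resource": "arn:aws:ecr:us-east-1:123456789012:repository/my-app"
    }
  ]
}
```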
Root Cause 5: VPC Endpoints Missing for Private Subnets
If your ECS tasks or EKS pods run in private subnets without a NAT gateway, they cannot reach the public ECR endpoints. You need VPC endpoints for ECR.
ECR requires two VPC endpoints — this is a common oversight:
- com.amazonaws.us-east-1.ecr.api — for ECR API calls (authentication, describe)
- com.amazonaws.us-east-1.ecr.dkr — for Docker image layer downloads
You also need an S3 gateway endpoint because ECR stores image layers in S3:
- com.amazonaws.us-east-1.s3 — gateway endpoint for S3
Check if the endpoints exist:
aws ec2 describe-vpc-endpoints \
--filters "Name=vpc-id,Values=vpc-abc123" \
--query 'VpcEndpoints[*].{Service:ServiceName,State:State,Type:VpcEndpointType}' \
--output table
If any of the three are missing, image pulls will hang and eventually time out. The timeout looks like a network error, not a permissions error, which makes it harder to diagnose:
CannotPullContainerError: ref pull has been retried 5 time(s): failed to copy:
httpReadSeeker: failed open: failed to do request: dial tcp
123456789012.dkr.ecr.us-east-1.amazonaws.com:443: i/o timeout
Create the missing endpoints:
# ECR API endpoint
aws ec2 create-vpc-endpoint \
--vpc-id vpc-abc123 \
--vpc-endpoint-type Interface \
--service-name com.amazonaws.us-east-1.ecr.api \
--subnet-ids subnet-111 subnet-222 \
--security-group-ids sg-xxx \
--private-dns-enabled
# ECR Docker endpoint
aws ec2 create-vpc-endpoint \
--vpc-id vpc-abc123 \
--vpc-endpoint-type Interface \
--service-name com.amazonaws.us-east-1.ecr.dkr \
--subnet-ids subnet-111 subnet-222 \
--security-group-ids sg-xxx \
--private-dns-enabled
# S3 gateway endpoint
aws ec2 create-vpc-endpoint \
--vpc-id vpc-abc123 \
--vpc-endpoint-type Gateway \
--service-name com.amazonaws.us-east-1.s3 \
--route-table-ids rtb-xxx
Ensure the security group on the interface endpoints allows inbound HTTPS (port 443) from the security group used by your ECS tasks or EKS pods.
Root Cause 6: Pull-Through Cache Misconfiguration
If you are using ECR pull-through cache rules to mirror images from Docker Hub, Quay, or other registries, the cached image may not have been pulled yet, or the upstream credentials may be invalid.
Check your pull-through cache rules:
aws ecr describe-pull-through-cache-rules \
--query 'pullThroughCacheRules[*].{Prefix:ecrRepositoryPrefix,Upstream:upstreamRegistryUrl}' \
--output table
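A cached repository is created lazily on the first pull through the cache prefix, so a tag that has never been requested will not exist yet. You can prime it manually; this sketch assumes a rule with the prefix docker-hub and uses the example account and region from earlier (for Docker Hub upstreams, ECR also requires valid credentials stored in Secrets Manager):

```shell
# Authenticate, then pull through the cache prefix to trigger the first
# upstream fetch; subsequent pulls are served from ECR.
aws ecr get-login-password --region us-east-1 | docker login \
  --username AWS \
  --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

docker pull 123456789012.dkr.ecr.us-east-1.amazonaws.com/docker-hub/library/nginx:latest
```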
Diagnosis Workflow Summary
When you encounter an ECR image pull failure, run through this checklist:
- Verify the repository exists in the correct account and region
- Check your authentication — can you get a login token?
- Verify the image tag exists — was it deleted by a lifecycle policy?
- Check the task execution role for ECR pull permissions
- For cross-account: verify both the repository policy and IAM permissions
- For private subnets: ensure all three VPC endpoints exist (ecr.api, ecr.dkr, s3)
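The first three checks can be run as one script. This is a triage sketch built from the commands shown earlier; REPO, TAG, and REGION are placeholders you substitute:

```shell
#!/usr/bin/env bash
# ECR pull-failure triage: identity, repository, tag, and auth checks.
set -euo pipefail

REPO=my-app
TAG=v2.1.0
REGION=us-east-1

echo "== Who am I? =="
aws sts get-caller-identity --query Arn --output text

echo "== Does the repository exist? =="
aws ecr describe-repositories --repository-names "$REPO" --region "$REGION" \
  --query 'repositories[0].repositoryUri' --output text

echo "== Does the tag exist? =="
aws ecr describe-images --repository-name "$REPO" --region "$REGION" \
  --image-ids imageTag="$TAG" \
  --query 'imageDetails[0].imagePushedAt' --output text

echo "== Can I get an auth token? =="
aws ecr get-login-password --region "$REGION" > /dev/null && echo "token OK"
```

With set -e, the script stops at the first failing check, which tells you where to dig in.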
Prevention Best Practices
- Use image digests instead of tags for production deployments. Digests are immutable and cannot be accidentally overwritten or deleted:
# Get the digest for a specific tag
aws ecr batch-get-image \
--repository-name my-app \
--image-ids imageTag=v2.1.0 \
--query 'images[0].imageId.imageDigest' \
--output text
- Enable tag immutability on production repositories to prevent tags from being overwritten.
- Audit lifecycle policies regularly to ensure they are not deleting images that are still referenced by running task definitions.
- Use ECR image scanning to catch vulnerabilities before deployment, and set up EventBridge rules to alert on critical findings.
- Standardize ECR permissions in infrastructure as code. Define the task execution role, repository policy, and VPC endpoints in CloudFormation or CDK so they are always consistent.
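As an example of digest pinning, a container definition can reference the image by digest instead of tag; sha256:&lt;digest&gt; stands for the 64-character value returned by batch-get-image:

```json
{
  "name": "my-app",
  "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app@sha256:<digest>"
}
```

A digest reference always resolves to exactly one image, so lifecycle policies and tag overwrites cannot change what a running service pulls.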
Need Help with Container Deployments?
ECR image pull failures block deployments and frustrate engineering teams. If your container deployment pipeline is unreliable, or you are setting up cross-account ECR access and VPC endpoints for the first time, we can help you get it right. Contact us for a free AWS consultation — we specialize in building reliable container deployment pipelines on ECS and EKS.