FinOps

Your AWS Bill Is Out of Control. Here's Where to Look First.

2026-03-16 · 9 min read

You opened your AWS bill this month and the number made your stomach drop. Maybe it crept up gradually — $12K, then $18K, then $27K — or maybe a single line item spiked overnight. Either way, you know something is wrong but you have no idea where to start looking.

I have run cost optimization engagements for over 40 AWS accounts, ranging from early-stage startups burning through runway to enterprise teams managing seven-figure monthly spend. The patterns repeat themselves with remarkable consistency. In this post, I am going to walk you through the five places I always look first, in the order that typically yields the biggest savings.

1. NAT Gateway: The Silent Budget Killer

If I had to bet on a single line item being the root cause of an unexpectedly high AWS bill, I would bet on NAT Gateway every single time.

Here is why NAT Gateways are so expensive. AWS charges $0.045 per hour per NAT Gateway (us-east-1 pricing) — roughly $33/month just to have one running. But the real cost is data processing: $0.045 per GB of data processed. If your private subnet workloads are making frequent API calls to AWS services, pulling container images, downloading dependencies, or streaming logs, every byte of that traffic passes through your NAT Gateway — and for internet-bound traffic you pay twice: once for the NAT Gateway processing fee, and again for the standard data transfer charge.
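Those two rates compound quickly. Here is a back-of-envelope sketch for a single gateway — the 2 TB/month processing volume is an assumed figure for illustration, not a measurement:

```shell
# Monthly NAT Gateway cost = hourly charge + per-GB processing fee
# (us-east-1 rates; the 2 TB/month volume is an assumed figure).
HOURS=730          # hours in an average month
GB_PROCESSED=2000  # ~2 TB/month through the gateway
HOURLY_RATE=0.045  # $/hour per NAT Gateway
DATA_RATE=0.045    # $/GB processed

COST=$(awk -v h="$HOURS" -v g="$GB_PROCESSED" -v hr="$HOURLY_RATE" -v dr="$DATA_RATE" \
  'BEGIN { printf "%.2f", h * hr + g * dr }')
echo "Estimated NAT Gateway cost: \$${COST}/month"
```

Multiply by the number of gateways — most highly available VPCs run one per AZ.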

I recently audited a client running a microservices platform on ECS in private subnets. Their NAT Gateway bill was $4,200/month — more than their entire compute spend. The culprit: every container pulled its image from ECR through the NAT Gateway on every deployment, and their CloudWatch log agent was streaming gigabytes of verbose debug logs through it daily.

How to check your NAT Gateway costs right now:

# Check NAT Gateway data processing costs for the last 30 days
# (Cost Explorer usage types carry a region prefix, e.g. USE1-NatGateway-Bytes
# for us-east-1 — adjust the value to match your region)
aws ce get-cost-and-usage \
  --time-period Start=2026-02-01,End=2026-03-01 \
  --granularity MONTHLY \
  --metrics "UnblendedCost" \
  --filter '{
    "Dimensions": {
      "Key": "USAGE_TYPE",
      "Values": ["USE1-NatGateway-Bytes"]
    }
  }' \
  --profile your-profile

# List all NAT Gateways and their associated subnets
aws ec2 describe-nat-gateways \
  --query 'NatGateways[*].[NatGatewayId,SubnetId,State]' \
  --output table

The fix: Deploy VPC endpoints for the AWS services your workloads access most frequently. The typical candidates are S3, DynamoDB, ECR (both ecr.api and ecr.dkr), CloudWatch Logs, STS, and Secrets Manager. A Gateway endpoint for S3 is free. Interface endpoints cost $0.01/hour per AZ plus $0.01/GB — still far cheaper than NAT Gateway processing fees when you are moving serious volume.

For the client I mentioned, deploying VPC endpoints for ECR, S3, and CloudWatch Logs reduced their NAT Gateway bill from $4,200/month to $380/month.

2. Data Transfer: Death by a Thousand Cross-AZ Charges

AWS data transfer pricing is one of the most complex and least understood parts of the billing model. Most teams know that egress to the internet is expensive ($0.09/GB for the first 10TB), but they miss the cross-AZ transfer charges that accumulate silently.

Every time a service in us-east-1a talks to a service in us-east-1b, you pay $0.01/GB in each direction — $0.02/GB round trip. That sounds trivial until you realize your application makes thousands of these calls per second.
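At those per-call volumes the math adds up fast. A sketch with assumed traffic figures (request rate and payload size are illustrative, not measured):

```shell
# Monthly cross-AZ cost at the $0.02/GB round-trip rate.
# All traffic figures below are assumptions for illustration.
REQ_PER_SEC=2000
BYTES_PER_REQ=5000                 # ~5 KB request + response combined
SECONDS_PER_MONTH=$((86400 * 30))

COST=$(awk -v r="$REQ_PER_SEC" -v b="$BYTES_PER_REQ" -v s="$SECONDS_PER_MONTH" \
  'BEGIN { printf "%.2f", r * s * b / 1e9 * 0.02 }')
echo "Estimated cross-AZ transfer: \$${COST}/month"
```

That is over $500/month for traffic that never leaves the region — and it shows up in the bill as an EC2 line item, not under any of the services actually generating it.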

Common scenarios where cross-AZ transfer costs explode:

  • ECS/EKS services communicating across AZs — if your service mesh routes requests without AZ affinity, every inter-service call may cross AZ boundaries
  • RDS Multi-AZ replication — synchronous replication traffic between AZs is not free
  • ElastiCache clusters — if your Redis cluster spans AZs and your application nodes are not AZ-aware, cache reads cross AZ boundaries constantly
  • Load balancers fanning out across AZs — cross-zone load balancing distributes traffic evenly, so with two AZs roughly half your traffic crosses AZ boundaries; ALB absorbs that inter-AZ transfer cost, but NLB bills it to you when cross-zone is enabled

How to check data transfer costs:

# Break down data transfer costs by usage type
aws ce get-cost-and-usage \
  --time-period Start=2026-02-01,End=2026-03-01 \
  --granularity MONTHLY \
  --metrics "UnblendedCost" "UsageQuantity" \
  --group-by Type=DIMENSION,Key=USAGE_TYPE \
  --filter '{
    "Dimensions": {
      "Key": "SERVICE",
      "Values": ["Amazon Elastic Compute Cloud - Compute"]
    }
  }' \
  --output json

Look for usage types containing DataTransfer-Regional-Bytes — these are your cross-AZ charges.

The fix: Enable AZ-aware routing wherever possible. For ECS services, use Service Connect with Availability Zone-aware routing. For ElastiCache, configure clients to prefer replicas in their own AZ for reads. For internal service-to-service traffic, consider whether you actually need Multi-AZ for non-critical services. In development and staging environments, running single-AZ can cut data transfer costs by 40-60%.

3. Idle and Forgotten Resources

Every AWS account I have ever audited has resources that someone provisioned months ago and forgot about. The usual suspects:

  • Unattached EBS volumes — someone deleted an EC2 instance but the volumes persisted. At $0.08/GB/month for gp3, a forgotten 500GB volume costs $40/month forever.
  • Idle Elastic IPs — since February 2024, AWS charges $0.005/hour (about $3.65/month) for every public IPv4 address, so an EIP allocated but attached to nothing is pure waste.
  • Old EBS snapshots — incremental snapshots accumulate over time. I have seen accounts with 15TB+ of snapshots for instances that were terminated years ago.
  • Unused Elastic Load Balancers — a running ALB costs about $16/month in hourly charges alone, even with zero traffic.
  • Stopped EC2 instances with attached EBS — the instance is not running but the storage keeps billing.

How to find idle resources:

# Find unattached EBS volumes
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].[VolumeId,Size,CreateTime]' \
  --output table
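Once you have that list, the waste is easy to price at the gp3 rate above. A sketch with the volume sizes stubbed in place of real describe-volumes output (the three sizes are assumed figures):

```shell
# Price unattached volumes at $0.08/GB-month (gp3).
# SIZES stands in for the Size column of real describe-volumes output.
SIZES="500 100 80"   # GB, assumed figures

TOTAL_GB=$(echo "$SIZES" | tr ' ' '\n' | awk '{ gb += $1 } END { print gb }')
COST=$(awk -v g="$TOTAL_GB" 'BEGIN { printf "%.2f", g * 0.08 }')
echo "${TOTAL_GB} GB of unattached volumes ≈ \$${COST}/month"
```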

# Find Elastic IPs not associated with an instance
aws ec2 describe-addresses \
  --query 'Addresses[?AssociationId==`null`].[PublicIp,AllocationId]' \
  --output table

# List unhealthy targets for a given target group (loop over your target
# groups to spot load balancers with nothing healthy behind them)
aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn> \
  --query 'TargetHealthDescriptions[?TargetHealth.State!=`healthy`]'

The fix: Set up AWS Config rules to automatically detect idle resources. The managed rules ec2-volume-inuse-check, eip-attached, and elb-deletion-protection-enabled are good starting points. For ongoing hygiene, create a monthly cleanup process and tag every resource with an owner and expiration date.

4. Over-Provisioned Compute

This is the most common cost problem and the easiest to fix. Most teams provision EC2 instances based on peak load estimates that never materialize, then never revisit the decision.

The numbers are staggering. Utilization studies consistently put average EC2 CPU utilization below 15%, which means many customers are paying for 6-7x more compute than they actually use.

How to identify over-provisioned instances:

# Get Compute Optimizer recommendations
aws compute-optimizer get-ec2-instance-recommendations \
  --query 'instanceRecommendations[*].[instanceArn,currentInstanceType,recommendationOptions[0].instanceType,finding]' \
  --output table

If you have not enrolled in Compute Optimizer, do it now — it is free:

aws compute-optimizer update-enrollment-status --status Active

Beyond EC2, check these compute resources:

  • RDS instances — a db.r6g.2xlarge running at 5% CPU should be a db.r6g.large
  • ElastiCache nodes — Redis clusters are frequently over-provisioned because teams are afraid of evictions
  • Lambda functions — a function allocated 1024MB of memory that peaks at 200MB may be paying roughly 5x too much, since Lambda pricing scales linearly with memory allocation (but CPU scales with it too, so confirm duration does not balloon after you cut memory)
  • ECS/Fargate tasks — task definitions with 4 vCPU / 8GB RAM running containers that use 0.5 vCPU / 1GB are wasting 75% of their allocation
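The Lambda case is worth making concrete, because the memory-to-cost relationship is exactly linear. A sketch with assumed invocation volume and duration (both illustrative figures):

```shell
# Lambda compute cost = invocations x duration(s) x memory(GB) x $0.0000166667/GB-s
# (x86 pricing). Invocation count and duration are assumed for illustration.
INVOCATIONS=10000000   # 10M invocations/month
DURATION_MS=200        # average billed duration

lambda_cost () {
  awk -v n="$INVOCATIONS" -v d="$DURATION_MS" -v m="$1" \
    'BEGIN { printf "%.2f", n * (d / 1000) * (m / 1024) * 0.0000166667 }'
}

COST_1024=$(lambda_cost 1024)
COST_256=$(lambda_cost 256)
echo "1024 MB: \$${COST_1024}/month   256 MB: \$${COST_256}/month"
```

The caveat from the bullet above applies: the sketch holds duration constant, which is only realistic if the function is memory-bound rather than CPU-bound. Measure before and after.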

The fix: Right-size in phases. Start with the biggest instances, reduce by one size, monitor for two weeks, then repeat. For Lambda, use AWS Lambda Power Tuning (an open-source tool) to find the optimal memory setting for each function. For ECS, enable Container Insights and compare actual resource utilization to task definition allocations.

A typical right-sizing exercise across a fleet of 50-100 instances yields 25-40% compute cost reduction with zero performance impact.

5. Missing Savings Plans and Reserved Instances

Once you have eliminated waste, the next step is to commit to your baseline usage. On-Demand pricing is the most expensive way to run on AWS, and most production workloads have a predictable baseline that makes commitment discounts a no-brainer.

The savings are significant:

Commitment                                 1-Year    3-Year
Compute Savings Plan (No Upfront)          17-20%    30-36%
Compute Savings Plan (All Upfront)         28-33%    43-50%
EC2 Instance Savings Plan (All Upfront)    36-40%    55-60%
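To put those percentages in dollar terms, here is a sketch at the low end of the 1-Year No Upfront range — the $12.50/hour baseline is an assumed figure, not a benchmark:

```shell
# Annual savings from covering a steady baseline with a 1-year No Upfront
# Compute Savings Plan at ~20% off On-Demand. Baseline spend is assumed.
ON_DEMAND_HOURLY=12.50   # $/hour of steady-state On-Demand compute
DISCOUNT=0.20            # low end of the 17-20% range above

ANNUAL=$(awk -v od="$ON_DEMAND_HOURLY" 'BEGIN { printf "%.0f", od * 8760 }')
SAVED=$(awk -v a="$ANNUAL" -v d="$DISCOUNT" 'BEGIN { printf "%.0f", a * d }')
echo "On-Demand: \$${ANNUAL}/year  Savings Plan saves: \$${SAVED}/year"
```

Note the commitment is denominated in $/hour, so it only pays off on usage that genuinely runs around the clock — commit to your floor, not your average.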

How to check your Savings Plan coverage:

# Check current Savings Plan utilization
aws ce get-savings-plans-utilization \
  --time-period Start=2026-02-01,End=2026-03-01

# Get Savings Plan purchase recommendations
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days SIXTY_DAYS

The fix: Start with Compute Savings Plans — they are the most flexible because they apply across EC2, Fargate, and Lambda regardless of instance family, size, OS, or region. Only after maximizing Compute Savings Plans should you consider EC2 Instance Savings Plans for workloads you know will not change instance families.

A common mistake is buying Reserved Instances for resources you plan to right-size. Always right-size first, stabilize for 2-4 weeks, then commit.

Where to Start: The One-Hour Cost Audit

If you want to get a quick picture of where your money is going, here is the fastest path:

  1. Open Cost Explorer and group by Service for the last 3 months. Identify the top 5 services by spend.
  2. Switch to Usage Type grouping within your top service. This reveals whether you are paying for compute, data transfer, storage, or API calls.
  3. Run the CLI commands above for NAT Gateway, idle resources, and Compute Optimizer.
  4. Check Savings Plan coverage — if your coverage is below 60%, you are leaving money on the table.
  5. Tag analysis — go to Cost Allocation Tags and check how many of your resources are untagged. Untagged resources are invisible to cost attribution, which means no one is accountable for them.
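Step 4's 60% threshold is a one-line calculation once you have the two inputs; in practice they come from aws ce get-savings-plans-coverage, and here they are stubbed with assumed figures:

```shell
# Savings Plan coverage = covered spend / total SP-eligible compute spend.
# Both figures are stubs; pull real ones from `aws ce get-savings-plans-coverage`.
COVERED_SPEND=6400    # $ covered by Savings Plans this month (assumed)
ELIGIBLE_SPEND=13000  # $ total SP-eligible compute spend (assumed)

PCT=$(awk -v c="$COVERED_SPEND" -v t="$ELIGIBLE_SPEND" \
  'BEGIN { printf "%.1f", 100 * c / t }')
echo "Coverage: ${PCT}% (target: 60%+)"
```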

In my experience, this one-hour exercise typically reveals 20-40% in potential savings. The hard part is not finding the waste — it is building the organizational muscle to prevent it from coming back.

When to Bring in Help

If your monthly AWS spend exceeds $20K and you do not have a dedicated FinOps practice, you are almost certainly overspending by at least 30%. A structured cost optimization engagement typically pays for itself within the first month.

I offer a free 30-minute consultation where we review your Cost Explorer dashboard together and I identify the top three areas for immediate savings. No commitment, no sales pitch — just a straightforward look at the numbers.

Book a free consultation and let's find out where your AWS budget is actually going.
