AWS Well-Architected Review: What I Look For (And What I Almost Always Find)
2026-05-11 · 10 min read
The AWS Well-Architected Framework is a structured way to evaluate cloud workloads against best practices across six pillars: operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability. AWS offers a free tool in the console to run through the review questions yourself, and many teams do exactly that — checking boxes, noting a few areas for improvement, and filing the results somewhere they will never look at again.
That is not how I run a Well-Architected Review. When I conduct a review for a client, I go beyond the questionnaire. I read CloudFormation templates, inspect IAM policies, review CloudTrail logs, analyze CloudWatch metrics, and test disaster recovery procedures. The review typically takes 3-5 days and results in a prioritized action plan with specific remediation steps.
After conducting these reviews for companies ranging from 5-person startups to 200-person engineering organizations, I can tell you that the same five findings appear in nearly every account. Let me walk through each one.
The Six Pillars (Brief Overview)
Before diving into the common findings, here is a quick summary of what each pillar covers:
Operational Excellence — how you run and monitor systems. Are deployments automated? Do you have runbooks for incidents? Can you detect problems before customers do?
Security — identity management, detection, infrastructure protection, data protection. Who has access to what? Are credentials rotated? Is data encrypted?
Reliability — the ability to recover from failures. What happens when an AZ goes down? How long until your system is operational again? Do you test recovery procedures?
Performance Efficiency — using compute resources efficiently. Are you using the right instance types? Is your database choice appropriate for your access patterns?
Cost Optimization — eliminating waste and optimizing spend. I have written extensively about this in other posts, so I will keep the focus here on the non-cost findings.
Sustainability — minimizing environmental impact. Right-sizing and efficient architecture patterns contribute here as well.
Now, the five findings I almost always document.
Finding 1: No Disaster Recovery Plan (Or an Untested One)
This is the most critical finding and the most common. When I ask teams "What happens if your primary region goes down?", the most frequent answer is silence followed by "We have not thought about that."
The teams that do have a DR plan often have one that was written two years ago and never tested. The architecture has evolved, new services have been added, and the DR runbook references resources that no longer exist.
What I look for:
- RTO and RPO definitions — has the business defined how long they can be down (Recovery Time Objective) and how much data they can afford to lose (Recovery Point Objective)? Most teams have not.
- Cross-region replication — are RDS snapshots replicated to another region? Is S3 data replicated? Are AMIs copied?
- Infrastructure as Code — can you recreate the entire environment from code? If your CloudFormation or CDK stacks only exist in one region, recovery means manually recreating infrastructure under pressure.
- DNS failover — is Route 53 configured with health checks and failover routing? Or is the DNS pointed directly at a single load balancer?
- Tested recovery — has anyone actually tried to bring up the system in the DR region in the last 12 months?
The remediation:
Start by defining RTO and RPO with business stakeholders. For most web applications, an RTO of 4 hours and an RPO of 1 hour is acceptable and achievable without enormous cost. Then implement the minimum infrastructure:
- Enable cross-region RDS automated backups or set up read replicas in the DR region
- Enable S3 Cross-Region Replication for critical buckets
- Ensure your IaC can deploy to any region with a parameter change
- Configure Route 53 health checks on your primary endpoint
- Schedule quarterly DR drills — actually bring up the system in the DR region and verify it works
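The S3 Cross-Region Replication step above can be sketched as a replication configuration. The bucket names, account ID, and role ARN are placeholders; versioning must be enabled on both the source and destination buckets before this is applied.

```json
{
  "Role": "arn:aws:iam::111111111111:role/s3-replication-role",
  "Rules": [
    {
      "ID": "replicate-critical-data",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {},
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Destination": {
        "Bucket": "arn:aws:s3:::myapp-backups-eu-central-1",
        "StorageClass": "STANDARD_IA"
      }
    }
  ]
}
```

Saved as replication.json, this would be applied with aws s3api put-bucket-replication --bucket myapp-data --replication-configuration file://replication.json (bucket name again a placeholder). Replicating to STANDARD_IA keeps the DR copy cheaper than the primary.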
A DR plan you have never tested is not a DR plan. It is a hope.
Finding 2: Over-Permissive IAM Policies
The principle of least privilege is universally agreed upon and almost universally violated. In practice, here is what I typically find:
- Application roles with * resource ARNs — an ECS task role that can s3:GetObject on every bucket in the account, not just the one it needs
- Lambda execution roles with AmazonDynamoDBFullAccess — a managed policy that grants read, write, delete, and admin permissions when the function only reads from a single table
- Developer IAM users with AdministratorAccess — because setting up granular policies was "too complicated" during the initial setup
- Unused IAM roles and users — former employees or decommissioned services with active credentials
How I assess IAM posture:
# Find IAM policies with wildcard resources
aws iam list-policies --scope Local \
  --query 'Policies[*].Arn' --output text | \
  tr '\t' '\n' | while read -r arn; do
    version=$(aws iam get-policy --policy-arn "$arn" \
      --query 'Policy.DefaultVersionId' --output text)
    aws iam get-policy-version --policy-arn "$arn" \
      --version-id "$version" \
      --query 'PolicyVersion.Document'
  done
# Find IAM users with console access who have not logged in for 90+ days
aws iam generate-credential-report && \
aws iam get-credential-report \
--query 'Content' --output text | base64 -d
The credential report shows the last login date, last API access date, and MFA status for every IAM user. Any user who has not logged in for 90 days should be reviewed and likely deactivated.
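To make the 90-day check concrete, here is a minimal sketch of filtering the decoded report. The inline CSV is illustrative sample data with a simplified subset of the report's columns; in practice you would pipe in the base64-decoded output of the command above. Real reports also use the value no_information for users who never logged in, which this sketch does not flag.

```shell
# Cutoff date 90 days ago (GNU date, with a BSD/macOS fallback)
cutoff=$(date -u -d '90 days ago' +%Y-%m-%d 2>/dev/null || date -u -v-90d +%Y-%m-%d)

# Illustrative sample of the credential report (simplified columns):
# user, arn, user_creation_time, password_enabled, password_last_used
# alice's login date is deliberately far in the future so she never counts as stale
report='user,arn,user_creation_time,password_enabled,password_last_used
alice,arn:aws:iam::111111111111:user/alice,2023-01-01T00:00:00+00:00,true,2099-01-02T09:00:00+00:00
bob,arn:aws:iam::111111111111:user/bob,2023-01-01T00:00:00+00:00,true,2023-06-01T09:00:00+00:00'

# ISO-8601 timestamps sort correctly as strings, so a lexical comparison
# against the cutoff is enough to find stale console users
stale=$(echo "$report" | awk -F, -v cutoff="$cutoff" \
  'NR > 1 && $4 == "true" && $5 < cutoff { print $1 }')
echo "$stale"
```

With the sample data, only bob is flagged; in a real review, each flagged user gets a row in the findings report with a recommended deactivation date.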
The remediation:
- Replace managed policies with custom policies scoped to specific resource ARNs. Yes, this takes time. Yes, it is worth it.
- Enable IAM Access Analyzer — it identifies resources shared outside your account and generates least-privilege policy recommendations based on actual API activity.
- Enforce MFA for all human users, especially those with console access.
- Implement permission boundaries for developer roles — they can create resources but cannot escalate their own privileges.
- Set up quarterly access reviews — review the credential report and deactivate stale users and roles.
# Enable IAM Access Analyzer
aws accessanalyzer create-analyzer \
--analyzer-name account-analyzer \
--type ACCOUNT
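As an example of the first remediation step, a custom policy scoped to a single bucket might look like the following sketch. The bucket name is a placeholder; note that s3:GetObject applies to object ARNs (the /* suffix), while s3:ListBucket applies to the bucket ARN itself.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnlyFromOneBucket",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::myapp-assets/*"
    },
    {
      "Sid": "ListOneBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::myapp-assets"
    }
  ]
}
```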
Finding 3: Missing Encryption at Rest
AWS makes encryption at rest straightforward for most services — often it is a single parameter. Yet I regularly find:
- RDS instances with StorageEncrypted: false — the default for RDS is unencrypted, and you cannot enable encryption on an existing instance without creating an encrypted snapshot and restoring from it
- EBS volumes without encryption — the default is unencrypted unless you enable account-level EBS encryption
- S3 buckets without default encryption — while S3 now encrypts all new objects by default with SSE-S3, older buckets may have objects uploaded before this default was enforced
- DynamoDB tables without encryption — DynamoDB now encrypts all tables by default with AWS-owned keys, but some organizations require customer-managed KMS keys for compliance
- ElastiCache clusters without encryption — encryption at rest is optional and must be enabled at cluster creation time; it cannot be added after the fact
The quickest win:
# Enable default EBS encryption for the account
aws ec2 enable-ebs-encryption-by-default
# Verify it is enabled
aws ec2 get-ebs-encryption-by-default
This single command ensures every new EBS volume in the account is automatically encrypted. It does not affect existing volumes, but it prevents the problem from growing.
The remediation for existing unencrypted resources:
For RDS: create an encrypted snapshot of the unencrypted instance, restore from the encrypted snapshot, update your application to point to the new instance, and decommission the old one. This requires a maintenance window.
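The RDS migration can be sketched as a command sequence. Instance and key names are placeholders; the run wrapper prints each command by default (DRY_RUN=1) so the sequence can be reviewed before executing it with DRY_RUN=0.

```shell
SRC_DB="myapp-prod"             # placeholder: the unencrypted instance
KMS_KEY="alias/rds-encryption"  # placeholder: the KMS key to encrypt with
: "${DRY_RUN:=1}"               # print commands by default; DRY_RUN=0 executes

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# 1. Snapshot the unencrypted instance and wait until it is available
run aws rds create-db-snapshot \
  --db-instance-identifier "$SRC_DB" \
  --db-snapshot-identifier "$SRC_DB-pre-encrypt"
run aws rds wait db-snapshot-available \
  --db-snapshot-identifier "$SRC_DB-pre-encrypt"

# 2. Copy the snapshot with a KMS key; the copy is encrypted
run aws rds copy-db-snapshot \
  --source-db-snapshot-identifier "$SRC_DB-pre-encrypt" \
  --target-db-snapshot-identifier "$SRC_DB-encrypted" \
  --kms-key-id "$KMS_KEY"
run aws rds wait db-snapshot-available \
  --db-snapshot-identifier "$SRC_DB-encrypted"

# 3. Restore a new, encrypted instance; point the application at it,
#    then decommission the old instance
run aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "$SRC_DB-encrypted-new" \
  --db-snapshot-identifier "$SRC_DB-encrypted"
```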
For S3: enable default encryption on all buckets and run a batch operation to re-encrypt existing objects with SSE-KMS if required by compliance.
For ElastiCache: unfortunately, you must create a new encrypted cluster and migrate data. There is no in-place encryption option.
Finding 4: No Cost Allocation Tags
I have written about cost optimization extensively, but one specific finding deserves its own section in the Well-Architected context: the absence of cost allocation tags.
Without tags, you cannot answer basic questions:
- How much does each team spend?
- What is the cost of running service X vs. service Y?
- Which environment (prod vs. dev) consumes more resources?
- Are non-production resources being cleaned up?
What I typically find:
- 40-60% of resources have no tags at all
- The resources that are tagged use inconsistent naming (some say env, others say Environment, others say environment)
- No tag policies are enforced, so tagging is entirely voluntary
The remediation:
- Define a tagging standard — four to six required tags with defined allowed values
- Enable AWS Organizations Tag Policies to enforce the standard
- Activate cost allocation tags in the Billing console
- Use AWS Config rule required-tags to detect untagged resources
- Set a deadline: all resources tagged within 30 days, with weekly compliance reports
# Check tagging compliance with AWS Config
aws configservice put-config-rule \
--config-rule '{
"ConfigRuleName": "required-tags",
"Source": {
"Owner": "AWS",
"SourceIdentifier": "REQUIRED_TAGS"
},
"InputParameters": "{\"tag1Key\":\"Environment\",\"tag2Key\":\"Team\",\"tag3Key\":\"Service\"}"
}'
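The tag policy from step 2 might look like the following sketch, attached to an organizational unit or the organization root in AWS Organizations. The tag keys and allowed values are illustrative; @@assign is the Tag Policy operator for setting values, and enforced_for restricts noncompliant tagging operations only for the resource types listed.

```json
{
  "tags": {
    "Environment": {
      "tag_key": { "@@assign": "Environment" },
      "tag_value": { "@@assign": ["prod", "staging", "dev"] },
      "enforced_for": { "@@assign": ["ec2:instance", "s3:bucket"] }
    },
    "Team": {
      "tag_key": { "@@assign": "Team" }
    }
  }
}
```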
Finding 5: Single-AZ Deployments in Production
Running production workloads in a single Availability Zone is the most straightforward reliability risk to fix, yet I find it more often than you would expect. Common examples:
- RDS Single-AZ — if the AZ experiences an issue, your database is down until AWS resolves it or you restore from a snapshot (which takes 30-60 minutes for a large database)
- ECS/EKS services pinned to one subnet — if that AZ goes down, your containers go down
- Elasticsearch/OpenSearch single-AZ domains — no redundancy for your search infrastructure
- Single NAT Gateway — if it fails, all outbound traffic from private subnets stops
The remediation:
- Enable RDS Multi-AZ for production databases. The cost is approximately 2x the single-AZ price, but the automatic failover (typically 60-90 seconds) is worth it for any production workload.
- Deploy ECS services across at least two subnets in different AZs with a desiredCount of at least 2.
- Use OpenSearch Multi-AZ with standby for production search domains.
- Deploy one NAT Gateway per AZ and configure route tables so each private subnet uses the NAT Gateway in its own AZ.
# Check if an RDS instance is Multi-AZ
aws rds describe-db-instances \
--query 'DBInstances[*].[DBInstanceIdentifier,MultiAZ,AvailabilityZone]' \
--output table
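For any instance the table reports as MultiAZ: False, the fix is a single modify call. The instance name below is a placeholder; the command is built as a string so it can be reviewed before running, since applying the change immediately provisions a standby and triggers a brief failover.

```shell
db="myapp-prod"   # placeholder instance identifier
cmd="aws rds modify-db-instance --db-instance-identifier $db --multi-az --apply-immediately"
echo "$cmd"       # review, then execute with: eval "$cmd"
```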
The ROI of a Well-Architected Review
A common question I get from CTOs and engineering leads: "Is this worth the investment?"
Here is what I tell them. The review itself starts at EUR 2,500, depending on the complexity of the environment. The findings fall into three categories:
- Cost savings — in my experience, the cost optimization findings alone pay for the review within 30 days. A single right-sizing recommendation or a NAT Gateway optimization can save thousands per month.
- Risk reduction — a disaster recovery gap, an over-permissive IAM policy, or unencrypted data at rest represents real business risk. The review quantifies that risk and provides a remediation plan.
- Performance improvement — right-sized instances, optimized database queries, and proper caching configurations improve user experience directly.
Most clients tell me the review pays for itself within the first month through cost savings alone, before accounting for the risk reduction and performance improvements.
Next Steps
If your AWS environment has been running for more than 12 months without a structured review, the five findings I described above are likely present. The question is not whether they exist, but how severe they are.
I conduct Well-Architected Reviews as structured 3-5 day engagements, resulting in a prioritized findings report with specific remediation steps and estimated effort for each item. Every finding includes the business impact — whether that is cost, risk, or performance — so you can make informed decisions about what to fix first.
Book a free consultation to discuss whether a Well-Architected Review is the right starting point for your environment.