How We Cut Our AWS Bill ~45% Without Slowing Anyone Down

We were spending around $30,000 a month on AWS. Not because anyone made a bad call—we were running a legitimate security SaaS at Matters.AI, with several environments, a sizable EKS cluster, managed databases, and message brokers across multiple AWS accounts. The bill felt proportionate to the work.

Then we started actually looking at it.

Over two quarters, we cut the bill to ~$16,500/month—roughly 45%. No features were removed. No one's deploy velocity dropped. The work wasn't glamorous: it was a series of specific, measurable fixes, each one boring in the right way. This post is about those fixes, in the order we found them.

Why top-down cost mandates don't work

"We need to cut cloud costs by 20% next quarter" is a complete non-starter without attribution. When no one can tell you which team is driving which line item, you get theater: engineers shuffling instance types, ops opening a dozen Cost Explorer tabs, and no one actually stopping the bleeding.

We have a multi-account AWS Organization: Management (payer), Production, Staging, Development, Networking, and a few sandboxes. Before we touched a single resource, we had to be able to say "Production is $X/month, Staging is $Y, and here's the service breakdown inside each." That's the only thing that turns a cost mandate into an actionable list.

Cost Explorer's Linked Account dimension is the starting point. Once you can see per-account costs, you break down by Service, and then by Usage Type when a service line is suspiciously large. The Usage Type level is where the actual money is hiding.
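The same drill-down can be scripted. As a minimal sketch, the Python helper below builds a request for Cost Explorer's GetCostAndUsage API, filtered to one linked account and one service and grouped by usage type. The function name, the placeholder account ID, and the choice of the UnblendedCost metric are assumptions for illustration, not part of our tooling.

```python
def usage_type_query(account_id, service, start, end):
    """Build keyword arguments for Cost Explorer's get_cost_and_usage call."""
    return {
        "TimePeriod": {"Start": start, "End": end},  # End date is exclusive
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "Filter": {
            "And": [
                {"Dimensions": {"Key": "LINKED_ACCOUNT", "Values": [account_id]}},
                {"Dimensions": {"Key": "SERVICE", "Values": [service]}},
            ]
        },
        "GroupBy": [{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    }


# Placeholder account ID; substitute the linked account you want to inspect.
query = usage_type_query("111122223333", "EC2 - Other", "2025-11-01", "2025-12-01")
print(query["GroupBy"])
```

With boto3 installed and credentials for the payer account, passing these arguments as `boto3.client("ce").get_cost_and_usage(**query)` should return one group per usage type for the month.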

November 2025—our first full month with this level of visibility—showed a consolidated $24,216:

Account	Monthly Spend
Production	$15,909
Staging	$6,518
Development	$986
Networking	$681
Management	$122 (payer)

Production at $15,909 is the right place to start. Within Production, the top line items over the six-month window were:

Service	6-month total
Amazon Virtual Private Cloud	$19,190
EC2 - Other	$15,523
Amazon ElastiCache	$8,972
AWS Secrets Manager	$7,107
Amazon EC2 Compute	$6,791
Amazon Elastic Load Balancing	$5,321

VPC at $19,190 for six months deserves its own section. So does Secrets Manager at $7,107—that's a number that shouldn't surprise anyone, but usually does.

The six levers

1. 3-Year No-Upfront Compute Savings Plan

This one is arithmetic, not engineering.

We'd been running EC2 and EKS nodes entirely on-demand and spot. In November 2025—before the Savings Plan was active—we consumed 37,560 on-demand instance-hours at $2,822 and 9,706 spot-hours at $2,343 in our Production account alone. That's $5,165 in a single month just for compute.

The Savings Plan principle is simple: identify the floor of your compute usage—the baseline that's never going to zero—and commit to that. The spiky, unpredictable top stays on-demand or spot. We committed $3.214/hr on a 3-year Compute Savings Plan (No Upfront) on November 25, 2025.

The effect by December was immediate. The same Production account that spent $2,822 on on-demand EC2 in November spent $40 on on-demand in December. The Savings Plan covered 23,548 instance-hours automatically.

From the utilization data:

Month	SP Utilization	Net Monthly Savings
Nov 2025	76.4%	$206
Dec 2025	74.9%	$1,023
Jan 2026	84.0%	$1,511
Feb 2026	94.9%	$1,842
Mar 2026	98.9%	$2,247
Apr 2026	87.2%	$1,600

Total net savings over six months: $8,430. Annual run-rate: ~$22,000. The commitment cost is $2,391/month. The on-demand equivalent being displaced averages $3,900–$4,600/month.
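For anyone checking the arithmetic, here is a short Python sketch that recomputes the monthly commitment from the hourly rate and sums the net savings column. It assumes a 31-day month for the commitment figure. The sum lands within a dollar of the quoted total because the table values are rounded.

```python
HOURLY_COMMITMENT = 3.214  # USD/hour, from the Savings Plan purchase

# A 31-day month has 744 hours; shorter months cost proportionally less.
monthly_commitment = HOURLY_COMMITMENT * 24 * 31

# Net monthly savings from the utilization table, Nov 2025 through Apr 2026.
net_savings = [206, 1023, 1511, 1842, 2247, 1600]
six_month_total = sum(net_savings)

print(f"commitment per 31-day month: ${monthly_commitment:,.0f}")  # $2,391
print(f"six-month net savings: ${six_month_total:,}")              # $8,429
```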

The discipline here is resisting the urge to over-commit. We sized the commitment to our actual measured baseline, not our P50 or our gut. The months where utilization dips below 90% (November through January, and again in April) are months where something shifted in our compute mix; those gaps tell you where to tune next.
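A minimal sketch of the sizing idea, in Python: commit to the lowest hour you observe, then measure how much of a given commitment would actually be consumed. The function names and the sample data are hypothetical, and the model simplifies by treating hourly spend as already expressed at Savings Plan rates.

```python
def utilization(hourly_spend, commitment):
    """Fraction of the commitment consumed over the sampled hours.

    hourly_spend: Savings-Plan-eligible spend per hour, at Savings Plan rates.
    """
    used = sum(min(spend, commitment) for spend in hourly_spend)
    return used / (commitment * len(hourly_spend))


def size_to_floor(hourly_spend):
    """Commit to the lowest observed hour: the baseline that never goes away."""
    return min(hourly_spend)


# Toy data, not our real usage: a steady base with daytime peaks.
sample = [2.0, 2.0, 3.0, 5.0, 4.0, 2.0]

floor = size_to_floor(sample)
print(utilization(sample, floor))  # a floor-sized commitment is fully used
print(utilization(sample, 4.0))    # over-committing leaves paid-for hours idle
```

In practice you would feed this several weeks of hourly data and likely pick something slightly above the absolute minimum; the point is that every dollar committed above the floor is only partly used.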

2. Per-AZ NAT Gateways

This one is a silent tax that almost no one catches until they break down EC2 - Other by Usage Type.

Our VPC module had a single NAT Gateway, pinned to the first public subnet:

# BEFORE — single NAT, single AZ
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "this" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id  # always ap-south-1a
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.this.id
  # one route table for ALL private subnets
}

Our VPC spans ap-south-1a and ap-south-1b. When a pod running in ap-south-1b needs to reach the internet, its traffic crosses to ap-south-1a to hit the NAT, then goes out. AWS charges $0.01/GB in each direction for cross-AZ traffic, so the request pays to cross the AZ boundary on the way out, and the response pays again on the way back.

The bill over six months: $9,084 in NAT processing (data throughput fees across both accounts) plus $5,242 in cross-AZ regional data transfer. That's $1,514/month in NAT costs and $874/month in avoidable inter-zone transfer. The fix is one NAT per AZ, with one route table per AZ pointing to the local NAT:

# AFTER — one NAT per AZ, separate route tables
resource "aws_eip" "nat" {
  count  = length(var.azs)
  domain = "vpc"

  tags = merge(var.tags, {
    Name = "${var.name}-nat-eip-${count.index}"
  })
}

resource "aws_nat_gateway" "this" {
  count         = length(var.azs)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id

  tags = merge(var.tags, {
    Name = "${var.name}-nat-gw-${count.index}"
  })
}

resource "aws_route_table" "private" {
  count  = length(var.azs)
  vpc_id = aws_vpc.this.id

  tags = merge(var.tags, {
    Name = "${var.name}-private-rt-${count.index}"
  })
}

resource "aws_route" "private_nat" {
  count                  = length(var.azs)
  route_table_id         = aws_route_table.private[count.index].id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.this[count.index].id
}

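resource "aws_route_table_association" "private" {
  count     = length(var.private_subnets)
  subnet_id = aws_subnet.private[count.index].id
  # Assumes private subnets are ordered to cycle through var.azs (a, b, a, b, ...).
  # The modulo keeps the index in range when there are more subnets than AZs.
  route_table_id = aws_route_table.private[count.index % length(var.azs)].id
}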

You pay slightly more in hourly NAT fees (two gateways instead of one), but the cross-AZ transfer cost drops to near zero for NAT-bound traffic. At our traffic volumes, this nets out to a meaningful reduction each month.
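To make the trade-off concrete, here is a rough Python sketch of the net effect. The hourly NAT Gateway rate is a placeholder assumption, not a quoted price, so check the rate for your region. The $874/month figure is the inter-zone transfer cost from the six-month bill above, treated here as fully avoidable, which makes the result an upper bound.

```python
HOURS_PER_MONTH = 730  # average month


def net_monthly_saving(extra_gateways, nat_hourly_usd, cross_az_saved_usd):
    """Cross-AZ transfer avoided minus the hourly cost of the added gateways."""
    added_hourly_cost = extra_gateways * nat_hourly_usd * HOURS_PER_MONTH
    return cross_az_saved_usd - added_hourly_cost


# Assumption: 0.05 USD/hour is a placeholder rate for one extra NAT Gateway.
saving = net_monthly_saving(extra_gateways=1, nat_hourly_usd=0.05, cross_az_saved_usd=874)
print(round(saving, 2))
```

Even if the real hourly rate is somewhat higher than the placeholder, the added gateway costs tens of dollars a month against hundreds in transfer charges.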

The reason this stays hidden so long: it never shows up as a named line item. You see "EC2 - Other: $X" and you move on. You have to drill into Usage Type and look specifically for NatGateway-Bytes alongside DataTransfer-Regional-Bytes in the same month.

3. MongoDB Atlas: public endpoint → VPC peering

When we first deployed Atlas, we used its public endpoint. The database cluster had a public hostname; our EKS pods resolved that hostname, connected over port 27017, and the traffic path was:

pod (private subnet) → NAT Gateway → internet → Atlas public endpoint

Every byte of MongoDB traffic paid the NAT data-processing fee ($0.045/GB in ap-south-1) plus internet egress ($0.09/GB for the first 10 TB/month). For a database-heavy application, that adds up fast.

VPC peering eliminates both charges. Atlas provisions a peer VPC on its side; you accept the peering request on yours, and then you route the Atlas CIDR directly through the peering connection:

resource "aws_route" "mongodb_peer" {
  route_table_id            = var.private_route_table_id
  destination_cidr_block    = var.mongodb_vpc_cidr_block
  vpc_peering_connection_id = var.mongodb_vpc_peering_connection_id
}

Traffic now goes:

pod (private subnet) → VPC peering connection → Atlas cluster

No NAT. No internet. No data-processing fee on our side. The remaining cost (cross-AZ peering transfer) is $0.01/GB per direction—an order of magnitude cheaper than what we were paying before.
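A quick Python sketch of the per-GB comparison, using only the rates quoted in this section. The monthly volume is a hypothetical round number, and the peered path is priced in one direction only.

```python
NAT_PROCESSING = 0.045   # USD/GB, rate quoted above
INTERNET_EGRESS = 0.09   # USD/GB, rate quoted above
PEERING_CROSS_AZ = 0.01  # USD/GB per direction, rate quoted above


def monthly_cost(gb, rate_per_gb):
    """Cost of moving `gb` gigabytes at a flat per-GB rate."""
    return gb * rate_per_gb


public_rate = NAT_PROCESSING + INTERNET_EGRESS
volume_gb = 1000  # hypothetical monthly MongoDB traffic

public_cost = monthly_cost(volume_gb, public_rate)
peered_cost = monthly_cost(volume_gb, PEERING_CROSS_AZ)

print(round(public_cost, 2))                     # public endpoint path
print(round(peered_cost, 2))                     # peered path
print(round(public_rate / PEERING_CROSS_AZ, 1))  # ratio between the two
```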

The security posture improved too. The Atlas cluster could remove its public endpoint entirely. There's no longer a publicly resolvable address that, if misconfigured, could expose the database. Cost and security closed in the same PR.

4. Secrets Manager: orphaned secrets and the IRSA path

AWS Secrets Manager charges $0.40/secret/month plus $0.05 per 10,000 API calls. That sounds cheap until you have ~2,400 secrets and 49 million API calls per month.

Our production account's Secrets Manager bill from the six-month period: $7,107, growing from $893/month in November to $1,774/month in April. The growth was the warning sign—something was accumulating secrets faster than we were cleaning them up.

Breaking down the Usage Type:

Usage Type	6-month cost	Monthly average
`APS3-AWSSecretsManager-Secrets` (storage)	$5,636	$939
`APS3-AWSSecretsManagerAPIRequest`	$1,471	$245

At $0.40/secret/month, $939/month implies roughly 2,350 active secrets on average. The culprit: EventBridge API Destination connections. Our platform creates an EventBridge connection per customer integration, and AWS automatically creates a managed secret for each connection's credentials. When a customer offboarded or a connection was deprecated, the connection was sometimes left in place—along with its secret.
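The back-of-envelope numbers are easy to reproduce. This Python sketch derives the implied secret count from the storage line and the API charge from the call volume, using only the figures above.

```python
SECRET_PRICE = 0.40        # USD per secret per month
API_PRICE_PER_10K = 0.05   # USD per 10,000 API calls

storage_monthly = 5636 / 6                 # six-month storage cost, averaged
implied_secrets = storage_monthly / SECRET_PRICE

api_calls = 49_000_000                     # calls per month
api_monthly = api_calls / 10_000 * API_PRICE_PER_10K

print(round(implied_secrets))  # about 2,350 secrets
print(round(api_monthly))      # about 245 USD/month in API requests
```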

The fix is a two-step audit:

1. List all EventBridge connections in the account. For each one, verify there's an active integration still referencing it. Delete orphaned connections. AWS cascades the secret deletion automatically.
2. For any remaining Secrets Manager usage that's just storing application credentials (database passwords, API keys), evaluate whether it can be replaced with IRSA.
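The first step of that audit can be sketched in a few lines of Python. The function and the sample names are hypothetical. The connection dicts mirror the shape EventBridge's ListConnections returns under "Connections", and the set of active names stands in for wherever your platform tracks live integrations.

```python
def find_orphaned_connections(connections, active_names):
    """Return connections that no active integration references.

    connections: list of dicts with at least a "Name" key.
    active_names: set of connection names still in use (hypothetical source).
    """
    return [c for c in connections if c["Name"] not in active_names]


# Toy data with made-up names.
connections = [
    {"Name": "acme-webhook"},
    {"Name": "globex-webhook"},
    {"Name": "initech-webhook"},
]
active = {"acme-webhook", "initech-webhook"}

orphans = find_orphaned_connections(connections, active)
for conn in orphans:
    print("orphaned:", conn["Name"])
```

Against a real account, the connection list would come from `boto3.client("events").list_connections()` (following `NextToken` for pagination). Review the orphan list by hand before calling `delete_connection`.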

IRSA (IAM Roles for Service Accounts) lets a Kubernetes pod assume an IAM role with zero credentials stored anywhere. For AWS-native services (S3, DynamoDB, SQS, ECR), you stop storing access keys in Secrets Manager entirely. On EKS, once the cluster's OIDC provider is registered with IAM, you just need the service account annotation and an IAM role:

resource "aws_iam_role" "service_role" {
  name = "${var.app_name}-${var.environment}"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Federated = var.oidc_provider_arn
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "${var.oidc_issuer}:sub" = "system:serviceaccount:${var.namespace}:${var.app_name}"
          "${var.oidc_issuer}:aud" = "sts.amazonaws.com"
        }
      }
    }]
  })
}

# Kubernetes ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-service
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::471112501839:role/my-service-prod

The pod gets short-lived credentials via the OIDC token. Nothing stored in Secrets Manager. The Secrets Manager API call count drops proportionally, which also cuts the API request billing.

Secrets Manager is one of those services where the cost creep is invisible until you look—$0.40/secret/month feels like a rounding error until you have 2,000 of them.

5. EKS Auto mode for scan workloads

Our platform runs thousands of ephemeral security-scan pods. They're triggered by customer events (a code push, a new repository connected, a scheduled scan cycle), run for minutes to hours, then terminate. The workload profile is textbook Spot: short-lived, interruptible, no persistent state.

EKS Auto mode handles this transparently. With compute_config.enabled = true, EKS selects instance types and capacity from a fleet of 30+ instance families—matching workload resource requests to the cheapest available capacity at that moment. For our scan queues (exposure-scan, endpoint-agent-scan, OPA evaluation), this means the scheduler naturally gravitates toward unused capacity at spot-equivalent pricing without us maintaining separate node group configurations per instance family.

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 21.0"

  name               = var.cluster_name
  kubernetes_version = "1.34"

  compute_config = {
    enabled       = true
    node_pools    = ["system", "general-purpose"]
    node_role_arn = var.eks_node_role_arn
  }

  # ...
}

What this looked like in practice: in November 2025, before EKS Auto mode was fully calibrated, we consumed 9,706 spot-hours at $2,343 alongside 37,560 on-demand hours at $2,822. After Auto mode with a dialed-in Savings Plan covering the baseline, the on-demand hours dropped by ~90%. The scan workloads that previously required explicit Spot node groups now run on Auto-selected capacity with no YAML changes on the application side.

The management overhead for diverse instance fleets shows up in Cost Explorer as EKS-Auto:{instance_type}-management-hours—currently around $718/month across 30+ instance types. That's the tradeoff: EKS does the bin-packing, you pay for the orchestration overhead. At our scale, the compute savings more than offset it.

6. Public IPv4 address audit

In February 2024, AWS started charging $0.005/hour (~$3.60/month) for every public IPv4 address—in-use or idle. Before that change, public IPs were effectively free, which meant nobody tracked them.

We ran an audit across our accounts and found two categories of waste. First, idle addresses: EIPs allocated but not attached to anything. These are pure money sinks—$3.60/month each with zero traffic flowing through them. Second, in-use addresses attached to resources that didn't need to be public at all: NAT gateways (necessary), load balancers (necessary), but also a handful of EC2 instances and old network interfaces that had accumulated over time.

From Cost Explorer, Usage Type PublicIPv4:IdleAddress is the tell. Production alone ran $585 in-use and a small idle tail over six months. Across all accounts with active workloads—Production, Staging, Development, Networking—the org-wide total came to roughly $1,044 over six months (~$174/month). Not the biggest lever, but the fix is a one-time CLI sweep and a tagging policy to catch new ones:

# Find idle EIPs in a region
aws ec2 describe-addresses \
  --query 'Addresses[?AssociationId==null].[AllocationId,PublicIp,Tags]' \
  --output table

# Release an idle EIP
aws ec2 release-address --allocation-id eipalloc-0abc123def456
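If you would rather post-process the CLI output than rely on the JMESPath filter, a small Python sketch like the one below does the same job and estimates the monthly cost of what it finds. The sample JSON is made up, in the shape of `aws ec2 describe-addresses --output json`. The 720-hour month is an assumption chosen to match the ~$3.60 figure.

```python
import json

HOURLY_RATE = 0.005    # USD per public IPv4 address per hour
HOURS_PER_MONTH = 720  # 30-day month


def idle_addresses(describe_addresses_json):
    """Return Elastic IPs with no AssociationId, i.e. not attached to anything."""
    data = json.loads(describe_addresses_json)
    return [a for a in data["Addresses"] if not a.get("AssociationId")]


# Made-up sample output; the IPs are from a documentation-only range.
sample = json.dumps({
    "Addresses": [
        {"AllocationId": "eipalloc-aaa", "PublicIp": "203.0.113.10",
         "AssociationId": "eipassoc-111"},
        {"AllocationId": "eipalloc-bbb", "PublicIp": "203.0.113.11"},
    ]
})

idle = idle_addresses(sample)
idle_monthly_cost = len(idle) * HOURLY_RATE * HOURS_PER_MONTH

print([a["AllocationId"] for a in idle])
print(round(idle_monthly_cost, 2))
```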

The structural fix is a Cost Explorer anomaly alert on PublicIPv4:IdleAddress in each account. Any allocation that sits idle for 48 hours should trigger a notification. We encoded this as a default in our account bootstrap Terraform so new accounts start with the alert already wired up.

The moment we broke something and reversed it

In February 2026, Production's VPC bill spiked from $270 to $16,806 in a single month. The breakdown:

Usage Type	January	February	Delta
`APS3-DataTransfer-Out-Bytes`	$59	$11,982	+$11,922
`APS3-TransitGateway-Bytes`	$14	$2,837	+$2,823
`APS3-VpcEndpoint-Bytes`	$0.22	$1,739	+$1,738

The three usage types above account for roughly $16,480 of the increase; the total month-over-month increase was +$18,685 in Production alone. The culprit was security tooling: we onboarded an AWS-native security scanning platform that performs an initial full-organization asset discovery scan when activated. That initial scan traversed every account, every region, and generated the data transfer and Transit Gateway charges you see above.
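A spike like this is easy to flag mechanically once you have per-usage-type costs for two months. The Python sketch below is a hypothetical helper, not our alerting code. It uses the rounded figures from the table above, so its deltas can differ from the table's Delta column by a dollar.

```python
def month_over_month_spikes(previous, current, min_delta):
    """Usage types whose cost rose by at least min_delta, largest first."""
    deltas = {
        usage_type: current[usage_type] - previous.get(usage_type, 0.0)
        for usage_type in current
    }
    flagged = [(u, d) for u, d in deltas.items() if d >= min_delta]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


january = {
    "APS3-DataTransfer-Out-Bytes": 59,
    "APS3-TransitGateway-Bytes": 14,
    "APS3-VpcEndpoint-Bytes": 0.22,
}
february = {
    "APS3-DataTransfer-Out-Bytes": 11982,
    "APS3-TransitGateway-Bytes": 2837,
    "APS3-VpcEndpoint-Bytes": 1739,
}

spikes = month_over_month_spikes(january, february, min_delta=500)
for usage_type, delta in spikes:
    print(f"{usage_type}: +${delta:,.0f}")
```

The 500-dollar threshold is arbitrary. The useful property is that the check runs per usage type, which is the level where this spike was visible.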

We didn't reverse the tooling—it was necessary for our compliance posture. What we should have done, and what we now do for any security scanner with an "initial scan" phase: scope it to one account first, let it complete, then expand. A $500 initial scan on a single account tells you the per-account cost. An $11,922 surprise in month one tells you the same thing, less pleasantly.

The normalized spend in March and April ($18,628 and $13,838) reflects the ongoing cost of continuous scanning, which is far more predictable. The Feb spike was a one-time bulk scan, not a sustained pattern.

Where we drew the line

Three things we investigated and explicitly chose not to change:

Transit Gateway: $1,489/month across our accounts. Our Transit Gateway connects the AWS VPCs to each other, to a GCP VPN for workloads on GKE, and to Client VPN endpoints for developer access. The cost is real, but the architecture justifies it—flattening this would mean redesigning cross-cloud connectivity that's load-bearing for production traffic.

Client VPN: $337/month for developer access (endpoint hours + connection-hours). We considered SSM Session Manager as a replacement (zero per-hour cost for the endpoint). The operational tradeoff—re-training developers, losing some tooling integrations—wasn't worth the savings at our current team size. This may change as the team grows.

GuardDuty: $576/month at the org level. Non-negotiable for compliance. We did audit threat intel list subscriptions and disabled scanning for inactive accounts, which trimmed it slightly.

Saying "no" to a savings opportunity requires the same analytical rigor as saying "yes." These three were investigated, quantified, and declined.

How savings get encoded into platform defaults

The one thing that makes cost optimization durable rather than a one-time event: when you push savings into the platform layer, every new workload inherits them automatically.

The Savings Plan commitment applies to all compute in the account, regardless of which team spins up which EKS node. Per-AZ NAT gateways are a module-level change—every new VPC provisioned from our module gets the right routing. EKS Auto mode is a cluster-level setting; new deployments don't need any special annotations to benefit from flexible instance selection. IRSA patterns live in our shared Helm chart library; any new service that uses the standard chart gets the right ServiceAccount configuration.

This is the difference between cost optimization as a project (a Jira ticket, a sprint, done) and cost optimization as a platform feature (a property of the infrastructure that survives team turnover and product growth).

We spent two quarters on the project phase. Most of it was drilling through Usage Type breakdowns in Cost Explorer and writing four-line Terraform changes. The platform phase—encoding the fixes into module defaults—is what makes the savings compound.

What we didn't anticipate

Going in, we expected the Savings Plan to be the biggest win. It was meaningful, but the data-transfer costs (NAT Gateway bytes, cross-AZ regional transfer, and the Atlas egress we were paying before peering) turned out to be comparable in scale. These costs are structurally invisible: they don't show up as line items named "NAT Gateway" in the top-level Cost Explorer view. They're buried three levels deep: account → service → usage type.

If you take one thing from this: open Cost Explorer, filter to your biggest account, group by Service, pick your top line item, and then group by Usage Type. Do that for EC2 - Other and VPC. You will find something surprising. It took us two quarters to find and fix everything; it will take you about 20 minutes to find the first one.

Cost is a platform feature, not a cleanup project.

The team running features doesn't stop to think about data-transfer costs because there's no feedback loop connecting their architectural choices to line items on a bill. The platform team's job is to close that loop—first through visibility (attribution, showback, anomaly alerts), then through defaults (module-level choices that make the right thing the easy thing), and finally through encoding those defaults into code so they survive without enforcement.

Forty-five percent over two quarters. The next tranche of savings is already in the backlog.