Secrets Management at Scale: Killing Long-Lived AWS Credentials with OIDC and Vault

We found a 14-month-old AWS_SECRET_ACCESS_KEY hardcoded in a GitHub Actions secret. It belonged to a deploy IAM user with AdministratorAccess. The person who created it had left nine months earlier. The key had never been rotated, and nothing in our tooling would have told us it existed until it surfaced in a credential leak alert.

That was the moment we stopped treating secrets as a configuration problem and started treating them as a lifecycle problem. Static credentials don't get compromised because someone is careless on a single day. They get compromised because they sit in a .env file, a CI variable, a Slack message, and a Terraform state bucket simultaneously, and they never expire. The blast radius grows quietly until something forces you to look.

This post is about how we got to near-zero long-lived credentials across three surfaces: CI/CD pipelines, workloads running in EKS, and developers on laptops. The tools were OIDC federation and HashiCorp Vault dynamic secrets. The tools were not the hard part.


The three places credentials actually live

Before touching anything, we mapped where credentials existed and how long they lived. This is the equivalent of getting per-account cost attribution before cutting a cloud bill. Without the map you are guessing.

We found four categories:

  1. CI/CD credentials. IAM users with access keys stored as GitHub Actions secrets, plus a couple in older Jenkins jobs. These had broad permissions because nobody wanted a pipeline to fail at 2am over a missing IAM action.
  2. Workload credentials. Pods reading database passwords and third-party API keys from Kubernetes Secrets, which are base64-encoded plaintext in etcd and were frequently committed to Helm values files.
  3. Developer credentials. ~/.aws/credentials with personal access keys, plus a shared 1Password vault of API tokens that everyone copied locally.
  4. Service-to-service credentials. Database passwords, Kafka SASL credentials, and internal API tokens passed around as environment variables.

The common thread: the credential was minted once, by a human, and then lived forever. Rotation was a manual ticket that nobody filed. The fix in every case had the same shape: make the credential short-lived and tie its issuance to a verifiable identity instead of a stored secret.


CI/CD: OIDC federation instead of stored keys

The highest-leverage change was killing IAM users in CI. GitHub Actions can present an OIDC token to AWS, and AWS can be configured to trust that token and hand back temporary STS credentials. No stored secret, ever.

The trust is established by registering GitHub's OIDC provider in your AWS account once:

resource "aws_iam_openid_connect_provider" "github" {
  url            = "https://token.actions.githubusercontent.com"
  client_id_list = ["sts.amazonaws.com"]
}

Note: AWS no longer requires you to supply a thumbprint for the GitHub provider; STS validates the certificate against trusted CAs. If your provider or Terraform version still demands one, supply it, but treat the thumbprint as a moving target rather than a constant.

Then you create a role that only this specific provider, from this specific repo and ref, can assume. The condition block is where the security actually lives. Get it wrong and any repo on GitHub can assume your role.

data "aws_iam_policy_document" "deploy_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]
    effect  = "Allow"

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:matters-ai/platform:ref:refs/heads/main"]
    }
  }
}

The sub claim is the part people get wrong. If you use repo:matters-ai/platform:* you trust every branch, every pull request, and every fork-triggered workflow in that repo. A malicious PR could then assume your deploy role. Pin it to the refs that should actually be deploying, and use the environment claim for production:

values = ["repo:matters-ai/platform:environment:production"]
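To see why the wildcard matters, here is a minimal Python sketch that approximates how a StringLike condition would treat a few sub claims. The matcher is a simplified stand-in for IAM's evaluation (only the `*` and `?` wildcards are modeled), and the example claims follow the sub formats GitHub documents for branch, pull request, and environment jobs; the branch name `feature-x` is hypothetical.

```python
import re


def iam_string_like(pattern: str, value: str) -> bool:
    """Approximate IAM StringLike: '*' matches any run of characters, '?' exactly one."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


# Example sub claims; the feature branch name is made up for illustration.
SUBS = {
    "main branch": "repo:matters-ai/platform:ref:refs/heads/main",
    "feature branch": "repo:matters-ai/platform:ref:refs/heads/feature-x",
    "pull request": "repo:matters-ai/platform:pull_request",
    "production environment": "repo:matters-ai/platform:environment:production",
}


def allowed(pattern: str) -> list[str]:
    """Return the names of the example workflows whose sub claim matches the pattern."""
    return [name for name, sub in SUBS.items() if iam_string_like(pattern, sub)]


if __name__ == "__main__":
    for pattern in (
        "repo:matters-ai/platform:*",
        "repo:matters-ai/platform:ref:refs/heads/main",
        "repo:matters-ai/platform:environment:production",
    ):
        print(pattern, "->", allowed(pattern))
```

The wildcard pattern admits all four example workflows, including the pull request; each pinned pattern admits exactly one.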

The workflow side becomes trivial. No secrets stanza at all:

permissions:
  id-token: write
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-deploy
          aws-region: ap-south-1
      - run: aws sts get-caller-identity

The id-token: write permission is mandatory. Without it the OIDC token is never minted and you get an opaque credentials error. We lost an afternoon to that before reading the docs.

What this bought us: every deploy now runs with credentials that live for one hour, are scoped to one role, and are tied to a specific workflow run. CloudTrail shows the assumed-role session named after the run, which ties an AWS API call back to a specific commit and actor. With static keys we had deploy-bot doing everything and no way to attribute anything.

The scoping discipline that comes with it

Moving to OIDC forced a conversation we had been avoiding: what does this pipeline actually need to do? When the credential was a permanent admin key, nobody questioned it. When you write the role policy from scratch you confront the real permission set.

We ended up with separate roles per pipeline stage. The Terraform plan stage gets read-only plus state access. The apply stage gets write permissions, assumable only from the protected branch. The container build stage gets ECR push and nothing else.

resource "aws_iam_role" "ecr_push" {
  name               = "github-ecr-push"
  assume_role_policy = data.aws_iam_policy_document.build_assume.json
}

resource "aws_iam_role_policy" "ecr_push" {
  role = aws_iam_role.ecr_push.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:PutImage",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload"
      ]
      Resource = "*"
    }]
  })
}

This is more roles to manage, but each one is legible. When a security review asks what a pipeline can do, the answer is a fifteen-line policy, not a shrug.
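One way to keep that legibility honest is a small review helper that lists what a policy allows and flags wildcard actions. The sketch below is an illustration, not our actual tooling: `summarize_policy` is a hypothetical name, and the check is deliberately naive (it looks only at Allow statements and at actions that are `*` or end in `:*`).

```python
import json


def summarize_policy(policy: dict) -> dict:
    """Return the allowed actions plus findings a reviewer would likely flag."""
    statements = policy["Statement"]
    if isinstance(statements, dict):  # IAM permits a single statement object
        statements = [statements]
    actions, findings = [], []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        stmt_actions = stmt.get("Action", [])
        if isinstance(stmt_actions, str):  # IAM permits a single action string
            stmt_actions = [stmt_actions]
        for action in stmt_actions:
            actions.append(action)
            if action == "*" or action.endswith(":*"):
                findings.append(f"wildcard action: {action}")
    return {"actions": sorted(actions), "findings": findings}


# The ECR push policy from above, as a Python dict.
ECR_PUSH = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "ecr:GetAuthorizationToken",
            "ecr:BatchCheckLayerAvailability",
            "ecr:PutImage",
            "ecr:InitiateLayerUpload",
            "ecr:UploadLayerPart",
            "ecr:CompleteLayerUpload",
        ],
        "Resource": "*",
    }],
}

if __name__ == "__main__":
    print(json.dumps(summarize_policy(ECR_PUSH), indent=2))
```

Run against the old admin-style policy, the same helper reports a wildcard finding, which is the conversation the static key let us avoid.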


Workloads: IRSA for AWS, Vault for everything else

Pods that talk to AWS got the same treatment as CI, through IAM Roles for Service Accounts (IRSA). EKS runs its own OIDC provider, and you bind a Kubernetes service account to an IAM role through an annotation.

resource "aws_iam_role" "scan_agent" {
  name = "scan-agent"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Federated = aws_iam_openid_connect_provider.eks.arn
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "${local.eks_oidc}:sub" = "system:serviceaccount:dspm:scan-agent"
          "${local.eks_oidc}:aud" = "sts.amazonaws.com"
        }
      }
    }]
  })
}

apiVersion: v1
kind: ServiceAccount
metadata:
  name: scan-agent
  namespace: dspm
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/scan-agent

The pod's SDK then transparently picks up temporary credentials. No keys mounted, no Kubernetes Secret holding an AWS key. The credential rotates automatically and is scoped to that one service account.
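When the pickup does not happen, the usual cause is in the pod environment. On EKS the service account binding surfaces as two environment variables, AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE, which the SDK reads. A minimal diagnostic sketch in Python follows; `irsa_problems` is a hypothetical helper, and the precedence warning assumes the default credential chain of most AWS SDKs, where static keys in the environment are tried before the web identity token.

```python
import os
from collections.abc import Mapping

IRSA_ENV_VARS = ("AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE")
STATIC_KEY_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def irsa_problems(env: Mapping[str, str], token_exists=os.path.exists) -> list[str]:
    """Return human-readable reasons the SDK might not use the IRSA role."""
    problems = [f"{name} is not set" for name in IRSA_ENV_VARS if not env.get(name)]
    token_file = env.get("AWS_WEB_IDENTITY_TOKEN_FILE")
    if token_file and not token_exists(token_file):
        problems.append(f"token file {token_file} does not exist")
    for name in STATIC_KEY_VARS:
        if env.get(name):
            # Assumption: static env keys win in the SDK's default credential chain.
            problems.append(f"{name} is set and may take precedence over the role")
    return problems


if __name__ == "__main__":
    for problem in irsa_problems(os.environ) or ["no problems found"]:
        print(problem)
```

Running this inside the container is a quick first check before digging into the trust policy.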

IRSA solves AWS access. It does nothing for the database password, credentials for non-AWS cloud providers, or the Kafka SASL credentials. That is where Vault earns its place.

Dynamic database credentials

The pattern that changed how we think about secrets was Vault's database secrets engine. Instead of a static password that every pod shares and that we never rotate, Vault creates a unique database user on demand, with a TTL, and revokes it when the lease expires.

Configure the engine once with an admin credential that only Vault holds:

vault secrets enable database

vault write database/config/assets-db \
  plugin_name=postgresql-database-plugin \
  allowed_roles="scanner-ro,scanner-rw" \
  connection_url="postgresql://{{username}}:{{password}}@assets-db.internal:5432/assets" \
  username="vault_admin" \
  password="$ADMIN_PW"

vault write database/roles/scanner-rw \
  db_name=assets-db \
  creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; \
    GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
  default_ttl="1h" \
  max_ttl="24h"

When the scan agent starts, it asks Vault for credentials:

vault read database/creds/scanner-rw

Key                Value
---                -----
lease_id           database/creds/scanner-rw/aBcDeF123
lease_duration     1h
password           A1a-randomly-generated
username           v-token-scanner-x7y2

Every pod gets its own database user, valid for an hour, renewed while the pod runs, and deleted by Vault when the lease ends. If a pod is compromised, the attacker steals a credential that dies in under an hour and is traceable to that exact workload in the Postgres logs. We went from one shared password that we were terrified to rotate to ephemeral users that rotate themselves.

The trade-off is real: you now depend on Vault being available for new pods to get database access. We run Vault in HA with integrated Raft storage across three nodes and three availability zones, and we set lease TTLs long enough (1h with renewal) that a brief Vault outage doesn't immediately take down running workloads. Pods that already hold a valid lease keep working; only new leases and renewals block. That is an acceptable failure mode, but you have to design for it.
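To put rough numbers on that failure mode, here is a back-of-the-envelope sketch in Python. It assumes the renewer refreshes a lease after a fixed fraction of its TTL has elapsed; the two-thirds default is an assumption for illustration, so check what your agent actually does. It also ignores max_ttl, jitter, and retries.

```python
def outage_budget_seconds(ttl_seconds: int, renew_fraction: float = 2 / 3) -> tuple[int, int]:
    """Return (worst_case, best_case) seconds a running pod survives a Vault outage.

    Worst case: Vault goes down just before a renewal was due, so only the
    unrenewed tail of the lease remains. Best case: the lease was just renewed.
    """
    if not 0 < renew_fraction < 1:
        raise ValueError("renew_fraction must be between 0 and 1")
    worst_case = round(ttl_seconds * (1 - renew_fraction))
    best_case = ttl_seconds
    return worst_case, best_case


if __name__ == "__main__":
    for ttl in (15 * 60, 60 * 60):
        worst, best = outage_budget_seconds(ttl)
        print(f"ttl={ttl // 60}m: survives {worst // 60}m to {best // 60}m of outage")
```

Under these assumptions a 1h TTL tolerates an outage of at least about 20 minutes for running pods, while a 15m TTL tolerates only about 5.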

How pods authenticate to Vault

There is a chicken-and-egg problem: to get a secret from Vault, the pod needs to authenticate to Vault, which requires a credential. We solve it with the Kubernetes auth method, which uses the pod's service account token as the identity. Vault verifies that token against the cluster's TokenReview API.

vault auth enable kubernetes

vault write auth/kubernetes/config \
  kubernetes_host="https://kubernetes.default.svc"

vault write auth/kubernetes/role/scan-agent \
  bound_service_account_names=scan-agent \
  bound_service_account_namespaces=dspm \
  policies=scanner-secrets \
  ttl=1h

The pod's identity is the service account, issued by Kubernetes and verified by Kubernetes. No bootstrap secret to manage. The Vault Agent Injector mutating webhook handles the actual login and secret rendering, so application code doesn't need a Vault SDK:

annotations:
  vault.hashicorp.com/agent-inject: "true"
  vault.hashicorp.com/role: "scan-agent"
  vault.hashicorp.com/agent-inject-secret-db: "database/creds/scanner-rw"
  vault.hashicorp.com/agent-inject-template-db: |
    {{- with secret "database/creds/scanner-rw" -}}
    DATABASE_URL=postgresql://{{ .Data.username }}:{{ .Data.password }}@assets-db.internal:5432/assets
    {{- end -}}

The injected sidecar writes the rendered file to a shared in-memory volume and keeps the lease renewed. The application reads a file. It never knows Vault exists. That last property mattered for adoption: teams did not have to rewrite anything to onboard.
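For a sense of what the sidecar does on the pod's behalf, the login step is a single HTTP call to the Kubernetes auth method's login endpoint with the role name and the service account JWT. A minimal Python sketch using only the standard library follows; the Vault address in the usage note is hypothetical, and a real client would also handle TLS trust, retries, and token renewal.

```python
import json
import urllib.request

# Default location of the projected service account token inside a pod.
SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def build_login_request(vault_addr: str, role: str, jwt: str) -> urllib.request.Request:
    """Build the POST to the Kubernetes auth login endpoint."""
    body = json.dumps({"role": role, "jwt": jwt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{vault_addr.rstrip('/')}/v1/auth/kubernetes/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def login(vault_addr: str, role: str, token_path: str = SA_TOKEN_PATH) -> str:
    """Exchange the pod's service account token for a Vault client token."""
    with open(token_path) as f:
        jwt = f.read().strip()
    request = build_login_request(vault_addr, role, jwt)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["auth"]["client_token"]
```

Inside a pod bound to the scan-agent service account, `login("https://vault.example.internal:8200", "scan-agent")` would return a token carrying the scanner-secrets policy, provided the role binding above is in place.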


Developers: short-lived everything

The hardest surface was developer laptops, because you cannot mandate behavior change without making the secure path easier than the insecure one. Telling people to stop putting access keys in ~/.aws/credentials does nothing if the secure alternative is more typing.

We standardized on AWS IAM Identity Center (formerly SSO) for console and CLI access. The CLI config uses SSO sessions, so the credential is a short-lived token tied to the user's SSO login, not a permanent key:

[sso-session matters]
sso_start_url = https://matters.awsapps.com/start
sso_region = ap-south-1
sso_registration_scopes = sso:account:access

[profile dev]
sso_session = matters
sso_account_id = 123456789012
sso_role_name = Developer
region = ap-south-1

aws sso login --profile dev

This pops a browser, authenticates against our IdP, and caches a token that expires. No access key ever lands on disk. When someone leaves, deactivating their IdP account kills all their AWS access immediately. With static keys, offboarding meant hunting through IAM for users that person might have created, which we now know we did badly.

For application secrets developers genuinely need locally, like a sandbox API key for testing, they pull from Vault with their own short-lived token, authenticated through the same OIDC IdP:

vault login -method=oidc
vault read sandbox/data/cloud-creds-test

The Vault policy scopes developers to sandbox paths only. They physically cannot read production secrets, which removes the temptation to copy a production credential locally to debug something. If you make the production secret unreachable, you have eliminated a whole class of incident.


The migration was the hard part

The technology decisions above took maybe two weeks of design. The migration took two quarters, because you cannot flip a switch on a running platform. Every existing credential had a consumer, and you have to cut over each consumer without breaking it.

We ran it as a strangler migration. The sequence that worked:

  1. Inventory and tag. We tracked every IAM access key with a migrated flag and instrumented CloudTrail to alert on any use of a non-migrated key. This gave us ground truth on what was actually used versus what was sitting dormant. About a third of our access keys had not been used in 90 days. Those we disabled first and waited for screaming. Almost none came.
  2. Build the new path alongside the old. For each workload, we stood up the IRSA role or Vault role and deployed it to staging while production still used the old key. We verified the new path under load before touching production.
  3. Cut over and watch. We switched production to the new path with the old key still active as a fallback. CloudTrail told us whether anything still used the old key. After 72 hours of zero usage, we disabled the key but did not delete it.
  4. Delete after a soak period. Two weeks after disabling with no incident, we deleted the key for good.

The alert on dormant-but-existing keys was the most valuable instrument. A disabled key that nothing uses is a measurement, not a guess. We were not asking "is it safe to remove this?" We were watching whether anything depended on it, with logs.
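The sequence above reduces to a small decision function per key. The sketch below is an illustration in Python, not our actual tooling: the thresholds come from the steps described, while the function name, the action labels, and the idea of storing cutover and disable timestamps alongside each key are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DORMANT_AFTER = timedelta(days=90)          # unused this long: disable first
QUIET_BEFORE_DISABLE = timedelta(hours=72)  # zero usage after cutover
SOAK_BEFORE_DELETE = timedelta(days=14)     # disabled with no incident


def next_action(
    now: datetime,
    last_used: Optional[datetime],
    cutover_at: Optional[datetime] = None,
    disabled_at: Optional[datetime] = None,
) -> str:
    """Return the next migration step for one access key."""
    if disabled_at is not None:
        return "delete" if now - disabled_at >= SOAK_BEFORE_DELETE else "soak"
    if last_used is None or now - last_used >= DORMANT_AFTER:
        return "disable"  # dormant key: disable and wait for screaming
    if cutover_at is None:
        return "build-new-path"  # still in use and no replacement yet
    # The quiet period restarts whenever the old key is used after cutover.
    quiet_since = max(cutover_at, last_used)
    return "disable" if now - quiet_since >= QUIET_BEFORE_DISABLE else "watch"


if __name__ == "__main__":
    now = datetime(2024, 6, 1, tzinfo=timezone.utc)
    print(next_action(now, last_used=now - timedelta(days=120)))
    print(next_action(now, last_used=now - timedelta(hours=1),
                      cutover_at=now - timedelta(days=4)))
```

The important branch is the last one: a single use of the old key after cutover resets the clock, so a key is only disabled on evidence of silence.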

The detection that keeps it from regressing

Killing the credentials once is worthless if they grow back. People create access keys under deadline pressure. We added two guardrails.

First, a Service Control Policy at the Organization level that denies creation of IAM access keys for anything except a small allowlist of break-glass identities:

{
  "Sid": "DenyAccessKeyCreation",
  "Effect": "Deny",
  "Action": "iam:CreateAccessKey",
  "Resource": "*",
  "Condition": {
    "StringNotLike": {
      "aws:PrincipalArn": "arn:aws:iam::*:role/break-glass-*"
    }
  }
}

An SCP cannot be overridden by an account admin, which is exactly why it works. The person under deadline pressure literally cannot create the key. They have to use the federated path because it is the only path that works.

Second, a nightly job that scans for any IAM access keys in production accounts and any key older than 90 days, then posts the results to a security channel with the creating principal named. Detection without attribution is noise. We name the owner.
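The core of such a job is a filter over key metadata. A minimal Python sketch, under stated assumptions: the `UserName`, `AccessKeyId`, and `CreateDate` fields mirror what IAM's ListAccessKeys returns, while `AccountId` and `CreatedBy` are hypothetical fields that a collector would have to attach (for example from CloudTrail CreateAccessKey events). Fetching the data and posting to the channel are left out.

```python
from datetime import datetime, timedelta, timezone

MAX_KEY_AGE = timedelta(days=90)


def flag_keys(keys: list[dict], production_accounts: set[str], now: datetime) -> list[dict]:
    """Return one finding per key that violates either rule, with the owner named."""
    findings = []
    for key in keys:
        reasons = []
        if key["AccountId"] in production_accounts:
            reasons.append("access key in a production account")
        if now - key["CreateDate"] > MAX_KEY_AGE:
            reasons.append("older than 90 days")
        if reasons:
            findings.append({
                "AccessKeyId": key["AccessKeyId"],
                "UserName": key["UserName"],
                # Attribution is the point: a finding without an owner is noise.
                "Owner": key.get("CreatedBy", "unknown"),
                "Reasons": reasons,
            })
    return findings


if __name__ == "__main__":
    now = datetime(2024, 6, 1, tzinfo=timezone.utc)
    sample = [{
        "AccountId": "123456789012",
        "UserName": "deploy-bot",
        "AccessKeyId": "AKIAEXAMPLEKEY",  # made-up identifier
        "CreateDate": now - timedelta(days=400),
        "CreatedBy": "arn:aws:iam::123456789012:user/former-employee",
    }]
    for finding in flag_keys(sample, {"123456789012"}, now):
        print(finding)
```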


What we got wrong

We initially set Vault database lease TTLs to 15 minutes, reasoning that shorter is more secure. It nearly took down an asset scan job that ran for 40 minutes and held a connection pool. When the lease expired mid-job, Vault revoked the user in Postgres and in-flight queries on those connections failed. Short TTLs are good until they collide with long-running work. We moved to a 1h default with 24h max and renewal, and made sure the Vault Agent sidecar renewed leases before expiry. The lesson: the credential lifetime has to exceed the longest unit of work that holds it, not the average.
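The check we should have run up front fits in a few lines. The Python sketch below is illustrative: the job durations are invented, chosen so that the average fits inside a 15-minute lease while the longest job does not, and it assumes a renewing lease can live up to max_ttl while a non-renewing one lives only for the default TTL.

```python
from statistics import mean


def lease_covers_longest_job(
    job_minutes: list[float],
    default_ttl_min: float,
    max_ttl_min: float,
    renewing: bool,
) -> bool:
    """True if the usable credential lifetime exceeds the longest observed job."""
    usable = max_ttl_min if renewing else default_ttl_min
    return usable > max(job_minutes)


if __name__ == "__main__":
    jobs = [5, 6, 7, 40]  # hypothetical durations in minutes
    print("average job:", mean(jobs), "min; longest:", max(jobs), "min")
    print("15m, no renewal:", lease_covers_longest_job(jobs, 15, 15, renewing=False))
    print("1h/24h, renewal:", lease_covers_longest_job(jobs, 60, 24 * 60, renewing=True))
```

Sizing against the average would have approved the 15-minute TTL; sizing against the maximum rejects it.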

We also underestimated how much teams relied on copying production secrets to debug. When we cut off that access, we got a wave of "how am I supposed to reproduce this bug?" The honest answer was that they had been debugging against production data on their laptops, which was the exact thing we were trying to stop. We had to build proper sanitized staging data and better observability so they did not need production secrets to diagnose production issues. Removing the credential exposed a workflow problem that the credential had been papering over.

Finally, Vault HA is operationally non-trivial. Auto-unseal with a KMS key, Raft snapshots to S3, and a tested restore procedure are not optional. We treat Vault as a tier-zero dependency, on the same footing as DNS and the EKS control plane, because a Vault outage now blocks new workloads from starting. That is a deliberate trade we accepted to get static secrets out of the system, but it is a trade, and you should make it consciously.


The number that mattered

When we started, we had 47 active IAM access keys across the organization and an unknown number of static secrets in Kubernetes and CI. When we finished, we had four break-glass keys, stored offline, used zero times, and alerted on every use. Every other credential in the system is short-lived and tied to a verifiable identity: an OIDC token from GitHub, a service account in EKS, or an SSO session from a developer.

The security win is obvious, but the operational win surprised us more. Offboarding a person is now one action in the IdP. Rotating a database password is a Vault config change that propagates as leases expire. An audit request that used to mean a week of grep across repos is now a query against CloudTrail and Vault audit logs that returns in minutes. We removed an entire category of work, the manual rotation ticket that nobody filed, by making the system rotate itself.

The credential you never have to rotate is the one that expires on its own. Build for issuance tied to identity, keep lifetimes just longer than your longest unit of work, and put a hard policy fence around the old way so it cannot grow back. The static key in a CI secret is not a configuration mistake. It is a lifecycle you forgot to design.