Idempotency Keys for Payment APIs: Surviving Retries, Timeouts, and Double-Charges

A customer got charged twice for the same $4,200 invoice. The mobile app had timed out after 30 seconds, shown a spinner that never resolved, and the user tapped Pay again. Both requests reached our payment service. Both succeeded against the gateway. Our support queue lit up, and the refund took four business days to clear because it crossed a settlement boundary.

The request was not malformed. The gateway was not down. Our code did exactly what it was told: charge the card. Twice. The bug was not in any single line of code. It was in the absence of a contract that says "if you have seen this request before, do not do the work again, return the original answer."

That contract is idempotency, and getting it right is harder than the textbook version implies. The textbook version says "store a key, check if it exists, return the cached response." That handles the happy path and nothing else. It does not handle two requests with the same key arriving 5 milliseconds apart. It does not handle a request that crashes halfway through. It does not handle a client that reuses a key with a different request body. This post is about the version that handles all three, built on Postgres, running in production behind a fintech ledger.


Why "at least once" is the only delivery guarantee you get

Every layer between a user's thumb and your database can retry. The mobile network retries. The load balancer retries on connection reset. The client SDK retries on 5xx. Your own service mesh retries on timeout. None of these layers knows whether the previous attempt actually completed, because the failure it observed was the loss of a response, not the loss of the work.

This is the fundamental asymmetry of distributed systems: a timeout tells you nothing. When a request to POST /payments times out, the work may have fully succeeded, partially succeeded, or never started. The caller cannot distinguish these cases from the outside. So a correct caller must retry, and a correct server must be able to absorb that retry without doing the work twice.

People reach for "exactly once delivery" as if it were a setting you can enable. It is not. What you can build is at-least-once delivery plus idempotent processing, and the combination is observationally equivalent to exactly once. The delivery layer keeps trying until it gets an answer. The processing layer guarantees that trying ten times has the same effect as trying once. Idempotency is the second half of that pair, and it is the half you own.
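
To make the pairing concrete, here is a minimal sketch in plain Python. FlakyNetwork, IdempotentCharger, and deliver_at_least_once are hypothetical names invented for illustration, and the in-memory dict stands in for durable storage. The network always delivers the request but loses the first three responses, so the caller retries blindly:

```python
class FlakyNetwork:
    """Hypothetical network: the request always arrives, some responses are lost."""

    def __init__(self, drop_first_n: int):
        self.drops_left = drop_first_n

    def call(self, handler, key, amount):
        result = handler(key, amount)      # the work runs on every attempt
        if self.drops_left > 0:
            self.drops_left -= 1
            raise TimeoutError("response lost")  # caller cannot tell what happened
        return result


class IdempotentCharger:
    """Hypothetical handler: remembers the response it produced for each key."""

    def __init__(self):
        self.results = {}   # key -> stored response (assumed durable in real life)
        self.charges = []   # side effects that were actually performed

    def charge(self, key, amount):
        if key in self.results:            # seen before: replay, do no work
            return self.results[key]
        self.charges.append(amount)
        response = {"charge_no": len(self.charges), "amount": amount}
        self.results[key] = response
        return response


def deliver_at_least_once(network, handler, key, amount, max_attempts=10):
    """Keep retrying with the same key until a response gets through."""
    for _ in range(max_attempts):
        try:
            return network.call(handler, key, amount)
        except TimeoutError:
            continue
    raise RuntimeError("gave up")


charger = IdempotentCharger()
network = FlakyNetwork(drop_first_n=3)
response = deliver_at_least_once(network, charger.charge, "key-1", 420000)
```

The handler was invoked four times, but charger.charges holds a single entry, and the caller eventually receives the response produced by the first execution.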

For read operations this is free. GET is idempotent by definition. The problem is the operations that move money, mutate balances, and call external gateways: POST and sometimes PATCH. Those are the ones a naive retry turns into a double-charge.


The contract: what an idempotency key actually promises

We settled on a contract borrowed loosely from Stripe's API, because it had survived years of production abuse and our clients already understood it. The client generates a key, sends it in a header, and the server promises:

  1. The same key with the same request body returns the same response, with exactly one execution's worth of side effects.
  2. The same key with a different request body is an error, not a silent overwrite.
  3. Two concurrent requests with the same key never both execute the work. One wins, the other waits or gets told to retry.
  4. A request that crashes mid-flight does not leave a poisoned key that blocks all future retries.

That fourth point is where most homegrown implementations fall apart. They store the key at the start, the handler crashes, and now every retry sees the key and returns a cached response that was never written. The customer's payment is stuck forever in a state that looks complete but is not.

The header looks like this:

POST /v1/payments HTTP/1.1
Content-Type: application/json
Idempotency-Key: 7c9e6679-7425-40de-944b-e07fc1f90ae7

{"invoice_id": "inv_8812", "amount_cents": 420000, "currency": "USD"}

The key is a client-generated UUID, scoped per endpoint and per authenticated account. We do not let one account's key collide with another's, and we do not let a key for /payments be reused on /refunds. The scoping matters: a single global key namespace sounds cleaner but makes debugging miserable and creates a tempting target for cross-tenant replay.


The schema, and why the request fingerprint earns its column

Here is the table we landed on after two rewrites:

CREATE TABLE idempotency_keys (
    id              BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    account_id      BIGINT      NOT NULL,
    endpoint        TEXT        NOT NULL,
    idempotency_key TEXT        NOT NULL,
    request_hash    TEXT        NOT NULL,
    status          TEXT        NOT NULL DEFAULT 'in_progress',
    response_code   INT,
    response_body   JSONB,
    locked_at       TIMESTAMPTZ,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    completed_at    TIMESTAMPTZ,
    expires_at      TIMESTAMPTZ NOT NULL,

    CONSTRAINT uq_idempotency UNIQUE (account_id, endpoint, idempotency_key),
    CONSTRAINT chk_status CHECK (status IN ('in_progress', 'completed', 'failed'))
);

CREATE INDEX idx_idempotency_expiry ON idempotency_keys (expires_at);

The request_hash column is the one people skip and regret. It is a SHA-256 of the canonicalized request body. Without it, you cannot detect contract violation number two: a client that reuses a key with a different payload. That happens more than you would think, usually because of a buggy client that generates one key per screen load and then lets the user edit the amount before submitting. If you ignore the body, you return the cached response for the old amount, and the customer swears they paid $50 but your records say $30.

Canonicalization matters because JSON is not byte-stable. Key order, whitespace, and number formatting all vary. We canonicalize before hashing:

import hashlib
import json

def request_fingerprint(body: dict) -> str:
    canonical = json.dumps(
        body,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
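
As a quick check of what this canonicalization does and does not absorb, the sketch below applies the same json.dumps settings to raw request bytes. The name fingerprint_raw and the sample payloads are illustrative assumptions, not part of the service described here:

```python
import hashlib
import json


def fingerprint_raw(raw: bytes) -> str:
    """Parse raw JSON bytes, then hash the canonical re-serialization."""
    body = json.loads(raw)
    canonical = json.dumps(
        body, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = b'{"invoice_id": "inv_8812", "amount_cents": 420000, "currency": "USD"}'
# same payload, different key order and whitespace
b = b'{"currency":"USD",\n  "amount_cents":420000,"invoice_id":"inv_8812"}'
# genuinely different amount
c = b'{"invoice_id": "inv_8812", "amount_cents": 420001, "currency": "USD"}'
# same value written as a float literal
d = b'{"invoice_id": "inv_8812", "amount_cents": 420000.0, "currency": "USD"}'

same_despite_layout = fingerprint_raw(a) == fingerprint_raw(b)
differs_on_amount = fingerprint_raw(a) != fingerprint_raw(c)
differs_on_float_form = fingerprint_raw(a) != fingerprint_raw(d)
```

Key order and whitespace collapse to one fingerprint, and a changed amount does not. Note the last case: Python's json module keeps 420000 and 420000.0 as different types, so they serialize and hash differently. Keeping money fields as integer cents, as the request above does, sidesteps that.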

The UNIQUE constraint on (account_id, endpoint, idempotency_key) is doing the heavy lifting for concurrency. We do not check-then-insert in application code, because that is a textbook race. We insert and let the database reject the duplicate. The constraint is the source of truth, and the database serializes the contention for us.

The status column has three values and they are load-bearing. in_progress means a request is currently executing under this key. completed means the work finished and response_body holds the answer. failed means the work terminated in a way that is safe to retry. The failed state is what fixes the poisoned-key problem.


The flow, step by step

The request handler does six things in order. Each one matters.

Step 1: Try to claim the key

INSERT INTO idempotency_keys
    (account_id, endpoint, idempotency_key, request_hash, status, locked_at, expires_at)
VALUES
    ($1, $2, $3, $4, 'in_progress', now(), now() + interval '24 hours')
ON CONFLICT (account_id, endpoint, idempotency_key) DO NOTHING
RETURNING id;

If this returns a row, you won the race. You hold the claim. Proceed to do the work.

If it returns nothing, the key already exists. Somebody else got there first, or this is a retry of your own earlier request. You now have to read what state that existing key is in.

This INSERT ... ON CONFLICT DO NOTHING is atomic. There is no window where two requests both see "key does not exist" and both proceed. The database guarantees exactly one of them gets the RETURNING row. This is why we do not use a SELECT then INSERT pattern; the gap between them is precisely where double-charges breed.
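
The claim can be tried out locally with a small sketch. It uses Python's built-in sqlite3 as a stand-in for Postgres, with INSERT OR IGNORE and the cursor's rowcount playing the role of ON CONFLICT DO NOTHING ... RETURNING. The trimmed-down table and the claim_key signature are assumptions made for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE idempotency_keys (
        account_id      INTEGER NOT NULL,
        endpoint        TEXT    NOT NULL,
        idempotency_key TEXT    NOT NULL,
        request_hash    TEXT    NOT NULL,
        status          TEXT    NOT NULL DEFAULT 'in_progress',
        UNIQUE (account_id, endpoint, idempotency_key)
    )
    """
)


def claim_key(conn, account_id, endpoint, key, request_hash) -> bool:
    """Return True only for the caller whose insert actually created the row."""
    with conn:  # commit the claim on its own, before any work starts
        cur = conn.execute(
            "INSERT OR IGNORE INTO idempotency_keys "
            "(account_id, endpoint, idempotency_key, request_hash) "
            "VALUES (?, ?, ?, ?)",
            (account_id, endpoint, key, request_hash),
        )
    return cur.rowcount == 1  # 0 means the unique constraint rejected us


first = claim_key(conn, 42, "/payments", "k1", "hash-a")
second = claim_key(conn, 42, "/payments", "k1", "hash-a")
other_account = claim_key(conn, 43, "/payments", "k1", "hash-a")
other_endpoint = claim_key(conn, 42, "/refunds", "k1", "hash-a")
```

Only the first call wins the claim for account 42 on /payments. The same key string under a different account or endpoint is a separate claim, which is the scoping the unique constraint encodes.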

Step 2: If the key already exists, branch on status

SELECT status, request_hash, response_code, response_body, locked_at
FROM idempotency_keys
WHERE account_id = $1 AND endpoint = $2 AND idempotency_key = $3;

Three outcomes:

  • completed: validate the request_hash matches the incoming request. If it matches, return the stored response_code and response_body verbatim. If it does not match, return 422 Unprocessable Entity with an error explaining the key was reused with a different body. Do not do the work.
  • in_progress: another request is currently executing. Check locked_at. If the lease is fresh, return 409 Conflict and tell the client to retry shortly. If the lease is stale (the holder crashed), reclaim it, covered below.
  • failed: the previous attempt failed in a retryable way. Reset the row to in_progress, take a fresh lease, and execute as if new.
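
The three branches can be written as one pure decision function, which makes them easy to unit test without a database. This is a sketch under assumptions: Action, KeyRow, and decide are hypothetical names, the 60-second lease is hard-coded, and the hash comparison is placed ahead of every branch (following the heading of step 3) rather than only on completed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class Action(Enum):
    REPLAY = "return the stored response"
    REJECT_MISMATCH = "422: key reused with a different body"
    WAIT = "409: another request holds a fresh lease"
    RECLAIM = "attempt to take over a stale lease"
    RERUN = "reset the failed row and execute"


@dataclass
class KeyRow:
    status: str
    request_hash: str
    locked_at: Optional[datetime]


LEASE = timedelta(seconds=60)  # assumed lease length


def decide(row: KeyRow, incoming_hash: str, now: datetime) -> Action:
    """Map an existing key row to the action the handler should take."""
    if row.request_hash != incoming_hash:
        return Action.REJECT_MISMATCH
    if row.status == "completed":
        return Action.REPLAY
    if row.status == "in_progress":
        # Mirrors `locked_at < now() - interval`: a NULL lease is never stale.
        if row.locked_at is not None and row.locked_at < now - LEASE:
            return Action.RECLAIM
        return Action.WAIT
    if row.status == "failed":
        return Action.RERUN
    raise ValueError(f"unknown status: {row.status!r}")


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
fresh = KeyRow("in_progress", "h", now - timedelta(seconds=5))
stale = KeyRow("in_progress", "h", now - timedelta(seconds=61))
```

RECLAIM and RERUN are only decisions. The state change itself still has to be a conditional UPDATE in the database, so that two retries reaching the same decision cannot both proceed.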

Step 3: Validate the request hash before doing anything

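The winning path stores the hash with the claim, so step 2's comparison always has something to check against. That comparison is what enforces contract number two, and it belongs ahead of every branch, not only completed: a mismatched body on an in_progress or failed key is the same client bug, and re-running a failed key with a different body would leave a stored hash that no longer describes the work. On a mismatch we return 422 with a structured error: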

{
  "error": {
    "type": "idempotency_key_reuse",
    "message": "This Idempotency-Key was previously used with a different request body."
  }
}

Step 4: Do the work inside a transaction boundary you control

This is the subtle part. The idempotency row and the business side effect must commit together, or you must accept that they cannot and design for it. For purely internal work like a ledger entry, wrap them:

BEGIN;

-- the business work
INSERT INTO ledger_entries (account_id, invoice_id, amount_cents, ...)
VALUES ($1, $2, $3, ...);

-- mark the key completed in the same transaction
UPDATE idempotency_keys
SET status = 'completed',
    response_code = 201,
    response_body = $4,
    completed_at = now()
WHERE account_id = $5 AND endpoint = $6 AND idempotency_key = $7;

COMMIT;

If this transaction commits, both the ledger entry and the completed key are durable together. If it rolls back, neither happened, and the key stays in_progress until reclaimed. No half-states. The external-gateway case, where the side effect cannot share this transaction, is covered below.
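
The all-or-nothing behavior is easy to demonstrate. The sketch below uses sqlite3 in place of Postgres and two deliberately tiny tables. record_payment, snapshot, and the crash_midway flag are hypothetical helpers for the demonstration:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ledger_entries (
        invoice_id   TEXT    NOT NULL,
        amount_cents INTEGER NOT NULL
    );
    CREATE TABLE idempotency_keys (
        idempotency_key TEXT PRIMARY KEY,
        status          TEXT NOT NULL,
        response_body   TEXT
    );
    INSERT INTO idempotency_keys (idempotency_key, status)
    VALUES ('k1', 'in_progress');
    """
)


def record_payment(conn, key, invoice_id, amount_cents, crash_midway=False):
    """Write the ledger entry and complete the key in one transaction."""
    response = {"invoice_id": invoice_id, "status": "recorded"}
    with conn:  # commits on success, rolls back if the block raises
        conn.execute(
            "INSERT INTO ledger_entries (invoice_id, amount_cents) VALUES (?, ?)",
            (invoice_id, amount_cents),
        )
        if crash_midway:
            raise RuntimeError("simulated crash before the key was completed")
        conn.execute(
            "UPDATE idempotency_keys SET status = 'completed', response_body = ? "
            "WHERE idempotency_key = ?",
            (json.dumps(response), key),
        )
    return response


def snapshot(conn, key):
    """Return (number of ledger entries, status of the key)."""
    entries = conn.execute("SELECT COUNT(*) FROM ledger_entries").fetchone()[0]
    status = conn.execute(
        "SELECT status FROM idempotency_keys WHERE idempotency_key = ?", (key,)
    ).fetchone()[0]
    return entries, status


try:
    record_payment(conn, "k1", "inv_8812", 420000, crash_midway=True)
except RuntimeError:
    pass
after_crash = snapshot(conn, "k1")

record_payment(conn, "k1", "inv_8812", 420000)
after_success = snapshot(conn, "k1")
```

After the simulated crash the ledger is empty and the key is still in_progress. After the clean run there is exactly one entry and the key is completed.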

Step 5: On any failure, mark the key failed (do not delete it)

The instinct is to DELETE the key on failure so the retry starts clean. Resist it. Deleting loses the audit trail and reopens a window where a concurrent retry inserts fresh and runs in parallel. Instead, mark it failed:

UPDATE idempotency_keys
SET status = 'failed', locked_at = NULL
WHERE account_id = $1 AND endpoint = $2 AND idempotency_key = $3;

A retry then sees failed and re-runs (step 2's third branch). The row is reused, the history is preserved.

Step 6: Return

Whatever the handler produced becomes the stored response. Subsequent retries with the same key and body get this exact response. The client cannot tell whether it hit the original execution or a replay, which is the whole point.


The crash recovery problem

Here is the failure that breaks naive implementations. A request claims the key, sets status in_progress, starts the work, and then the pod gets OOM-killed. The work transaction never commits, but the claim might already be durable.

If the crash happened before the INSERT ... ON CONFLICT committed, there is no row, and a retry starts clean. Fine.

If the crash happened after the claim committed but before the work transaction committed, the row sits at in_progress forever. Every retry sees in_progress, returns 409 Conflict, and the client retries into a wall. The payment is now permanently stuck, and no amount of retrying will resolve it.

The fix is a lease, not a lock. The locked_at timestamp is a lease with an expiry. When a retry sees in_progress, it checks how old the lease is:

UPDATE idempotency_keys
SET locked_at = now()
WHERE account_id = $1 AND endpoint = $2 AND idempotency_key = $3
  AND status = 'in_progress'
  AND locked_at < now() - interval '60 seconds'
RETURNING id;

If this returns a row, the previous holder's lease expired and we have reclaimed it. We can now safely re-execute. If it returns nothing, either the lease is still fresh (a live request is working) or the status changed while we were checking. Either way we re-read and branch again.

The lease duration is a tuning decision. Too short and you reclaim a key from a request that is merely slow, causing genuine parallel execution. Too long and a crashed request blocks retries for that whole duration. We set it longer than our longest request timeout, with margin to spare. If a single payment can take up to 30 seconds because the gateway is slow, the lease has to outlast 30 seconds comfortably. We use 60.
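
The reclaim is a compare-and-set: the UPDATE only matches while the lease is still stale, so of several retries that notice the same dead holder, one takes over. A minimal sketch with sqlite3 standing in for Postgres, timestamps as plain epoch seconds passed in by the caller, and try_reclaim as a hypothetical helper name:

```python
import sqlite3

LEASE_SECONDS = 60  # assumed lease length, matching the interval in the SQL above

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE idempotency_keys (
        idempotency_key TEXT PRIMARY KEY,
        status          TEXT NOT NULL,
        locked_at       REAL
    )
    """
)
# A holder claimed the key at t=1000 and then died without finishing.
conn.execute("INSERT INTO idempotency_keys VALUES ('k1', 'in_progress', 1000.0)")
conn.commit()


def try_reclaim(conn, key, now: float) -> bool:
    """Take over the lease only if it is older than LEASE_SECONDS."""
    with conn:
        cur = conn.execute(
            "UPDATE idempotency_keys SET locked_at = ? "
            "WHERE idempotency_key = ? "
            "  AND status = 'in_progress' "
            "  AND locked_at < ?",
            (now, key, now - LEASE_SECONDS),
        )
    return cur.rowcount == 1


too_early = try_reclaim(conn, "k1", now=1030.0)  # lease is only 30s old
winner = try_reclaim(conn, "k1", now=1061.0)     # lease is 61s old
loser = try_reclaim(conn, "k1", now=1061.5)      # winner just refreshed it
```

The retry at 30 seconds is turned away, the retry at 61 seconds takes the lease, and a second retry half a second later finds a fresh lease and backs off.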


The external side effect: when you cannot use a transaction

Everything above assumes the side effect lives in your database, where it can share a transaction with the idempotency row. Payments break that assumption. Charging a card means an HTTP call to Stripe or Adyen, which a Postgres ROLLBACK cannot undo. If you charge the card and then your commit fails, you have taken the money and lost the record.

The answer is that the external gateway must also be idempotent, and you must propagate your idempotency through to it. Every serious payment gateway supports this. Stripe takes an Idempotency-Key on its own API. So we derive the gateway key deterministically from our key:

import hashlib

def gateway_idempotency_key(our_key: str, account_id: int) -> str:
    seed = f"{account_id}:{our_key}:charge"
    return hashlib.sha256(seed.encode()).hexdigest()

Now the flow is a chain of idempotent steps:

  1. Claim our key (in_progress).
  2. Call the gateway with the derived gateway key. If we already called it during a prior crashed attempt, the gateway recognizes its key and returns the same charge rather than creating a new one. No double-charge at the gateway, even if our service crashed and retried.
  3. Record the gateway's charge ID and mark our key completed, in one transaction.

The critical ordering: call the gateway before committing your local state, and rely on the gateway's idempotency to make the call replay-safe. If step 3 crashes after the gateway charged, the retry re-calls the gateway, gets the same charge back, and this time succeeds in committing. The charge happened once; your record catches up.

This is the part teams get wrong by trying to be clever. They commit a "pending" row first, then call the gateway, then update to "complete." That creates a window where a pending row exists with no confirmed charge, and reconciliation has to guess. The cleaner model: the gateway is the source of truth for whether the money moved, your idempotency key chains to it, and you only ever record what the gateway confirms.

def process_payment(account_id, our_key, request_body):
    fingerprint = request_fingerprint(request_body)

    # step 1: atomically claim the key; a falsy result means it already exists
    claimed = claim_key(account_id, "/payments", our_key, fingerprint)
    if not claimed:
        return handle_existing_key(account_id, "/payments", our_key, fingerprint)

    try:
        # the gateway key is derived, so a re-run after a crash reuses it
        gw_key = gateway_idempotency_key(our_key, account_id)
        charge = stripe.PaymentIntent.create(
            amount=request_body["amount_cents"],
            currency=request_body["currency"],
            idempotency_key=gw_key,
        )
        response = {"charge_id": charge.id, "status": charge.status}
        complete_key(account_id, "/payments", our_key, 201, response)
        return 201, response
    except RetryableGatewayError:
        fail_key(account_id, "/payments", our_key)
        return 503, {"error": "gateway_unavailable"}
    except PermanentGatewayError as e:
        # a card decline is a real, final outcome, store it as completed
        response = {"status": "declined", "reason": e.code}
        complete_key(account_id, "/payments", our_key, 402, response)
        return 402, response
    except Exception:
        # step 5: any other failure, including a failed local commit, marks the
        # key failed so a retry re-runs against the same gateway key
        fail_key(account_id, "/payments", our_key)
        raise

Note the distinction between RetryableGatewayError and PermanentGatewayError. A timeout or a 503 from the gateway is retryable, so we mark failed and let the client try again. A card decline is a permanent and valid outcome, so we mark it completed with a 402. A retry of a declined card with the same key returns the same decline, which is correct. The customer's card is still declined; retrying should not silently re-attempt it.
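
The crash-after-charge ordering can be exercised end to end against a fake gateway. Everything in this sketch is a hypothetical stand-in: FakeGateway models only the one property that matters (same key, same charge), and Service keeps its "committed" state in a dict instead of Postgres.

```python
import hashlib
import itertools


def gateway_idempotency_key(our_key: str, account_id: int) -> str:
    seed = f"{account_id}:{our_key}:charge"
    return hashlib.sha256(seed.encode()).hexdigest()


class FakeGateway:
    """Hypothetical gateway that honors idempotency keys."""

    def __init__(self):
        self._by_key = {}
        self._ids = itertools.count(1)
        self.request_count = 0  # every HTTP-level call, including replays

    def create_charge(self, amount_cents: int, idempotency_key: str) -> dict:
        self.request_count += 1
        if idempotency_key not in self._by_key:  # first time: money moves
            self._by_key[idempotency_key] = {
                "id": f"ch_{next(self._ids)}",
                "amount_cents": amount_cents,
            }
        return self._by_key[idempotency_key]     # replay: same charge back

    @property
    def charge_count(self) -> int:
        return len(self._by_key)


class Service:
    def __init__(self, gateway: FakeGateway):
        self.gateway = gateway
        self.completed = {}  # our_key -> charge id, stands in for the local commit

    def pay(self, account_id, our_key, amount_cents, crash_before_commit=False):
        gw_key = gateway_idempotency_key(our_key, account_id)
        charge = self.gateway.create_charge(amount_cents, gw_key)
        if crash_before_commit:
            raise RuntimeError("simulated crash after the gateway charged")
        self.completed[our_key] = charge["id"]
        return charge["id"]


gateway = FakeGateway()
service = Service(gateway)

try:
    service.pay(42, "k1", 7000, crash_before_commit=True)  # charged, not recorded
except RuntimeError:
    pass

charge_id = service.pay(42, "k1", 7000)  # retry: same gateway key, same charge
```

The gateway received two requests but created one charge, and the retry records the charge the first attempt had already made.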


Concurrent requests with the same key, exactly

Let me trace the worst case explicitly because it is the one that keeps you up at night. Two requests, same idempotency key, arrive 2 milliseconds apart on two different pods.

Both run INSERT ... ON CONFLICT DO NOTHING. The database serializes these on the unique constraint. Exactly one gets the RETURNING row. Say pod A wins. Pod A proceeds to call the gateway.

Pod B's insert conflicts and returns no row. Pod B reads the existing row and sees in_progress with a fresh locked_at. Pod B returns 409 Conflict to its client: a request with this key is already being processed, retry shortly.

The client behind pod B retries after a short backoff. By then, pod A has likely finished and marked the key completed. Pod B's retry reads completed, validates the hash, and returns the stored response. The client gets the right answer. The gateway was called exactly once. No double-charge.

The 409 is not a failure; it is flow control. It tells the client to slow down because an earlier request is in flight. Clients must treat 409 on an idempotent endpoint as retryable with backoff, not as an error to surface to the user. We document this loudly in the API spec, because a client that surfaces the 409 as "payment failed" creates exactly the confused-user retry storm we were trying to eliminate.
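
On the client side, that contract comes down to two rules: generate the key once per logical payment, and treat timeouts, 409, and 5xx as "ask again with the same key". A sketch of such a loop follows. pay_with_retries, the injected send callable, and the particular set of retryable codes are assumptions for illustration, not a description of any real SDK.

```python
import time
import uuid
from typing import Callable, Optional, Tuple

RETRYABLE = {409, 500, 502, 503, 504}  # assumed retryable statuses


def pay_with_retries(
    send: Callable[[str, dict], Tuple[int, dict]],
    body: dict,
    max_attempts: int = 5,
    base_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, dict]:
    """Send one logical payment, reusing a single key across every attempt."""
    key = str(uuid.uuid4())  # generated once, never per attempt
    for attempt in range(max_attempts):
        status: Optional[int]
        try:
            status, payload = send(key, body)
        except TimeoutError:
            status, payload = None, {}  # outcome unknown: must ask again
        if status is not None and status not in RETRYABLE:
            return status, payload      # a final answer, success or not
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise RuntimeError("payment outcome unknown after retries")


# Scripted server: a lost response, then a 409, then the stored result.
seen_keys = []
script = iter([TimeoutError(), (409, {}), (201, {"charge_id": "ch_1"})])


def fake_send(key, body):
    seen_keys.append(key)
    step = next(script)
    if isinstance(step, Exception):
        raise step
    return step


delays = []
result = pay_with_retries(fake_send, {"amount_cents": 5000}, sleep=delays.append)
```

All three attempts carry the same key, the 409 never reaches the user, and the caller ends up with the 201.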


Expiry and cleanup

Idempotency keys are not forever. We expire them after 24 hours, which comfortably exceeds any reasonable client retry window. A client that retries a day later is not retrying; it is making a new request and should generate a new key.

The expires_at column and its index let a background job sweep:

DELETE FROM idempotency_keys
WHERE id IN (
    SELECT id FROM idempotency_keys
    WHERE expires_at < now()
      AND status IN ('completed', 'failed')
    LIMIT 10000
);

We never sweep in_progress rows on expiry. An expired in_progress row means a request crashed and never resolved, and that is a signal worth investigating, not garbage to delete. We alert on in_progress rows older than an hour. In practice these are almost always a downstream dependency that hung in a way our timeouts did not catch.

We run cleanup in bounded batches to avoid a single giant DELETE that locks the table and bloats the WAL. At our volume, roughly 400,000 keys a day, a batched delete every few minutes keeps the table small enough that the unique-constraint lookups stay on a hot index.
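
A sweep loop along these lines might look like the sketch below. It runs against sqlite3 rather than Postgres, uses epoch seconds for expires_at, and sweep_expired is a hypothetical name. Each batch commits on its own, and the loop stops as soon as a batch comes back short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE idempotency_keys (
        id         INTEGER PRIMARY KEY,
        status     TEXT NOT NULL,
        expires_at REAL NOT NULL
    )
    """
)
rows = (
    [("completed", 100.0)] * 20      # expired, sweepable
    + [("failed", 100.0)] * 5        # expired, sweepable
    + [("in_progress", 100.0)] * 3   # expired but never swept
    + [("completed", 9999.0)] * 2    # not yet expired
)
conn.executemany(
    "INSERT INTO idempotency_keys (status, expires_at) VALUES (?, ?)", rows
)
conn.commit()


def sweep_expired(conn, now: float, batch_size: int) -> list:
    """Delete expired terminal rows in bounded batches; return batch sizes."""
    batches = []
    while True:
        with conn:  # one short transaction per batch
            cur = conn.execute(
                "DELETE FROM idempotency_keys WHERE id IN ("
                "  SELECT id FROM idempotency_keys"
                "  WHERE expires_at < ?"
                "    AND status IN ('completed', 'failed')"
                "  LIMIT ?)",
                (now, batch_size),
            )
        batches.append(cur.rowcount)
        if cur.rowcount < batch_size:  # short batch: nothing left to sweep
            return batches


batches = sweep_expired(conn, now=500.0, batch_size=10)
remaining = conn.execute(
    "SELECT status, COUNT(*) FROM idempotency_keys GROUP BY status ORDER BY status"
).fetchall()
```

With 25 sweepable rows and a batch size of 10, the loop runs three batches and leaves behind only the unexpired rows and the expired in_progress rows that are meant to trigger an alert.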


Why we did not use Redis for this

The obvious objection: this is all just a key-value lookup, so why not Redis with a TTL? We tried. Redis is faster on the read path and the TTL gives you expiry for free. But Redis cost us the one thing we needed most: a transactional boundary shared with the business write.

When the idempotency record lives in Redis and the ledger entry lives in Postgres, you have a two-system commit problem. The ledger write succeeds, then the Redis write fails, and now your key says "never happened" while your ledger says it did. Retries re-run the ledger write. You are back to duplicate ledger entries. You can paper over it with Lua scripts and careful ordering, but you are reconstructing transactionality that Postgres gives you for free.

There is a real cost to the Postgres approach. It adds an insert and an update to the hot path of every mutating request, and the unique-constraint contention is a genuine serialization point under concurrent retries. At our scale that cost is single-digit milliseconds and entirely worth it. If you are processing a million idempotent writes a second, the calculus changes and you build a sharded, purpose-built store. We are not, and neither are most fintechs. Reaching for Redis here optimizes a latency budget you already have at the expense of a correctness guarantee you cannot afford to lose.


Testing the cases that actually fail

The happy path is trivial to test and tells you nothing. The tests that matter simulate the failures:

import uuid
from concurrent.futures import ThreadPoolExecutor

def test_concurrent_same_key_charges_once(self):
    key = str(uuid.uuid4())
    body = {"invoice_id": "inv_1", "amount_cents": 5000, "currency": "USD"}

    # fire two requests with the same key and body at the same time
    with ThreadPoolExecutor(max_workers=2) as ex:
        f1 = ex.submit(self.client.pay, key, body)
        f2 = ex.submit(self.client.pay, key, body)
        r1, r2 = f1.result(), f2.result()

    # exactly one gateway charge created
    assert self.fake_gateway.charge_count == 1
    # both responses agree, or one is a 409 that resolves on retry
    statuses = sorted([r1.status_code, r2.status_code])
    assert statuses in ([201, 201], [201, 409])

We also test the crash-recovery lease by inserting an in_progress row with a stale locked_at and asserting the next request reclaims it. And we test the body-mismatch path by reusing a completed key with a changed amount and asserting a 422.

The single most valuable test injects a crash between the gateway call and the local commit, then retries, and asserts that although the retry calls the gateway again with the same idempotency key, only one charge results and the key ends up completed. That test is the entire reason the system exists, encoded as an assertion.

def test_crash_after_gateway_before_commit(self):
    key = str(uuid.uuid4())
    body = {"invoice_id": "inv_2", "amount_cents": 7000, "currency": "USD"}

    # first attempt: gateway succeeds, commit is forced to fail
    with self.force_commit_failure():
        with pytest.raises(CommitError):
            self.service.pay(key, body)

    # gateway saw one charge, our key is failed
    assert self.fake_gateway.charge_count == 1
    assert self.key_status(key) == "failed"

    # retry: re-calls gateway with same gw key, gets same charge, commits
    r = self.service.pay(key, body)
    assert r.status_code == 201
    assert self.fake_gateway.charge_count == 1  # still one, not two
    assert self.key_status(key) == "completed"


What this bought us

Duplicate-charge support tickets went to zero across the next two quarters. The reconciliation engine got simpler because there were no longer phantom pending states to chase. And we could finally turn on aggressive client-side retries, which improved our success rate under flaky mobile networks, because we trusted that retries were safe.

The code is not glamorous. It is one table, a unique constraint, a lease timestamp, and a discipline about transaction boundaries. The hard part was never the SQL. The hard part was internalizing that a timeout tells you nothing, that the network will retry whether you designed for it or not, and that the only defense is to make doing the work twice indistinguishable from doing it once.

If you take one thing from this: the unique constraint is your concurrency control, the lease is your crash recovery, and the request hash is your guard against clients reusing keys carelessly. Skip any of the three and you have an idempotency layer that works in the demo and double-charges someone in week three.