Published on

Outbox, Not Dual Writes: Getting Events and Database State to Agree

Authors

A ledger entry got written. The downstream notification never fired. The customer's balance updated in our database, our reconciliation dashboard showed the transaction as settled, but the webhook that tells the merchant "funds received" never went out. The merchant called support, support escalated, and an engineer spent two hours staring at logs trying to figure out why a row existed in transactions but no corresponding message existed in the payment-events Kafka topic.

The code looked correct. It did two things in sequence: write the transaction to Postgres, then publish an event to Kafka. Both calls sat in the same method, wrapped in a sensible-looking try block. The Postgres write committed. The Kafka publish threw a TimeoutException because a broker was mid-restart during a rolling deploy. The exception was logged and swallowed by a generic handler three frames up. The database said one thing, the event stream said another, and nothing in the system noticed the disagreement.

This is the dual write problem, and it is one of the most common ways event-driven systems quietly lie to themselves. This post is about why dual writes cannot be made reliable no matter how carefully you order them, and how the transactional outbox pattern fixes it. The version here runs in production behind a fintech ledger on Postgres and Kafka.


The two writes that can never be atomic

The pattern that breaks looks like this:

fun recordPayment(payment: Payment) {
    transactionRepository.save(payment.toRow())   // write 1: Postgres
    kafkaTemplate.send("payment-events", payment.toEvent())  // write 2: Kafka
}

Two different systems. Two different commit protocols. No shared transaction. There are four possible outcomes when this method runs, and only one of them is correct:

  1. Postgres commits, Kafka publishes. Correct.
  2. Postgres fails, Kafka never runs. Consistent, the operation simply did not happen.
  3. Postgres commits, Kafka fails. The database has the truth, the event stream is missing it. Lost event.
  4. Kafka publishes, Postgres rolls back. The event stream announces something that never happened. Phantom event.

Outcomes 3 and 4 are the killers, and you cannot eliminate them by reordering the two writes. Publish to Kafka first and you trade lost events for phantom events. Write to Postgres first and you trade phantom events for lost events. No ordering makes both writes succeed or fail together, because they are not in the same transaction and cannot be.

The instinct is to reach for distributed transactions: two-phase commit across Postgres and Kafka. Resist it. Kafka's transaction support is built for exactly-once between Kafka topics, not for enrolling an external database as an XA participant. Even where you can bolt on a 2PC coordinator, it becomes a single point of failure, and the prepare-commit round trip across two systems on every write is latency a payment path should not absorb. Worse, 2PC blocks: if the coordinator dies between prepare and commit, participants hold locks until it recovers. That is the opposite of what you want in a system that needs to stay up.

The asymmetry that makes this hard is the same one behind retries and idempotency: a failure tells you nothing about whether the work completed. When kafkaTemplate.send throws a timeout, you do not know if the broker received the message and the ack was lost, or if the message never arrived. You cannot safely retry without risking a duplicate, and you cannot safely skip without risking a loss.


The insight: make the event part of the database transaction

The outbox pattern sidesteps the whole problem by refusing to do two writes to two systems. Instead you do one write to one system: the business state and the intent to publish an event go into the same database, in the same transaction.

@Transactional
fun recordPayment(payment: Payment) {
    transactionRepository.save(payment.toRow())
    outboxRepository.save(payment.toOutboxRecord())  // same transaction
}

Both writes are now governed by Postgres transaction semantics. They commit together or roll back together. There is no window where one exists and the other does not. The four-outcome problem collapses to two: either the payment and its outbox record both exist, or neither does.

A separate process reads the outbox table and publishes its rows to Kafka. That process can retry as aggressively as it wants, because the outbox row is durable. If Kafka is down, the row stays in the table. When Kafka recovers, the row gets published. The event is never lost, because its existence is guaranteed by the same commit that guaranteed the business state.

The trade is explicit: you give up synchronous delivery. The event is not published the instant the transaction commits. There is a delay, usually small, between the commit and the publish. In exchange you get a guarantee that the event will eventually be published as long as the business state exists. For a ledger, that trade is obviously correct. A few hundred milliseconds of publish latency is nothing compared to a silently lost settlement event.


The outbox table

Start with the schema. The shape matters more than it looks.

CREATE TABLE outbox (
    id              BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    aggregate_type  TEXT        NOT NULL,
    aggregate_id    TEXT        NOT NULL,
    event_type      TEXT        NOT NULL,
    payload         JSONB       NOT NULL,
    headers         JSONB       NOT NULL DEFAULT '{}',
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    published_at    TIMESTAMPTZ,
    attempts        INT         NOT NULL DEFAULT 0
);

CREATE INDEX idx_outbox_unpublished
    ON outbox (id)
    WHERE published_at IS NULL;

A few decisions are baked in here.

The id is a monotonic identity column, not a UUID. The relay reads in id order, and a sequential integer gives you a cheap, ordered cursor.

The aggregate_type and aggregate_id exist so the routing logic can key messages correctly. In Kafka the partition key gives you per-key ordering. If two events for the same account land in different partitions, consumers can process them out of order. The aggregate_id becomes the partition key, keeping all events for one account on one partition and therefore in order.

The partial index on published_at IS NULL is the most important line in the schema. The relay's hot query is "give me the oldest unpublished rows." A full index on id would force the planner to walk past published rows. The partial index only contains rows that still need work, so it shrinks as you publish and stays tiny even when the table itself holds tens of millions of historical rows.

The attempts column is for backoff and poison-message detection. A row that has failed to publish 50 times is telling you something is wrong with that specific message, not with Kafka.

Notice there is no status enum with values like PENDING and PUBLISHED. A nullable published_at does double duty: it tells you whether the row is published and when. Fewer columns, fewer states to reason about.


Whether to delete or keep published rows

You have to decide what happens to a row after it is published. Two camps.

Delete it. The outbox stays small and queries stay fast, since the table only ever holds unpublished rows. The downside is you lose the audit trail. If a downstream consumer claims it never got an event, you have nothing to point to.

Keep it and set published_at. You retain a complete record of every event the system emitted, which is gold for debugging and reconciliation. The downside is the table grows forever unless you prune it.

We keep rows and prune on a schedule. The outbox doubles as an event log, and in a fintech system the ability to answer "did we emit a settlement event for transaction X, and exactly when" is worth the storage. Pruning is a partitioned-table problem: partition by month, drop old partitions, never run a DELETE that scans the whole table.

CREATE TABLE outbox (
    id              BIGINT GENERATED ALWAYS AS IDENTITY,
    aggregate_type  TEXT        NOT NULL,
    aggregate_id    TEXT        NOT NULL,
    event_type      TEXT        NOT NULL,
    payload         JSONB       NOT NULL,
    headers         JSONB       NOT NULL DEFAULT '{}',
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    published_at    TIMESTAMPTZ,
    attempts        INT         NOT NULL DEFAULT 0
) PARTITION BY RANGE (created_at);

Note that a partitioned table needs the partition key in the primary key, so the PRIMARY KEY on id from the non-partitioned version becomes a plain identity column here, with uniqueness enforced per partition. Dropping a partition is a metadata operation that completes in milliseconds. A DELETE FROM outbox WHERE created_at < now() - interval '30 days' on a large table generates dead tuples, bloats the table, and makes autovacuum miserable. Use partitions.


The relay: two ways to get rows out

Something has to read the outbox and publish to Kafka. There are two fundamentally different ways to do this, and the choice has real operational consequences.

Polling the table

The simplest relay is a loop that polls for unpublished rows, publishes them, and marks them published.

@Scheduled(fixedDelay = 200)
@Transactional
fun relay() {
    val batch = jdbcTemplate.query(
        """
        SELECT id, aggregate_id, event_type, payload, headers
        FROM outbox
        WHERE published_at IS NULL
        ORDER BY id
        LIMIT 500
        FOR UPDATE SKIP LOCKED
        """,
        outboxRowMapper
    )

    for (row in batch) {
        val record = ProducerRecord(
            "payment-events",
            row.aggregateId,        // partition key
            row.payload
        )
        kafkaProducer.send(record).get()  // block until acked
        jdbcTemplate.update(
            "UPDATE outbox SET published_at = now() WHERE id = ?",
            row.id
        )
    }
}

The critical clause is FOR UPDATE SKIP LOCKED, and the whole method runs in one transaction so the row locks are held until commit. If you run more than one relay instance for availability, two pollers will otherwise grab the same rows and double-publish. FOR UPDATE locks the selected rows so another transaction cannot grab them. SKIP LOCKED tells the second poller to skip rows that are already locked and move on to the next available batch rather than blocking. Together they let you run N relay instances that each pull a disjoint set of rows without coordination.

Polling is easy to reason about, easy to deploy, and has one real weakness: a latency-versus-load trade-off. Poll every 200ms and you add up to 200ms of publish latency while hammering the database with queries that mostly return nothing. Poll every 5 seconds and you cut the query load but add 5 seconds of latency. There is no setting that is both low-latency and low-load. For a lot of systems polling is completely fine, and you should start here before reaching for anything fancier.

One refinement removes most of the wasted polling: Postgres LISTEN/NOTIFY. Fire a NOTIFY from the transaction that inserts into the outbox, and have the relay LISTEN on that channel. The relay sleeps until notified, then polls. You get near-instant publish on the happy path and a slow fallback poll as a safety net.

CREATE OR REPLACE FUNCTION notify_outbox() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('outbox', NEW.id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER outbox_notify
    AFTER INSERT ON outbox
    FOR EACH ROW EXECUTE FUNCTION notify_outbox();

Do not rely on NOTIFY alone. Notifications are not durable. If the relay is down when the notify fires, it never sees it, and on a partitioned table the trigger must be defined on each partition. The fallback poll catches anything the notify missed. NOTIFY is an optimization for latency; the poll is the correctness guarantee.

Change data capture with Debezium

The other approach reads the Postgres write-ahead log directly. Debezium is a connector that tails the WAL, turns row changes into events, and streams them to Kafka. Point it at the outbox table and every insert becomes a Kafka message automatically. No polling, no relay code, sub-second latency, and minimal query load on the database because the WAL is being read, not queried.

Debezium has a dedicated outbox event router for exactly this pattern:

{
  "name": "outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "ledger-db",
    "database.port": "5432",
    "database.dbname": "ledger",
    "database.user": "debezium",
    "plugin.name": "pgoutput",
    "table.include.list": "public.outbox",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.route.by.field": "aggregate_type",
    "transforms.outbox.table.field.event.key": "aggregate_id",
    "transforms.outbox.table.field.event.payload": "payload",
    "transforms.outbox.table.expand.json.payload": "true"
  }
}

The EventRouter transform reads the outbox row and routes it: route.by.field picks the topic from aggregate_type, event.key sets the Kafka partition key from aggregate_id, and event.payload extracts the JSON body. You write a row, Debezium does the rest.

This is the better answer at scale, and it is what we run. But it is not free. You now operate Kafka Connect, you manage a replication slot in Postgres, and you have to understand WAL retention. A replication slot that falls behind because the connector is down will pin WAL segments, and your database disk will fill. That is a real incident with a specific signature: pg_wal growing without bound. You monitor replication slot lag the same way you monitor consumer lag.

The decision between polling and CDC comes down to operational maturity. If you already run Kafka Connect for other connectors, Debezium is an easy add and the right call. If you do not, the polling relay is a few hundred lines you fully control, and it will carry you a long way. Do not stand up Kafka Connect solely for the outbox if polling meets your latency budget.


The other half: consumers must be idempotent

The outbox guarantees at-least-once delivery to Kafka. It does not guarantee exactly-once. The relay can publish a row, crash before marking it published, restart, and publish it again. Debezium can re-emit events on connector restart or rebalance. Kafka itself can redeliver a message to a consumer that rebalances mid-batch.

So every consumer of these events must be idempotent. This is not optional, and it is the part teams forget. They build a beautiful outbox, congratulate themselves on never losing an event, and then a downstream service processes the same payment.settled event twice and double-credits an account.

The outbox row's id is your deduplication key. Carry it as a header on the Kafka message and have consumers track which ids they have processed.

@KafkaListener(topics = ["payment-events"])
fun consume(record: ConsumerRecord<String, String>) {
    val eventId = record.headers()
        .lastHeader("outbox-id")
        .value()
        .let { String(it).toLong() }

    val firstTime = processedEventRepository.insertIfAbsent(eventId)
    if (!firstTime) {
        // already processed, ack and move on
        return
    }

    applyBusinessLogic(record)
}

The insertIfAbsent is an INSERT ... ON CONFLICT DO NOTHING against a table of processed event ids, returning whether the row was new. If the event was seen before, the insert conflicts, returns false, and the consumer skips the work. To stay correct under a crash between processing and ack, the dedup insert and the business write should share one database transaction, otherwise you can process the work and lose the dedup record. This is the same idempotency contract you need on any payment-handling endpoint: see a request you have processed, return without redoing the work.

The processed-events table needs the same pruning discipline as the outbox. Partition by time, drop old partitions, and set the retention window longer than your maximum plausible redelivery window.


Ordering, and where it actually matters

A lot of energy gets spent on event ordering that does not need to exist. Global ordering across all events is expensive and almost never required. What you usually need is per-aggregate ordering: all events for account 12345 are processed in the order they were produced. Events for account 67890 are unrelated and can be processed in any order relative to 12345.

Kafka gives you per-partition ordering. As long as all events for one aggregate land on one partition, they stay ordered. That is why the partition key is aggregate_id. The outbox relay reads rows in id order, and per-partition ordering on the consumer side preserves it.

One caveat on the producer side: ordering only holds if the producer does not reorder in-flight batches on retry. Set enable.idempotence=true (or keep max.in.flight.requests.per.connection at 1) so a retried publish cannot land behind a later one.

Where this breaks is if you change the partition count. Repartitioning rehashes keys to partitions, and an aggregate's events can split across the old and new partition. If you must repartition, drain the topic first or accept a brief ordering anomaly during the transition. In practice, size your partition count for years of growth up front so you never repartition a topic that carries ordered events.

There is a subtler ordering trap with the polling relay and concurrent transactions. Identity columns allocate values at insert time, but transactions commit in a different order than they allocated ids. Transaction A grabs id 100, transaction B grabs id 101, and B commits before A. A poller using WHERE id > last_seen_id could read up to 101, advance its cursor to 101, and then A commits id 100, which now sits below the cursor and never gets read. This is the lost-update-via-cursor bug, and it is nasty because it only shows up under concurrent writes.

The fix is to not advance a high-water cursor based on id alone. Filter on published_at IS NULL with FOR UPDATE SKIP LOCKED, which reads any unpublished row regardless of id and maintains no fragile cursor. This is another reason the partial index on unpublished rows beats a cursor-based approach. Debezium avoids the problem entirely because the WAL reflects commit order, not allocation order.


Failure modes worth rehearsing

Build the outbox and you will eventually hit these. Better to know them before they hit you in production.

Kafka is down for an extended period. The outbox table grows. Unpublished rows accumulate. The system is degraded but correct: no events are lost, they are just delayed. Your alert should fire on unpublished row count or oldest-unpublished-row age, not on Kafka availability directly. When Kafka recovers, the relay drains the backlog. If the backlog is large, the drain produces a burst of events; make sure consumers can absorb it or rate-limit the relay.

A poison message. One outbox row has a payload that always fails to publish, maybe it exceeds the broker's max message size. The relay retries it forever and blocks every row behind it. This is why attempts exists. After a threshold, route the row aside and skip it so the queue keeps draining.

UPDATE outbox
SET attempts = attempts + 1
WHERE id = ?;

-- a row with attempts > 10 is excluded from the normal poll
SELECT id, payload FROM outbox
WHERE published_at IS NULL AND attempts <= 10
ORDER BY id
LIMIT 500
FOR UPDATE SKIP LOCKED;

A row with high attempts is excluded from the normal poll and surfaced for human inspection. Do not let one bad message strangle the pipeline.

The replication slot fills the disk. CDC-specific. The Debezium connector dies, nobody notices, the replication slot holds WAL, and the data directory fills until Postgres refuses writes, which takes down the entire application. Monitor pg_replication_slots for confirmed_flush_lsn falling behind, alert early, and set max_slot_wal_keep_size so Postgres invalidates a runaway slot rather than filling the disk. An invalidated slot drops events, which is bad, but a full disk takes down the database, which is worse. Pick your failure.

Duplicate delivery on relay restart. The relay publishes a row, the process dies before the UPDATE published_at commits, it restarts and republishes. The consumer's idempotency check absorbs this. This is the expected, designed-for case, not a bug. If your consumers are not idempotent, this is where you get burned, and it will happen on the first deploy that restarts the relay mid-batch.


What the contract actually buys you

The outbox pattern does not make your system exactly-once. Nothing makes a distributed system exactly-once. What it buys you is a clean separation of guarantees: the database transaction guarantees that the business state and the intent to publish are atomic, the relay guarantees at-least-once delivery to Kafka, and consumer idempotency guarantees that at-least-once does not cause double-processing. Each layer has one job and a guarantee it can actually keep.

The failure we started with, a transaction that committed without its event, becomes impossible. Not unlikely. Impossible, in the sense that the event's existence is welded to the transaction's existence by Postgres' commit. The cost is a small delivery delay and an extra table to operate. For anything that moves money, that is the cheapest insurance you will ever buy.

The broader lesson is the one underneath every reliable distributed system: do not try to make two systems agree by writing to both. Write to one, and let that one system's durability carry the truth to the others. The outbox is just the most useful instance of that idea.