- Published on
From Cron Jobs to Kafka: Scaling the Reconciliation Engine Behind Neo for Business
- Authors

- Name
- Anil Jaiswal
- @anil_jaiswal
Reconciliation is the part of fintech nobody talks about at conferences. There are no launch announcements for it, no design awards, no viral demos. It is, at its core, a bookkeeping problem: you have a record of every rupee that moved on your side, and the bank has its own record of the same rupees, and these two records must agree exactly. Every day. For every customer. With a full audit trail.
At Open.money, my team built and shipped Neo for Business (a banking and payments product for MSMEs, built on Axis Bank's core banking infrastructure). The reconciliation engine went through three distinct architectures under my watch, the last of which is running in production today at meaningful scale. The first one was a nightly cron job, wired into Laravel's task scheduler. It matched roughly 500 transactions a day, took three minutes, and nobody thought about it.
Over the following months, transaction volume grew toward targets that would make that scheduler entry arithmetically impossible. What changed wasn't just how many records we matched. It was how we modeled the problem. The core insight, arrived at through two intermediate architectures: reconciliation is a streaming join over time wearing a batch-job costume. The real adversary is time, not throughput.
The actual problem
A payment on Neo for Business generates two records of the same money. First, our internal ledger entry: the moment a merchant initiates a UPI or NEFT transfer, we write a row to our ledger marking the transaction as pending. That row carries a UTR (Unique Transaction Reference for UPI) or RRN (Reference Retrieval Number for card/netbanking transactions). Second, the bank's authoritative confirmation: Axis Bank's core banking system, a payment gateway callback, or a NACH settlement file from NPCI eventually produces its own record of the same event.
These two records arrive at different times. The internal entry arrives at T+0 seconds. The bank confirmation might arrive at T+30 seconds (UPI webhook callback), T+4 hours (intraday NEFT), or T+1 day (NACH settlement batch from NPCI). The reconciliation engine's job is to join these two records on their shared key and classify the result:
- Matched: both sides agree on amount, status, reference. Done.
- Timing break: one side arrived, the other hasn't yet. Not an error; the counterpart is still in transit. Do not alert.
- True break: both sides arrived but something is wrong: amount mismatch, duplicate settlement, missing credit. Requires human review and an immutable audit trail.
Banking-grade correctness means: never emit a true break that is actually a timing break, never miss a true break, never double-count a match, never lose an audit event. Those four constraints turned out to be the hardest part of the whole system, and each of the three architectures failed at a different one of them.
Architecture 1: Nightly cron
The first implementation was right for the volume it was built for. Laravel's task scheduler fired a single Artisan command at 2 AM, after the day's NACH settlement files had landed on SFTP.
The command was registered in App\Console\Kernel:
// app/Console/Kernel.php
protected function schedule(Schedule $schedule): void
{
$schedule->command('recon:run')
->dailyAt('02:00')
->withoutOverlapping()
->runInBackground()
->emailOutputOnFailure(config('recon.alert_email'));
}
The command itself was thin: just argument parsing, logging, and delegating to a service:
// app/Console/Commands/RunNightlyRecon.php
class RunNightlyRecon extends Command
{
protected $signature = 'recon:run {date? : Date to reconcile (Y-m-d), defaults to yesterday}';
protected $description = 'Run nightly reconciliation for a given date';
public function __construct(private ReconciliationService $recon)
{
parent::__construct();
}
public function handle(): int
{
$date = Carbon::parse($this->argument('date') ?? 'yesterday');
$this->info("Starting recon for {$date->toDateString()}");
try {
$result = $this->recon->reconcileDate($date);
$this->info("Done. Matched: {$result->matched}, Breaks: {$result->breaks}");
return Command::SUCCESS;
} catch (\Throwable $e) {
Log::error('Recon failed', ['date' => $date->toDateString(), 'error' => $e->getMessage()]);
$this->error($e->getMessage());
return Command::FAILURE;
}
}
}
ReconciliationService was bound as a singleton in a dedicated provider and injected via Laravel's DI container:
// app/Providers/ReconciliationServiceProvider.php
class ReconciliationServiceProvider extends ServiceProvider
{
public function register(): void
{
$this->app->singleton(ReconciliationService::class);
$this->app->singleton(SettlementFileParser::class);
$this->app->singleton(ReconResultRepository::class);
}
}
The service loaded both sides of the ledger into memory, joined them on UTR, and wrote results back:
// app/Services/ReconciliationService.php
class ReconciliationService
{
public function __construct(
private LedgerRepository $ledger,
private SettlementFileParser $parser,
private ReconResultRepository $results,
) {}
public function reconcileDate(Carbon $date): ReconResult
{
$internal = $this->ledger->getEntriesForDate($date); // Eloquent collection
$external = $this->parser->parseForDate($date); // SFTP files + callbacks
$internalIndex = $internal->keyBy('utr');
$externalIndex = $external->keyBy('utr');
$matches = collect();
$breaks = collect();
foreach ($internalIndex as $utr => $entry) {
if ($externalIndex->has($utr)) {
$matches->push($this->reconcile($entry, $externalIndex->get($utr)));
} else {
$breaks->push([
'utr' => $utr,
'type' => BreakType::MISSING_EXTERNAL,
'entry' => $entry,
'date' => $date,
]);
}
}
foreach ($externalIndex as $utr => $entry) {
if (!$internalIndex->has($utr)) {
$breaks->push([
'utr' => $utr,
'type' => BreakType::MISSING_INTERNAL,
'entry' => $entry,
'date' => $date,
]);
}
}
return $this->results->write($matches, $breaks, $date);
}
}
At 500 transactions/day this ran in under three minutes. The failure model was simple: the command either succeeded or failed with a non-zero exit code, which the scheduler caught and emailed. Rerunning it was safe because ReconResultRepository::write truncated the previous run's output for that date before inserting new results.
All unmatched entries were treated as true breaks at classification time, which was wrong in a specific way: NEFT transactions settled at T+1 would always appear as breaks in the nightly output, until the next morning's run confirmed them. We compensated by building a T+1 review window into the ops workflow. Operators knew to hold break alerts for 24 hours. That knowledge lived in people's heads, not in the system.
The job started showing cracks around 5,000 transactions/day with roughly 30 merchants on the platform:
Window overrun. SFTP files from certain PSPs sometimes arrived at 3 AM instead of 1 AM. The dailyAt('02:00') job would start with an empty external side, match nothing, write every internal entry as a break, and exit with Command::SUCCESS. Ops would wake up to thousands of false alerts.
All-or-nothing restart. A network blip to the bank SFTP mid-job meant starting from scratch. At 3 minutes per run this was tolerable. At 15 minutes it was painful. withoutOverlapping() protected against concurrent runs but did nothing for failed ones: a retry kicked off a full re-execution from transaction zero.
Memory ceiling. Loading two sides of 50,000 enriched transaction records as Laravel collections caused PHP memory limit exhaustion on the job runner. We could raise memory_limit. We could chunk the Eloquent queries. We started looking for the exit instead.
Recon lag. Merchants on Neo for Business couldn't see confirmed transaction statuses until 9 AM because the scheduler hadn't run. For a product selling itself as real-time business banking, T+9 hours recon lag was a brand problem before it was an engineering problem.
We could have patched the scheduler: file-wait logic, chunked DB reads, per-record retry. We started down that path and stopped. The correct diagnosis was that the lag was structural. No amount of tuning a dailyAt() entry produces near-real-time recon.
Architecture 2: Sharded parallel batch
The intermediate step bought throughput and killed the all-or-nothing failure mode. We partitioned the daily work into shards by account_id_hash % N and dispatched each shard as an independent Laravel Queue job.
A coordinator command built the shards and enqueued them:
// app/Console/Commands/RunShardedRecon.php
class RunShardedRecon extends Command
{
protected $signature = 'recon:run-sharded {date?} {--shards=8}';
public function __construct(private ShardingService $sharding)
{
parent::__construct();
}
public function handle(): int
{
$date = Carbon::parse($this->argument('date') ?? 'yesterday');
$shards = (int) $this->option('shards');
$this->sharding->buildShards($date, $shards)->each(
fn ($shard) => ProcessReconShard::dispatch($shard->id, $date)->onQueue('recon')
);
$this->info("Dispatched {$shards} shards for {$date->toDateString()}");
return Command::SUCCESS;
}
}
Each shard ran as a ShouldQueue job with its own retry budget:
// app/Jobs/ProcessReconShard.php
class ProcessReconShard implements ShouldQueue
{
use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;
public int $tries = 3;
public int $backoff = 60;
public function __construct(
public readonly string $shardId,
public readonly Carbon $date,
) {}
public function handle(ShardedReconciliationService $recon): void
{
if ($recon->isComplete($this->shardId, $this->date)) {
return; // idempotent skip, safe to re-enqueue
}
$result = $recon->processShard($this->shardId, $this->date);
$recon->markComplete($this->shardId, $this->date, $result);
}
public function failed(\Throwable $e): void
{
Log::error('Recon shard failed', [
'shard' => $this->shardId,
'date' => $this->date->toDateString(),
'error' => $e->getMessage(),
]);
}
}
isComplete / markComplete stored shard state in Redis. An SFTP parse failure on one bank's file killed one shard job; the other seven completed and were not re-run when the failed shard retried. Eight queue workers in parallel took us from 15-minute monolithic runs to 4-minute parallel runs at 20,000 transactions/day.
But the structural problems didn't go away. They changed shape.
Recon lag was still architectural. The fastest we could run was limited by when the last settlement file arrived. The $schedule->command('recon:run-sharded')->dailyAt('02:30') entry set the lag floor. No matter how fast each job was, we were processing yesterday's data.
Hot shards. A handful of large merchants ran hundreds of transactions daily. Sharding by account prefix concentrated those accounts on the same shard consistently, making it run four times longer than the others. We tried re-sharding by estimated transaction count, but the distribution shifted weekly as merchants scaled. Chasing hot shards is a game you don't win.
Orchestration debt. Managing eight concurrent jobs with per-shard Redis state, failure tracking, and a completion gate was complexity that belonged in a proper job scheduler. We were reinventing a stripped-down version of Horizon's queue management inside our application layer, badly, and the reinvention was brittle.
This architecture was right for the 10,000–30,000 transactions/day range. When we projected the growth curve toward 3 lakh MSMEs on the platform, it was clear the schedule-driven model had an expiry date. The problem wasn't shard count. It was the assumption that recon is a thing you do to yesterday's data.
Architecture 3: Event-driven Kafka
The reframe was simple to state, harder to implement correctly. Every ledger entry and every external confirmation is an event. Match them as they arrive. The recon state is the set of events that haven't yet found their counterpart.
Topic and partition design
# 24 partitions per topic: enough headroom for 24 parallel consumer instances
# before hitting partition limits, with room to grow without downtime
kafka-topics.sh --create \
--bootstrap-server kafka:9092 \
--topic recon.ledger-entries \
--partitions 24 \
--replication-factor 3 \
--config retention.ms=604800000 \
--config min.insync.replicas=2
kafka-topics.sh --create \
--bootstrap-server kafka:9092 \
--topic recon.bank-callbacks \
--partitions 24 \
--replication-factor 3 \
--config retention.ms=604800000 \
--config min.insync.replicas=2
kafka-topics.sh --create \
--bootstrap-server kafka:9092 \
--topic recon.settlement-lines \
--partitions 24 \
--replication-factor 3 \
--config retention.ms=604800000 \
--config min.insync.replicas=2
kafka-topics.sh --create \
--bootstrap-server kafka:9092 \
--topic recon.matched \
--partitions 24 \
--replication-factor 3
kafka-topics.sh --create \
--bootstrap-server kafka:9092 \
--topic recon.breaks \
--partitions 24 \
--replication-factor 3
The most important decision in this entire design: the partition key is the reconciliation key (UTR or RRN), not the merchant ID, not the date, not the source system.
Every producer (the LedgerService writing internal entries, the callback controller writing Axis Bank confirmations, the settlement file parser writing NPCI lines) must produce with key=utr. This guarantees both records for the same money land on the same Kafka partition and are processed by the same consumer instance. Without this, the matching join requires a cross-consumer coordination layer, and you've rebuilt the hot-shard problem in streaming clothing.
Laravel Kafka configuration
We installed the rdkafka PHP extension and wired Kafka into Laravel's service container through a dedicated provider:
// config/kafka.php
return [
'brokers' => env('KAFKA_BROKERS', 'kafka:9092'),
'security_protocol' => env('KAFKA_SECURITY_PROTOCOL', 'SASL_SSL'),
'sasl_mechanism' => env('KAFKA_SASL_MECHANISM', 'SCRAM-SHA-512'),
'sasl_username' => env('KAFKA_USERNAME'),
'sasl_password' => env('KAFKA_PASSWORD'),
'producer' => [
'acks' => 'all', // wait for all ISR acks before confirming
'enable.idempotence' => 'true', // exactly-once within Kafka topology
'retries' => '5',
'linger.ms' => '5',
],
'recon' => [
'consumer_group' => env('RECON_CONSUMER_GROUP', 'recon-engine-v3'),
'pending_ttl_hours' => env('RECON_PENDING_TTL_HOURS', 96),
'topics' => [
'ledger_entries' => 'recon.ledger-entries',
'bank_callbacks' => 'recon.bank-callbacks',
'settlement_lines' => 'recon.settlement-lines',
'matched' => 'recon.matched',
'breaks' => 'recon.breaks',
],
],
];
// app/Providers/ReconKafkaServiceProvider.php
class ReconKafkaServiceProvider extends ServiceProvider
{
public function register(): void
{
$this->app->singleton(KafkaProducerService::class, function () {
return new KafkaProducerService(
brokers: config('kafka.brokers'),
options: array_merge(config('kafka.producer'), [
'security.protocol' => config('kafka.security_protocol'),
'sasl.mechanisms' => config('kafka.sasl_mechanism'),
'sasl.username' => config('kafka.sasl_username'),
'sasl.password' => config('kafka.sasl_password'),
]),
);
});
$this->app->singleton(ReconMatchingService::class, function ($app) {
return new ReconMatchingService(
redis: $app->make('redis')->connection('recon'),
producer: $app->make(KafkaProducerService::class),
pendingTtl: (int) config('kafka.recon.pending_ttl_hours') * 3600,
topics: config('kafka.recon.topics'),
);
});
$this->app->singleton(BreakEmissionService::class, function ($app) {
return new BreakEmissionService(
producer: $app->make(KafkaProducerService::class),
topics: config('kafka.recon.topics'),
);
});
}
}
The producer
Any service that writes a ledger entry or processes a bank callback must produce with key=utr. We wrapped rdkafka's Producer in a singleton service so the DI container could inject it wherever it was needed:
// app/Services/KafkaProducerService.php
class KafkaProducerService
{
private \RdKafka\Producer $producer;
public function __construct(string $brokers, array $options = [])
{
$conf = new \RdKafka\Conf();
$conf->set('bootstrap.servers', $brokers);
foreach ($options as $key => $value) {
$conf->set($key, (string) $value);
}
$this->producer = new \RdKafka\Producer($conf);
}
public function produce(string $topicName, string $key, array $payload): void
{
$topic = $this->producer->newTopic($topicName);
$topic->produce(RD_KAFKA_PARTITION_UA, 0, json_encode($payload), $key);
$this->producer->flush(5000);
}
}
A typical call from LedgerService, where the UTR is set as the message key:
// app/Services/LedgerService.php (excerpt)
$this->kafka->produce(
config('kafka.recon.topics.ledger_entries'),
$transaction->utr, // partition key: must match what the bank side will use
[
'utr' => $transaction->utr,
'amount' => $transaction->amount,
'status' => $transaction->status,
'source' => 'INTERNAL',
'created_at' => $transaction->created_at->toIso8601String(),
]
);
The consumer: Artisan command + rdkafka loop
Laravel doesn't have a built-in Kafka consumer abstraction the way it has queue workers, so the consumer runs as a long-lived Artisan command. It subscribes to all three input topics, processes each message through the injected ReconMatchingService, and commits the offset manually, only after all side effects have succeeded.
// app/Console/Commands/ConsumeReconEvents.php
class ConsumeReconEvents extends Command
{
protected $signature = 'recon:consume';
protected $description = 'Start the Kafka consumer for the reconciliation engine (long-lived process)';
public function __construct(private ReconMatchingService $matching)
{
parent::__construct();
}
public function handle(): void
{
$conf = new \RdKafka\Conf();
$conf->set('bootstrap.servers', config('kafka.brokers'));
$conf->set('security.protocol', config('kafka.security_protocol'));
$conf->set('sasl.mechanisms', config('kafka.sasl_mechanism'));
$conf->set('sasl.username', config('kafka.sasl_username'));
$conf->set('sasl.password', config('kafka.sasl_password'));
$conf->set('group.id', config('kafka.recon.consumer_group'));
$conf->set('auto.offset.reset', 'earliest');
$conf->set('enable.auto.commit', 'false'); // manual commit only
$conf->set('isolation.level', 'read_committed');
$conf->set('max.poll.interval.ms', '300000');
$conf->set('session.timeout.ms', '45000');
$consumer = new \RdKafka\KafkaConsumer($conf);
$consumer->subscribe([
config('kafka.recon.topics.ledger_entries'),
config('kafka.recon.topics.bank_callbacks'),
config('kafka.recon.topics.settlement_lines'),
]);
$this->info('Recon consumer started.');
while (true) {
$message = $consumer->consume(10_000);
if ($message === null || $message->err === RD_KAFKA_RESP_ERR__TIMED_OUT) {
continue;
}
if ($message->err !== RD_KAFKA_RESP_ERR_NO_ERROR) {
Log::error('Kafka error', ['error' => $message->errstr()]);
continue;
}
try {
$this->matching->handle(
utr: $message->key,
payload: json_decode($message->payload, true),
);
$consumer->commit($message); // commit only after Redis write + produce succeed
} catch (\Throwable $e) {
Log::error('Recon message failed', [
'utr' => $message->key,
'error' => $e->getMessage(),
]);
// Do not commit: message will be redelivered on restart
}
}
}
}
Pending-state store
ReconMatchingService holds the matching logic. When one side of a transaction arrives before its counterpart, it parks in Redis with a 96-hour TTL:
// app/Services/ReconMatchingService.php
class ReconMatchingService
{
public function __construct(
private \Illuminate\Redis\Connections\Connection $redis,
private KafkaProducerService $producer,
private int $pendingTtl,
private array $topics,
) {}
public function handle(string $utr, array $payload): void
{
$source = $payload['source']; // 'INTERNAL' | 'EXTERNAL'
$opposite = $source === 'INTERNAL' ? 'EXTERNAL' : 'INTERNAL';
$matchedKey = "recon:matched:{$utr}";
$pendingKey = "recon:pending:{$opposite}:{$utr}";
$ownKey = "recon:pending:{$source}:{$utr}";
if ($this->redis->exists($matchedKey)) {
return; // idempotency gate: duplicate event, skip
}
$counterpart = $this->redis->get($pendingKey);
if ($counterpart !== null) {
$match = $this->buildMatch($payload, json_decode($counterpart, true));
// Atomic pipeline: write match marker, remove both pending entries
$this->redis->pipeline(function ($pipe) use ($matchedKey, $pendingKey, $ownKey, $match) {
$pipe->setex($matchedKey, $this->pendingTtl, json_encode($match));
$pipe->del($pendingKey);
$pipe->del($ownKey);
});
$this->producer->produce($this->topics['matched'], $utr, $match);
} else {
// Park this side and wait for the counterpart
$this->redis->setex($ownKey, $this->pendingTtl, json_encode($payload));
}
}
}
The TTL is the windowing strategy. A pending entry that ages past 96 hours without a counterpart becomes a break candidate, but not automatically a true break. A separate sweeper Artisan command (some things genuinely are time-triggered) scans aged pending entries and classifies them: if the settlement file for that period and bank is confirmed received and parsed, it's a true break; if the file hasn't arrived yet, the entry stays pending regardless of age.
Break resolution workflow
Breaks are emitted as events and feed a resolution service:
// app/Services/BreakEmissionService.php
class BreakEmissionService
{
public function __construct(
private KafkaProducerService $producer,
private array $topics,
) {}
public function emit(string $utr, array $availableSide, BreakType $breakType): void
{
$this->producer->produce($this->topics['breaks'], $utr, [
'id' => (string) Str::uuid(),
'utr' => $utr,
'break_type' => $breakType->value,
'available_side' => $availableSide,
'detected_at' => now()->toIso8601String(),
'status' => 'OPEN',
'audit_trail' => [[
'event' => 'BREAK_DETECTED',
'at' => now()->toIso8601String(),
'by' => 'recon-engine',
]],
]);
}
}
When ops resolves a break (manually matching a transaction with a mis-keyed UTR, confirming a bank credit that arrived on a delayed file, flagging a true duplicate), the resolution is appended to audit_trail and the break status transitions to RESOLVED. Every state transition is replayable from Kafka's retained log.
Running consumers in production: Supervisor
php artisan recon:consume is a long-lived process, not a one-shot command. We use Supervisor to manage eight consumer instances, one per active partition group, so that if a process crashes or is OOM-killed, it restarts automatically without manual intervention:
; /etc/supervisor/conf.d/recon-consumer.conf
[program:recon-consumer]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/html/artisan recon:consume
directory=/var/www/html
user=www-data
numprocs=8
autostart=true
autorestart=true
stopasgroup=true
killasgroup=true
stopwaitsecs=3600
redirect_stderr=true
stdout_logfile=/var/log/supervisor/recon-consumer.log
stdout_logfile_maxbytes=50MB
stdout_logfile_backups=5
stopwaitsecs=3600 gives an in-flight message up to an hour to finish before Supervisor sends SIGKILL on a graceful shutdown. Kafka's session.timeout.ms=45000 gives the broker 45 seconds to detect a dead consumer and trigger a rebalance before the other seven instances absorb its partitions.
With 8 consumers across 24 partitions, recon lag dropped from T+8 hours to under 5 minutes for UPI and card transactions. NACH transactions still carried T+1 lag because that's when the bank file arrives. The system is no longer the constraint.
The hard parts
Partition key is an architecture decision
I've seen teams partition their recon topics by merchant ID. It feels natural: merchant-level isolation, easy to reason about shard distribution. It's wrong.
If you partition by merchant ID, every ledger entry for merchant M goes to partition 7. But bank callbacks contain a UTR and an amount, and they don't always carry a reliable merchant reference in the payload. A webhook from Axis Bank's payment gateway carries whatever the gateway decides to key on, which may be a batch sequence number with no relationship to your merchant ID scheme. Now your two sides land on different partitions, consumed by different consumer instances. The join is broken. Your options are a cross-consumer shared state store (kills throughput, adds coordination failure modes) or a re-keying router stage (adds latency and a new single point of failure).
The reconciliation key (UTR for UPI, RRN for card/netbanking) is the only correct partition key. This is a contract between LedgerService, the webhook controller, and the settlement file parser. Violating it produces a failure mode that looks like mysteriously low match rates with no errors anywhere in the logs, not a crash.
Exactly-once is at-least-once plus discipline
Kafka's exactly-once semantics (EOS) work within the Kafka topology: idempotent producers paired with transactional produce-consume cycles give you a strong guarantee across Kafka topics. The moment you write to Redis as part of the same processing step, you're outside Kafka's transaction boundary, and exactly-once becomes your responsibility to implement.
Our processing loop has three writes: Redis state store, the matched/break output topic, and the offset commit. These cannot be made atomic across all three systems simultaneously. The practical options are:
- At-most-once: commit offset first, write to Redis second. Events can be lost on crash. Unacceptable for money.
- At-least-once: write to Redis first, commit offset last. Events can be processed twice on crash, but never lost.
- Effectively-once: at-least-once plus idempotent processing. The
recon:matched:{utr}key in Redis is the idempotency gate. A duplicate event hits theexistscheck and returns without side effects.
We built the third. The edge case that still bites: if Redis is unavailable when we try to write the matched marker, we cannot safely commit the offset. Our fallback is to not commit: the consumer throws, Supervisor restarts the process, and the message is redelivered. During Redis downtime, recon stalls but does not corrupt. We'd rather stall than double-match a transaction.
Window tuning is where the hard bugs lived
The initial 72-hour pending-entry TTL felt generous. It wasn't.
Bank settlement files from certain PSPs arrived on Monday for Friday–Saturday–Sunday transactions. That's T+2 for Friday, T+1 for Saturday, T+0 for Sunday, all in the same file. A 72-hour window barely covered Friday's transactions, and only if the Monday file arrived on time. One Monday, a PSP's file arrived at T+74 hours due to a processing delay on their end. We had a batch of Friday transactions that had already aged past TTL, been swept as breaks by the sweeper, emitted to recon.breaks, and in several cases been reviewed and marked RESOLVED by ops, who reasonably assumed they were true breaks given the age.
The file arrived. The consumer tried to match those UTRs. Found no pending entries in Redis (TTL expired). Re-emitted them to recon.breaks as new breaks. Ops had duplicate break tickets for the same transactions.
The fix was two-part: extend the TTL to 96 hours (the empirical worst case we'd observed, plus two hours of margin), and add a settlement file ledger. Every bank's settlement file gets logged when it's received and parsed. The sweeper only graduates a pending entry to a true break if the corresponding bank's settlement file for that period is confirmed received. No file for that bank on that date means the entry stays pending regardless of age.
The lesson: your TTL is implicitly an SLA on how long you'll wait before calling something a true break. Derive it from real file arrival latency at the p99 level, not from intuition. We should have instrumented this from day one. Instead we guessed 72 hours, paid for it, and fixed it after the incident.
The production moment
During the migration from Architecture 2 to Architecture 3, we ran both systems in parallel for three weeks: the sharded batch job as the system of record, the Kafka engine in shadow mode producing to a separate set of topics not connected to any downstream service.
On day eleven of the shadow run, the outputs diverged by 47 breaks. The batch job said 47 true breaks. The Kafka engine said 23 of them were matches.
We traced the 23 discrepancies. All of them were transactions where the external settlement file had arrived during the business day, after the previous night's batch had already run and classified those entries as breaks, but before midnight when the next batch would run. The Kafka engine, running continuously, had matched them when the file arrived at 3 PM. The batch job wouldn't see those matches until the following morning.
What looked like a divergence was actually the Kafka engine being right and the batch being wrong. Those 47 "true breaks" that ops reviewed every morning included nearly 50% that were actually timing breaks. Ops had built an informal rule: "ignore the UPI breaks and check again tomorrow." That rule was compensating for a broken classification at the system level, and we hadn't known it.
We cut the Kafka engine over to be the system of record on day sixteen instead of day twenty-one. Ops break queue dropped by 40% overnight, not because we fixed anything, but because we stopped generating phantom breaks from a schedule-driven system trying to approximate a real-time problem.
Lessons
Cron was correct for 10 customers. If we'd built Kafka on day one, we'd have spent three months on infrastructure for a system processing 500 transactions a day. Laravel's dailyAt() entry was the right call. The error would have been staying on it past its useful life.
Change the model, not just the machinery. The move from batch to streaming required reframing the problem: recon isn't "process yesterday's data," it's "match events as they arrive." That reframing changed the data model (pending-state store instead of day-partitioned tables), the failure model (per-event idempotency instead of per-job retry), and the correctness model (timing breaks as a first-class state, not a deferred true break). You cannot get there by adding more shards to a batch job.
Partition-key choice is architecture. What you partition on sets the ceiling for what joins are possible without cross-consumer coordination. Get it wrong and you rebuild the coordination overhead you were trying to avoid. In a recon system, the reconciliation key is the only correct partition key, and every producer in the system must agree on it.
Exactly-once in fintech is at-least-once plus discipline. Kafka EOS covers the Kafka topology. The moment you cross into Redis or a database, you're writing idempotency checks yourself. Make them explicit, test the crash paths, and make sure the idempotency key TTL outlasts your retry window by a safe margin.
The TTL is an implicit SLA. Your pending-entry window determines when a timing break becomes a true break. Set it from empirical settlement latency data, p99 or higher, and build explicit logic to handle the case where the bank file hasn't arrived rather than letting age alone make the classification decision.
Reconciliation is still unglamorous. It runs in the background while the product team demos the merchant dashboard. But it is the thing that makes everything else trustworthy. A payments product that moves money correctly but cannot prove it moved money correctly is not actually correct. The audit trail and the match rate are the proof. Getting that right, through three architectures, one production miss, and a lot of time-window debugging, is the work.