“We Have Exactly-Once Delivery” The Lie Your Architecture Is Telling You Inside one component Single DB transaction · one partition · one process ACID / EOS scope you actually paid for This is where “exactly-once” is not a fantasy Across a boundary HTTP · queue · another DB · SMTP · WebSocket… A B Ambiguous timeouts → retries → duplicates Design for at-least-once + idempotency Every cross-component hop is a separate guarantee — name it in your design doc

There’s a sentence that shows up in design docs, onboarding wikis, and Slack threads more than almost any other:

“Don’t worry, our message queue guarantees exactly-once delivery.”

It feels good to write. It closes the conversation. No more “what if the message fires twice?” No more “what if it gets lost?” Exactly-once. Done. Ship it.

Here’s the problem: it is almost always a lie — not because the queue is broken, but because exactly-once only survives inside a single component. The moment your flow crosses a component boundary, you have entered at-least-once territory at best, and you probably don’t know it yet.

This post is about that boundary. Where the myth lives. Where it dies. Every shape it takes in production. And what systems that actually work do instead.

Part 1 — The Myth Zoo: Every Shape This Lie Takes in Production

Before we go deep on mechanism, let’s name every version of this myth. Because engineers repeat the same lie in a dozen different costumes.

Myth 1: “We use Kafka — it has exactly-once semantics”

What engineers hear: Kafka guarantees exactly-once. We use Kafka. Therefore our system has exactly-once.

What’s actually true: Kafka EOS (exactly-once semantics) applies to one specific scope — reading from a Kafka topic and writing to another Kafka topic using the transactional API. Kafka-to-Kafka. That’s it.

The moment your consumer does anything else — writes to Postgres, calls an HTTP endpoint, pushes to Redis, increments a counter in ClickHouse — you are outside Kafka’s EOS scope. Kafka cannot make promises on behalf of your database. Your database did not sign the EOS contract.

Where it breaks: Consumer reads message, writes to DB (success), crashes before acking offset. Kafka re-delivers. DB gets duplicate write. Kafka did its job perfectly. Your system has a duplicate row.

Myth 2: “We acknowledge only after processing — so we’re safe”

What engineers hear: Ack-after-process means no message is lost. Therefore we have exactly-once.

What’s actually true: Ack-after-process gives you at-least-once, not exactly-once. The difference is subtle but brutal.

Ack-after-process means: “I will not mark this message done until I’ve processed it.” That’s the no-loss guarantee. But it says nothing about what happens when your processing succeeds and then your ack fails. The broker re-delivers. You process again. Now you’ve processed twice.

Ack-after-process + idempotent consumer = effectively exactly-once. But engineers often implement the first half and skip the second.

Where it breaks: Payment service. chargeCard() succeeds. ack() times out. Broker re-delivers. chargeCard() runs again. Customer charged twice. Team discovers this at 2am via a support ticket.

Myth 3: “We have a unique constraint in the DB — duplicates are impossible”

What engineers hear: The DB rejects duplicates. So the system is safe.

What’s actually true: The unique constraint prevents duplicate rows in the DB. It does not prevent the downstream effects of the failed duplicate write from causing problems — and it definitely does not prevent the upstream system from retrying after seeing the constraint error.

Where it breaks:

  • Consumer 1 writes order row (success)
  • Consumer 1 acks (fails — network blip)
  • Broker re-delivers to Consumer 2
  • Consumer 2 tries to write order row → unique constraint violation → exception thrown
  • Consumer 2’s exception handler… does what? If it logs and moves on: fine. If it sends an error notification to the user: user gets “order failed” after their order actually succeeded. If it triggers a retry loop: you have an infinite loop hammering your DB.

The constraint is necessary. It is not sufficient. The error path after the constraint violation is where the real bugs live.

Myth 4: “We use SQS FIFO — it deduplicates for us”

What engineers hear: SQS FIFO has a 5-minute deduplication window. Duplicates are handled.

What’s actually true: SQS FIFO deduplicates within SQS, within a 5-minute window, based on a MessageDeduplicationId that you provide. This prevents SQS from delivering the same message twice to your consumer.

What it does not prevent:

  • Your consumer processing the message and then crashing before deleting it from the queue. SQS re-delivers after visibility timeout (default: 30 seconds). If your processing was idempotent: fine. If not: you’ve processed twice.
  • Your consumer deleting from SQS and then crashing before writing to DB. Message is gone from SQS. DB has nothing. At-most-once.
  • Two consumers racing on the same message within the visibility timeout (rare but possible under high load).

The deduplication window is also not infinite. If you have a batch job that runs every 6 minutes and sends the same event twice in 6-minute intervals, SQS FIFO will not catch it.

Myth 5: “We use database transactions — so our writes are atomic”

What engineers hear: We wrap everything in a transaction. It’s all-or-nothing. So we have exactly-once.

What’s actually true: Database transactions give you atomicity within the database. ACID is real and beautiful — but it only covers what the database controls.

The lie appears when engineers wrap two operations in a transaction and one of them touches an external system:

await db.transaction(async (trx) => {
  await trx("orders").insert(order);        // DB — inside transaction ✓
  await sendEmail(user.email, "confirmed"); // SMTP server — NOT inside transaction ✗
});

If sendEmail fails, the transaction rolls back. Order is not in DB. No email sent. Fine.

If sendEmail succeeds and then the transaction fails (DB write conflicts, constraint violation, connection drop): order is not in DB. Email is already sent. You cannot un-send an email. SMTP has no rollback.

The transaction gave you atomicity for the DB operations. It had zero control over the SMTP call that happened inside its scope.

Where it breaks: User gets “Order Confirmed” email. They click the link. Order doesn’t exist. Support ticket. Angry user. Post-mortem.

Myth 6: “Our microservice retries with exponential backoff — so nothing is ever lost”

What engineers hear: We retry everything. So the system is reliable.

What’s actually true: Retry with exponential backoff prevents message loss (at-least-once). It does not prevent duplication. Every retry is a potential duplicate execution.

Where it breaks: POST /payments/charge with a 30-second timeout. Your HTTP client sends the request. Network is slow. After 30 seconds, the client retries. Meanwhile, the first request arrives at the payment processor, succeeds, and the response is lost in the network. Second request also succeeds. Two charges. One customer.

The fix is idempotency keys — but most teams implement retry without idempotency keys, because retry feels like safety and idempotency keys feel like extra work.

Myth 7: “We use Redis pub/sub — it’s fast and reliable”

What engineers hear: Redis is fast. Pub/sub is real-time. Reliable enough.

What’s actually true: Redis pub/sub is at-most-once by design. If a subscriber is disconnected when a message is published, that message is gone forever. Redis pub/sub has no persistence, no buffering, no replay. Fire and forget is the contract — it just doesn’t say that prominently in the tutorial.

Where it breaks: You publish a notification via Redis pub/sub. The WebSocket server that subscribes to this channel is in the middle of a restart (deploy, crash, OOM kill — all common). The message is published to zero active subscribers. It evaporates. The user never gets their notification.

Redis Streams is a different story — that has persistence, consumer groups, and ack semantics. But engineers often reach for pub/sub because it’s simpler, and then treat it as reliable.

Myth 8: “We store events in Kafka forever — we can always replay”

What engineers hear: Kafka has log compaction. We can replay. So we can fix duplicates later.

What’s actually true: Replay fixes lost messages. It does not fix the downstream side effects of messages that were already processed. If your consumer wrote a row, sent an email, and charged a card — replaying Kafka will run all of those again. Replay without idempotent consumers is a chaos injection tool.

Where it breaks: Incident. Some events were dropped. Team decides to replay from offset X. Replay re-runs 50,000 events. 50,000 emails go out. 12,000 payments are double-charged. This is a real incident pattern. The replay is fine. The consumers weren’t ready.

Myth 9: “We use gRPC with retries — it’s more reliable than HTTP”

What engineers hear: gRPC has built-in retry. It must be better.

What’s actually true: When retries are configured, gRPC-style retry hooks fire on failed responses — not when the server succeeded and the response was dropped on the wire. With retries enabled under those conditions, the client may send again, the server processes again, and you have a duplicate. Your own timeouts and retry loops have the same class of bug regardless of transport.

Important nuance: gRPC does not retry by default — you opt in via service config (or your own client wrapper). When you do configure built-in retries, typically only UNAVAILABLE is retried unless you explicitly widen the code list. So assuming “gRPC retries everything” is wrong twice: nothing retries until you wire it, and the default allowlist is narrow.

Where it breaks: Order service calls inventory service via gRPC. Inventory service decrements stock (success), sends response, response is lost in transit. Order service times out, retries. Inventory decrements again. Stock goes negative. System sells items it doesn’t have.

Myth 10: “We have idempotency — we’re fine”

What engineers hear: We made our endpoint idempotent. Duplicate requests are handled.

What’s actually true: Idempotency is the right idea, but the implementation has sharp edges.

The check-then-act race condition:

-- Consumer A                    -- Consumer B (same message, parallel)
SELECT * FROM processed_events   SELECT * FROM processed_events
WHERE event_id = 'abc';          WHERE event_id = 'abc';
-- not found                     -- not found
-- both passed the idempotency check (no claim row yet)
INSERT INTO orders ...           INSERT INTO orders ...
-- both are now calling chargeCard()
-- the INSERT into orders may conflict (if keyed),
-- but chargeCard() may have already fired twice

Without a concurrent-safe claim on processed_events (e.g. SELECT FOR UPDATE on the check, or an insert-wins race pattern that you can detect), both consumers can clear the same logical gate and both run the side effect.

The TTL expiry problem: You store processed event IDs in Redis with a 24-hour TTL. An event arrives, is processed, ID stored. Same event arrives again 25 hours later (delayed consumer, re-queued message, replay). TTL has expired. Idempotency check passes. Event processed again.

Idempotency is necessary. Idempotency done wrong is a false sense of security.

Myth 11: “Our event sourcing setup means we have full auditability and no duplicates”

What engineers hear: Event sourcing is the gold standard. We’re safe.

What’s actually true: Event sourcing gives you an append-only log of what happened. It does not prevent the same event from being appended twice, and it does not prevent projections (read models) from processing the same event twice during a rebuild.

Where it breaks: Event stream rebuild. You rebuild the OrderSummary projection from scratch. The rebuild consumer is not idempotent. Events 4821 through 4830 are processed twice due to a checkpoint bug. Order totals are wrong. Nobody notices for three days because the UI looks normal.

Myth 12: “We use a webhook with HMAC verification — so only legitimate events come through”

What engineers hear: HMAC means we know the event is real. So we’re safe.

What’s actually true: HMAC verification tells you the event came from who you think it came from. It says nothing about delivery guarantees. Stripe, for example, retries webhooks up to 72 hours if your endpoint returns a non-2xx. Your endpoint may be called multiple times for the same event. HMAC is about authenticity, not exactly-once.

Where it breaks: Your webhook endpoint processes a payment.succeeded event. It’s slow — takes 8 seconds. Stripe’s timeout is 5 seconds. Stripe sees no response, marks it failed, retries. Your handler finishes at second 8 and responds 200 to a connection that’s already gone. Stripe retries. Your handler runs again. Duplicate fulfillment.

Part 2 — The Deep Anatomy of the Two-Component Lie

Now that we’ve named every costume this myth wears, let’s go to first principles. Why does exactly-once die at the boundary between two components? What is physically happening?

The Gap Is Real and Physical

When two components communicate, there is a network between them. Networks are unreliable — not theoretically, but physically and measurably. Packets are dropped. Routers are congested. TCP connections time out. DNS resolution fails. TLS handshakes expire.

This means every cross-component operation has three possible outcomes:

Caller sends request
        │
        ├── Success: response received ✓
        │
        ├── Failure before processing: request never arrived ✗ (lost)
        │
        └── Failure after processing: request processed, response lost ✗ (ambiguous)

The third outcome is the killer. The caller does not know whether the callee processed the request or not. It only knows it didn’t get a response. From the caller’s perspective, the operation failed. From the callee’s perspective, the operation succeeded.

This is sometimes called the “two generals problem” or the “timeout ambiguity”. It is not solvable at the network layer. It is a fundamental property of distributed systems.

The consequence: any retry after a timeout is potentially a duplicate. You cannot retry safely without idempotency on the callee side.

Caller sends a request — three possibilities 1 · Success Response received Caller knows work status ✓ 2 · Lost before work Request never arrived Safe to retry (usually) 3 · Ambiguous Work may have succeeded Response lost in transit Retry ⇒ duplicate risk You cannot retry safely after a timeout without callee-side idempotency

DB + WebSocket: The Canonical Two-Component Failure

Let’s go very deep on the most common pattern: write to a database, then push a notification via WebSocket.

HTTP Request → Handler → DB.write() → WS.push() → HTTP Response

This seems like one operation. It is four network calls (HTTP in, DB write, WS push, HTTP out) and three component boundaries (client→server, server→DB, server→WS hub). Each boundary is a failure surface.

One HTTP handler — several failure surfaces Client HTTP Server SQL Postgres WS WS hub WS Client Crash after DB but before push · push fails · 500 with order already inserted · HTTP 200 lost · client reconnects after push Five different failure modes — zero end-to-end “exactly-once” without explicit design

Failure mode 1: DB write succeeds, server crashes before WS push

[Client] ──POST /order──► [Server]
                              │
                         DB.write() ✓
                              │
                         💥 Server dies
                              │
                        WS.push() never runs

Result: Order exists in DB. User gets no real-time notification. If the client retries the HTTP request: server runs again, DB write hits unique constraint (if you have one), WS push fires. User gets notification on retry. If the client doesn’t retry: order exists, user is confused, they refresh the page, they see the order (because it’s in DB), they never knew the real-time push failed. Probably fine in practice — but now your “real-time” feature is actually “eventual.”

Failure mode 2: DB write succeeds, WS push fails

[Client] ──POST /order──► [Server]
                              │
                         DB.write() ✓
                              │
                         WS.push() ✗ (connection dropped, hub unreachable)
                              │
                         Server returns 500 to client

Result: Order in DB. Client sees 500 error. Client retries. DB write hits unique constraint — exception. Now your handler is in an error state for a request that partially succeeded. Does your error response to the client say “order failed”? It didn’t fail. It succeeded — partially.

Failure mode 3: WS push succeeds, DB write fails

If you push the WebSocket notification before the DB write (wrong order, but some engineers do it for latency reasons):

[Client] ──POST /order──► [Server]
                              │
                         WS.push() ✓
                              │
                         DB.write() ✗ (constraint, timeout, deadlock)
                              │
                         Server returns 500

Result: User sees “Order Confirmed!” notification on screen. Order does not exist in DB. Client retries. DB write succeeds. User now has an order that they were told about before it existed. If they clicked the link in the notification before the retry, they got a 404.

Failure mode 4: Everything succeeds, client doesn’t receive the HTTP response

[Client] ──POST /order──► [Server]
                              │
                         DB.write() ✓
                              │
                         WS.push() ✓
                              │
                         HTTP 200 ──✗── [Client] (response lost)

Result: Order in DB. WS push delivered. Client didn’t get the HTTP 200. Client’s timeout fires. Client retries. Everything runs again. If not idempotent: two orders, two WS pushes.

Failure mode 5: Client receives HTTP 200, WS message arrives out-of-order

[Client] ──POST /order──► [Server]
                              │
                         DB.write() ✓
                              │
                         WS.push() → [WS Hub] → [Client: not connected yet]
                              │
                         HTTP 200 ──► [Client]
                         Client now connects WebSocket
                         WS push was already fired and missed

Result: Order confirmed. HTTP response received. WS notification was already fired to a client that wasn’t listening yet. Client connected to WebSocket after the push. No notification ever received.

Five failure modes. One handler. Zero exactly-once.

DB + Kafka producer: The Backend-to-Backend Version

Same anatomy, different components. Your service writes to DB and produces to Kafka.

Handler → DB.write() → kafka.produce() → ack

Failure mode 1: DB write succeeds, produce fails — Handler returns error. Upstream retries. DB write hits constraint. If you swallow the constraint error and just retry the produce: you get a Kafka message for an order you already wrote — Kafka message is “new” from Kafka’s perspective, downstream consumers don’t know you’ve already processed it.

If you don’t swallow: upstream keeps retrying the whole operation. Each retry, DB write hits the constraint, produce is never attempted. You’re stuck in a retry loop that can never succeed.

Failure mode 2: DB write succeeds, produce succeeds, produce ack times out — Producer produced the message. Kafka wrote it. The ack was lost in the network. Your producer sees a timeout, assumes failure, retries the produce. Kafka gets the same message twice. Without idempotent producer enabled (enable.idempotence=true), Kafka writes both. Downstream gets the message twice.

With enable.idempotence=true on the producer: Kafka deduplicates the retry on the broker side. The duplicate produce is dropped. Good — but this only covers producer-side retries. It does not cover the case where your handler retries the entire operation from scratch (new producer, new sequence number).

Failure mode 3: Produce succeeds, DB write fails — If you produce before you write (wrong, but happens): message is in Kafka. Downstream will process it. Your DB has nothing. Downstream writes based on the Kafka message. Now your DB is out of sync with what actually happened. This is the outbox pattern violation — the external system has state that your source-of-truth DB doesn’t have.

Kafka consumer + HTTP call: The Service Mesh Version

Your consumer calls a downstream service via HTTP — the pattern that breaks most microservice meshes.

Kafka.poll() → HTTP.call(downstream) → Kafka.ack(offset)

Failure mode 1: Your HTTP call blocks for 35 seconds. Kafka’s default max.poll.interval.ms is 300,000ms (5 minutes). But if someone configured it lower: Kafka considers your consumer dead, triggers a rebalance, assigns the partition to another consumer. That consumer polls the same offset, makes the same HTTP call. Two concurrent HTTP calls for the same message. Both may succeed.

Failure mode 2: HTTP call succeeds, ack fails, rebalance happens — another consumer gets the partition, re-processes the same message, makes the HTTP call again. Downstream receives the same request twice.

Failure mode 3: You implement idempotency on your consumer and pass an idempotency key to the downstream HTTP call — but the downstream service doesn’t support idempotency keys. It processes every request independently. Your idempotency is local. The downstream processes every call you make, including retries.

This is a distributed system smell: idempotency must be end-to-end, not just local. If any component in the chain processes duplicates, duplicates propagate.

Part 3 — What Actually Works: The Complete Technical Playbook

Now the real engineering.

The Transactional Outbox Pattern — In Full Detail

The Outbox pattern solves the DB + queue boundary problem. It’s worth understanding completely, including the sharp edges.

The core idea: Instead of writing to DB and then to queue (two operations, two components, gap between them), you write to DB only — but you write to two tables in the same transaction.

┌─────────────────────────────────────────────────────────┐
│                      Your Postgres                       │
│                                                         │
│  BEGIN TRANSACTION                                       │
│    INSERT INTO orders (id, user_id, total, status)      │
│      VALUES ('ord_123', 42, 2999, 'created');           │
│                                                         │
│    INSERT INTO outbox (id, topic, payload, created_at)  │
│      VALUES (gen_random_uuid(),                         │
│              'orders.created',                          │
│              '{"order_id":"ord_123","user_id":42}',     │
│              NOW());                                    │
│  COMMIT                                                 │
└─────────────────────────────────────────────────────────┘
           │
           │  (separate process — outbox poller)
           ▼
┌─────────────────────┐
│   Outbox Poller     │
│                     │
│   SELECT * FROM     │
│   outbox            │
│   WHERE processed   │
│   = false           │
│   ORDER BY          │
│   created_at        │
│   LIMIT 100         │
│   FOR UPDATE        │
│   SKIP LOCKED       │
└──────────┬──────────┘
           │
           ▼
   ┌───────────────┐
   │  Kafka / SQS  │
   │  / Redis      │
   │  Streams      │
   └───────┬───────┘
           │  (on success)
           ▼
   UPDATE outbox
   SET processed = true
   WHERE id = $1

Visually, the same idea:

Transactional outbox Single DB transaction INSERT orders … INSERT outbox (topic, payload) … COMMIT — both or neither Outbox poller (separate process) FOR UPDATE SKIP LOCKED · publish Kafka / SQS / Redis Streams On success → mark outbox processed

Why it works: The DB transaction is ACID. Either both the orders row and the outbox row are written, or neither is. There is no gap between “order created” and “notification queued” — they are the same atomic operation.

The outbox poller is a separate process. It can crash and restart without losing messages — the outbox row is still in the DB. It can fail to deliver — it just retries. The outbox row stays until it’s explicitly marked processed.

The SKIP LOCKED detail: If you run multiple outbox poller instances (for throughput), they’ll race on the same rows. SELECT … FOR UPDATE SKIP LOCKED ensures each row is locked by exactly one poller at a time. A locked row is skipped by other pollers. This prevents double-delivery at the poller level — you still need idempotency downstream, but you reduce the duplicate rate significantly.

Outbox cleanup: Processed outbox rows accumulate. You need a cleanup job:

DELETE FROM outbox
WHERE processed = true
  AND created_at < NOW() - INTERVAL '7 days';

Run this on a schedule. Monitor outbox table size. A growing outbox is an alert — it means your poller is not keeping up.

Change Data Capture (CDC) as an alternative poller: Instead of polling the outbox table, you can use CDC tools like Debezium to tail the Postgres WAL (Write-Ahead Log) and publish outbox changes as they happen. This gives you lower latency (no polling interval) and reduces DB load. The tradeoff: Debezium is another system to operate.

Primary failover: During primary failover, the poller may lose its connection mid-poll; FOR UPDATE locks are released. After promotion, another poller instance can briefly see an outbox row that already produced a broker message but is not marked processed yet — leading to duplicate publish attempts. Treat that window as normal: downstream idempotency is your safety net here.

Idempotency — The Complete Implementation

Option 1: Idempotency table with SELECT FOR UPDATE

CREATE TABLE processed_events (
  event_id     TEXT PRIMARY KEY,
  processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Consumer logic:

BEGIN;
  -- Lock the row if it exists, or insert if it doesn't
  INSERT INTO processed_events (event_id)
  VALUES ($1)
  ON CONFLICT (event_id) DO NOTHING;

  -- Check if we just inserted or if it was already there
  -- If already there: rollback and skip
  -- If we inserted: process the event

  -- ... your processing logic here ...

COMMIT;

After the insert, check rows_affected or use INSERT … RETURNING — with ON CONFLICT DO NOTHING, some drivers report zero rows affected in both the genuine insert and the conflict path, so you cannot infer “winner vs duplicate” without an explicit probe.

The ON CONFLICT DO NOTHING with a unique primary key is your concurrency-safe idempotency check. Two concurrent consumers with the same event_id will race on the INSERT. One wins, one does nothing. No double-processing.

Option 2: Conditional update with version number

UPDATE orders
SET status = 'paid',
    version = version + 1,
    updated_at = NOW()
WHERE id = $1
  AND version = $2  -- must match expected version
  AND status != 'paid';  -- must not already be paid

-- Check rows_affected
-- 0: already processed (idempotent no-op)
-- 1: success

Option 3: Natural idempotency — design the operation to be inherently safe

-- Not idempotent
INSERT INTO notifications (user_id, message) VALUES ($1, $2);

-- Idempotent
INSERT INTO notifications (event_id, user_id, message)
VALUES ($1, $2, $3)
ON CONFLICT (event_id) DO NOTHING;

Option 4: Idempotency keys for external API calls

Stripe, Braintree, Adyen all support idempotency keys. The key is a unique string you generate per logical operation. If you call the API twice with the same key, the second call returns the first call’s response without processing again.

await stripe.paymentIntents.create({
  amount: 2999,
  currency: 'inr',
  customer: customerId,
}, {
  idempotencyKey: `order_${orderId}_charge_v1`,
});

Generate the key deterministically from your data — not randomly. If you generate it randomly, a retry generates a different key and idempotency doesn’t work.

WebSocket Real-Time: The Complete Pattern

WebSockets are at-most-once by default. Here’s the full stack to make them effectively exactly-once.

Layer 1: Sequence numbers — Every message from server to client gets a monotonically increasing sequence number per client session or per channel:

{
  "seq": 4821,
  "type": "order.confirmed",
  "payload": { "order_id": "ord_123", "total": 2999 }
}

The client tracks lastSeq. On reconnect, the client sends lastSeq in the connection handshake. The server replays all messages with seq > lastSeq from its buffer.

Layer 2: Server-side message buffer — Store unacked messages per client in Redis:

ZADD ws:messages:{client_id} {seq} {serialized_message}

Sorted set with seq as score. On reconnect:

ZRANGEBYSCORE ws:messages:{client_id} {lastSeq+1} +inf

Replay everything above the client’s last seen. Expire the sorted set with a TTL (e.g. 24 hours) to prevent unbounded growth.

Layer 3: Client-side ack — Client sends ack after processing each message:

{ "type": "ack", "seq": 4821 }

Server removes acked messages from the buffer:

ZREMRANGEBYSCORE ws:messages:{client_id} -inf {acked_seq}

Layer 4: Persistent notification store (the safety net) — WebSocket is the fast path. The database is the source of truth.

CREATE TABLE notifications (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id     INT NOT NULL,
  event_id    TEXT NOT NULL UNIQUE,  -- idempotency
  type        TEXT NOT NULL,
  payload     JSONB NOT NULL,
  read_at     TIMESTAMPTZ,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Every notification is written to this table (via outbox pattern) before being pushed via WebSocket. The WebSocket push is an optimization. If the user missed it, they catch up via:

GET /notifications?since={lastSeenId}

Called on reconnect, page load, or after a gap in connectivity.

Layer 5: The golden rule — “WebSocket as hint, REST as truth”

The most resilient pattern: the WebSocket push contains minimal data:

{ "seq": 4821, "type": "order.confirmed", "entity_id": "ord_123" }

The client receives this and fetches the authoritative state:

GET /orders/ord_123

Even if the WebSocket message is lost, duplicated, or delivered out of order, the client always fetches current state from the API. The UI reflects what the database says, not what the WebSocket said. The WebSocket is just a wake-up call.

The Saga Pattern — For Long-Running Distributed Operations

When you have multiple steps across multiple services and you need each step to be reliable, the Saga pattern is the standard approach.

Each saga step has a forward transaction (the action) and a compensating transaction (the undo).

Step 1: Reserve inventory    | Compensating: Release inventory
Step 2: Charge payment       | Compensating: Refund payment
Step 3: Create shipment      | Compensating: Cancel shipment
Step 4: Send confirmation    | Compensating: Send cancellation

If step 3 fails, run compensating transactions for steps 2 and 1 (in reverse order). Step 4 is not a symmetric undo: you can’t un-send a confirmation email over SMTP. The “Send cancellation” compensating action is a follow-up business message — not a rollback of the original send. Some steps are only compensatable at the product level, not as a true rewind.

Saga implementations:

  • Choreography-based: Each service publishes events and other services react. No central coordinator. Simpler to deploy, harder to debug. The saga state is implicit in the event flow.
  • Orchestration-based: A central saga orchestrator sends commands to each service and receives responses. Easier to trace, single point of failure (mitigate with persistent orchestrator state).

The key insight: within each saga step, you still need exactly-once (or idempotent) behavior. The saga pattern doesn’t eliminate the per-step problem — it organizes the failure recovery across steps.

Part 4 — Observability: How to Know When Your Guarantees Are Lying

You can’t fix what you can’t see. Here’s what to instrument.

Duplicate rate metric:

duplicate_events_total{topic="orders.created", consumer="fulfillment"}

This should be close to zero in steady state. A spike means either a consumer restart happened, or your idempotency logic has a bug.

Outbox lag:

outbox_unprocessed_count
outbox_oldest_unprocessed_age_seconds

If outbox rows are old and unprocessed, your poller is down or falling behind. Alert on this.

Consumer lag:

kafka_consumer_lag{topic="orders.created", consumer_group="fulfillment"}

High lag + duplicate events = your consumer is slow and reprocessing. Check idempotency implementation.

WebSocket message loss rate:

ws_message_sent_total - ws_message_acked_total = unacked messages

If this grows: clients are disconnecting without acking. Check reconnect logic and message buffer expiry.

Dead letter queue depth:

dlq_message_count{source_topic="orders.created"}

Messages in DLQ have failed processing N times. Alert and investigate.

The Honest Summary

Here’s what your architecture actually has:

BoundaryGuaranteeWhat you need
Postgres transactionExactly-once (ACID)Nothing extra — it’s real
Kafka → Kafka (EOS enabled)Exactly-onceTransactional API + idempotent producer
DB → KafkaAt-least-onceOutbox pattern + idempotent consumer
DB → WebSocketAt-most-once (default)Seq numbers + buffer + REST fallback
HTTP → External APIAt-most-once (default)Idempotency keys + retry logic
Kafka consumer → DBAt-least-onceIdempotent writes (upsert / ON CONFLICT)
Service A → Service B (HTTP)At-most-once (default)Idempotency keys end-to-end
Redis pub/sub → subscriberAt-most-onceDon’t use for reliable delivery

What You Should Write in Your Design Docs

Stop writing “we have exactly-once delivery.” Write something like:

“This flow crosses two component boundaries: DB → outbox poller → Kafka. The DB write and outbox insert are atomic. The outbox poller delivers at-least-once to Kafka. Consumers handle duplicates via idempotency key table (processed_events). Expected duplicate rate under steady state: ~0%. Under consumer restart: <0.01% of events may be processed twice, safely.”

That’s a sentence that describes a real system. It names the boundaries. It names the guarantee at each boundary. It names the mechanism that handles failure. It gives you a number.

The goal is not to achieve exactly-once everywhere. The goal is to know exactly what you have, design every consumer to be safe under it, and instrument everything so you know when it’s not working.

That’s how distributed systems that actually hold up in production are built.