Slack Sends Messages Over HTTP, Not Its WebSocket — and the Durability Bug That Forced the Switch


There is a fact about Slack’s architecture that surprises almost everyone who learns it for the first time:

When you press Enter on a Slack message, that message is not sent over the WebSocket connection your client is holding open. It is sent over a separate HTTPS POST request to chat.postMessage.

Slack has a WebSocket. It uses it heavily. But for the one operation you would most expect a WebSocket to handle — the user actually sending a message — Slack reaches past it and goes to HTTP.

This isn’t a quirk. It’s a deliberate design decision that came out of a class of durability bug that bit early real-time products hard. This post explains exactly what that bug looks like, why WebSocket sends are fundamentally not the right tool for the job, what HTTP gives you that the WebSocket protocol does not, and the receive-side migration (RTM API → Events API) that made the same architectural rule public for every app developer.

Part 1 — What people think Slack does vs. what it actually does

The intuitive mental model of a real-time chat app is something like this:

[Client] ──── WebSocket (everything) ────► [Server]
        ◄─── WebSocket (everything) ─────

One pipe. Bidirectional. Sends go up the pipe, events come back down the pipe. Simple, low-latency, “real-time.”

Slack’s actual model splits the two directions:

                       SEND PATH
[Client] ─── HTTPS POST /api/chat.postMessage ───► [Slack API]
         ◄── HTTP 200 { ok:true, ts, channel } ──

                      RECEIVE PATH
[Client] ◄── WebSocket frames (events) ──── [Slack edge/gateway]
          (typing, new messages from others, reactions, presence…)

The send and receive paths use different transports because they have different requirements. The send path needs durability and acknowledgement. The receive path needs low-latency push. Each transport is good at exactly one of those things.

This split shows up in two other places at Slack too:

Bots and apps. Slack’s original Real Time Messaging (RTM) API delivered events to bots over a WebSocket. In 2017, Slack introduced the Events API, which delivers events to bot servers over HTTPS POST webhooks. RTM is now strongly discouraged for new apps. The official guidance: receive events via HTTP webhooks, take actions via HTTP API calls.
Slash commands & interactivity. When a user runs a slash command or clicks a button in a Slack message, Slack invokes your app via an HTTPS POST — not over any open WebSocket your app might have. The pattern is consistent: actions that must not be lost go over HTTP.

One nuance up front. Slack also offers Socket Mode (2020), which lets a bot tunnel the Events API over a WebSocket — useful when your server cannot accept public webhooks (corporate firewalls, laptops). Socket Mode is a transport workaround, not a return to fire-and-forget delivery: Slack still treats each event as an HTTP-style request and expects an acknowledgement. We’ll come back to it.

Part 2 — Why the obvious answer (“send over the WebSocket”) is wrong

If you already have a WebSocket open, sending the message over it looks like the cheapest, fastest option. Zero new handshake, zero extra round trips, the connection is right there. So why doesn’t Slack do that?

Because WebSocket.send() does not mean what you think it means.

What `WebSocket.send()` actually returns

In the browser, WebSocket.send(data) returns undefined. It can throw if the connection is already closed. But if it returns normally, here is what it has actually guaranteed:

The data has been queued into the WebSocket’s internal buffer in the browser.

That’s it. That’s the whole guarantee.

It has not guaranteed that the bytes left your machine. The browser hands them down to the kernel’s TCP send buffer, which hands them down to the NIC, which puts them on the wire — eventually. You can inspect WebSocket.bufferedAmount to see how many bytes are still sitting in the WebSocket’s queue waiting to be handed to TCP, but even bufferedAmount === 0 only tells you the browser is done with them. The kernel might still be holding them.

And even when the bytes have reached the server’s kernel, the server process may not have read them yet. And even when the server process has read them, it may not have parsed the frame, validated it, persisted it, or fanned it out to other clients.
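As a rough illustration of how little bufferedAmount tells you, here is a minimal sketch of a helper that waits for the socket's queue to empty. The name waitForDrain and its options are hypothetical; the only real API it relies on is the standard bufferedAmount property, so any object exposing that property works.

```javascript
// Hypothetical helper (not part of any WebSocket API): resolves once the
// socket's own send queue is empty. `socket` is any object exposing the
// standard `bufferedAmount` property.
function waitForDrain(socket, { pollMs = 10, timeoutMs = 1000 } = {}) {
  return new Promise((resolve, reject) => {
    const startedAt = Date.now();
    const check = () => {
      if (socket.bufferedAmount === 0) {
        // The browser has handed every byte to the OS. This says nothing
        // about the network, the server's kernel, or the server process.
        resolve("handed-to-os");
      } else if (Date.now() - startedAt >= timeoutMs) {
        reject(new Error("socket buffer did not drain in time"));
      } else {
        setTimeout(check, pollMs);
      }
    };
    check();
  });
}
```

Even when this promise resolves, the strongest statement available is "the bytes left the browser", which is why it cannot stand in for a server acknowledgement.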

The RFC 6455 gap

The WebSocket protocol (RFC 6455) defines six frame types: continuation, text, binary, close, ping, and pong. None of them is an application-layer acknowledgement.

TCP underneath provides byte-level reliability — bytes arrive in order, with retransmission. But TCP’s ACKs are between kernels, not between applications. When the client kernel’s TCP stack sees an ACK come back, all it learned is that the receiving kernel got the bytes. The receiving application process might not have called recv() yet. It might never call recv() because it just crashed.

HTTP’s genius is that the response is the application-layer ack. A 200 OK comes back only after the server’s handler ran. A 5xx means the server tried and failed. A timeout means you do not know. Each request is a small, scoped contract.

WebSocket has no equivalent. You can build one on top, and most production systems do, but every team builds it differently, and getting it right takes at least as much work as using HTTP in the first place.
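To make "build one on top" concrete, here is a minimal sketch of an application-layer ack over a socket. Everything in it is an assumption made for illustration: the createAckingSender name, the frame shapes ({ type: "message", id } going up and { type: "ack", id } coming back), and the timeout value. It works with any object that has a send(string) method.

```javascript
// Hypothetical ack layer. The frame shapes are assumptions for this sketch,
// not a standard and not Slack's protocol.
function createAckingSender(socket, { timeoutMs = 5000 } = {}) {
  const pending = new Map(); // id -> { resolve, timer }
  let nextId = 1;

  function send(payload) {
    const id = String(nextId++);
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        pending.delete(id);
        // A timeout is ambiguous: the server may or may not have saved it.
        reject(new Error(`no ack for message ${id}`));
      }, timeoutMs);
      pending.set(id, { resolve, timer });
      socket.send(JSON.stringify({ ...payload, type: "message", id }));
    });
  }

  // Call this from the socket's message handler for every incoming frame.
  // Returns true when the frame was an ack for a message still in flight.
  function handleFrame(raw) {
    const frame = JSON.parse(raw);
    if (frame.type !== "ack") return false;
    const entry = pending.get(frame.id);
    if (!entry) return false; // late or duplicate ack
    clearTimeout(entry.timer);
    pending.delete(frame.id);
    entry.resolve(frame);
    return true;
  }

  return { send, handleFrame, pendingCount: () => pending.size };
}
```

Note what is still missing: retries, server-side deduplication, and any handling of a reconnect while messages are in flight. Each of those is more protocol to design and test.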

Part 3 — The durability bug, walked through carefully

Let’s play out exactly what goes wrong when you send a chat message over a WebSocket without an application-layer ack.

The setup

A naive design:

// Client
ws.send(JSON.stringify({
  type: "message",
  channel: "C123",
  text: "Shipping the release in 5 min",
}));
// UI immediately shows the message as "sent"

// Server (Node-ish pseudocode)
ws.on("message", async (raw) => {
  const msg = JSON.parse(raw);
  await db.insert("messages", msg);
  await fanout.broadcast(msg.channel, msg);
});

It works the majority of the time. The bug lives in the unhappy paths.

Failure mode 1: the connection dies between send and persist

[Client]
  ws.send(msg)                              ✓ queued in browser buffer
  browser hands bytes to kernel TCP         ✓
  bytes leave NIC                           ✓
  ── packet in flight ──
                                  💥 server process OOM-killed
                                     between recv() and db.insert
  TCP eventually RSTs the socket
  Client gets a "close" event
  msg was never written to the message store

From the user’s perspective, the message looks sent — the UI showed it instantly. From the server’s perspective, the message never existed. Refresh the channel: it’s gone. Talk to the recipient: they never got it.

This is the bug. The client thought it shipped a durable message. The server made no such promise. There was no ack to wait for, so the client never knew the difference.

Failure mode 2: persist succeeded, the “you’re sent” signal was lost

[Client]
  ws.send(msg)
                                  server recv() ✓
                                  db.insert ✓
                                  fanout.broadcast ✓
                                  server sends ack frame
  ── ack packet drops ──
  Network blips, socket closes
  Client never sees the ack
  Client reconnects, retries the send
                                  server recv() ✓
                                  db.insert  → duplicate row (or duplicate fanout)

Same physics as the “two generals” problem. Without an idempotency key that the client attaches to both attempts, the server has no way to tell “this is the retry of the message you already saved” apart from “this is a brand new message that happens to look similar.” Two copies in the channel. Two notifications on the recipient’s phone. Probably a support ticket.

Failure mode 3: half-open socket, send disappears into the void

WebSocket connections sit on top of TCP, and TCP’s ESTABLISHED state lies. A NAT or load balancer can drop the flow mapping after an idle period (AWS ALB defaults to 60s of idle). Neither end finds out until something tries to send.

[Client]                              [LB]                       [Server]
  WebSocket ESTABLISHED  ─────────  flow mapping  ─────────  ESTABLISHED
                                    💥 evicted after 60s idle
  ws.send("important msg")
  Bytes go into TCP send buffer
  Retransmit, retransmit, retransmit…
  Eventually the kernel gives up, ~15 min later with Linux defaults
  Client app may not surface that for minutes

The message looks like it was sent. There is no error event. The user moves on. Fifteen minutes later the client surfaces a connection error — but the message that was “sent” is gone, because it was buffered into a socket whose other end no longer existed.

You can layer application heartbeats on top to catch this faster (and you should — see TCP Keep-Alive vs Application Heartbeat). But none of that gets you per-message durability. It just gets you faster detection that the pipe is broken.
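A minimal sketch of such an application heartbeat follows. The createHeartbeat name, the sendPing and onDead callbacks, and the interval values are all hypothetical; the caller is assumed to wire onPong to whatever reply frame its protocol defines.

```javascript
// Hypothetical heartbeat monitor. `sendPing` and `onDead` are supplied by
// the caller; the default intervals are illustrative, not recommendations.
function createHeartbeat({ sendPing, onDead, intervalMs = 15000, timeoutMs = 5000 }) {
  let pingTimer = null;
  let deadTimer = null;

  function stop() {
    clearInterval(pingTimer);
    clearTimeout(deadTimer);
    pingTimer = null;
    deadTimer = null;
  }

  function beat() {
    if (deadTimer !== null) return; // previous ping is still unanswered
    // Arm the deadline before sending so a very fast pong cannot be missed.
    deadTimer = setTimeout(() => {
      stop();
      onDead(); // declare the pipe broken instead of waiting for TCP
    }, timeoutMs);
    sendPing();
  }

  function start() {
    stop();
    pingTimer = setInterval(beat, intervalMs);
  }

  // Call this whenever the peer's reply to a ping arrives.
  function onPong() {
    clearTimeout(deadTimer);
    deadTimer = null;
  }

  return { start, stop, onPong };
}
```

This shortens detection from minutes to roughly intervalMs + timeoutMs, but any message buffered into the dead socket before onDead fires is still lost.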

Failure mode 4: server fanned out before persisting

To minimise latency, an early implementation might broadcast the message to other clients first, then write it to the database. If the database write fails after the fanout, other people in the channel saw a message that doesn’t exist. If they reply, their reply references a parent message ID that’s not in the store. If they leave and re-open the channel, the original message is gone but the reply is there.

HTTP’s request/response shape pushes you toward the right order naturally: you cannot return 200 ok until persistence succeeded, so the “message visible” signal is gated on durability. WebSocket has no such structural pressure: the temptation to fan out first is real, and several real-time apps have fallen into it.
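The ordering can be sketched as a small handler. This is an illustration under assumptions, not Slack's implementation: handleSend, store.insert, and fanout.broadcast are hypothetical stand-ins for a real request handler, database, and pub/sub layer.

```javascript
// Hypothetical send handler: persist first, fan out second, respond last.
async function handleSend(msg, { store, fanout }) {
  let saved;
  try {
    saved = await store.insert(msg); // durability comes first
  } catch (err) {
    // Nothing was broadcast, so no client ever saw a phantom message.
    return { status: 500, body: { ok: false, error: "persist_failed" } };
  }

  try {
    await fanout.broadcast(saved.channel, saved);
  } catch (err) {
    // Fan-out is only a hint. The message is already stored, so other
    // clients can still pick it up from history later.
  }

  // Success is reported only after the write succeeded.
  return { status: 200, body: { ok: true, ts: saved.ts, channel: saved.channel } };
}
```

A failed broadcast does not fail the request here, because the stored row, not the push, is what the sender was promised.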

Part 4 — What HTTP gives you that raw WebSocket does not

Now flip the design. The send path goes through HTTPS:

// Client
const res = await fetch("/api/chat.postMessage", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${token}`,
  },
  body: JSON.stringify({
    channel: "C123",
    text: "Shipping the release in 5 min",
    client_msg_id: "5b3c…-uuid",  // idempotency key
  }),
});
const { ok, ts, channel } = await res.json();

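Look at what you get for free with this shape:

An application-layer ack. The 200 response, with ok: true and the server-assigned ts, comes back only after the server’s handler ran.
An explicit failure signal. A 4xx or 5xx says the request was rejected or failed; a timeout says the outcome is unknown.
Safe retries. The client_msg_id lets the server recognise a repeated attempt instead of inserting a duplicate.
Per-request scope. Each send is a short-lived, stateless request that any instance can serve.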

The idempotency key, in detail

The Slack message payload includes a client_msg_id — a UUID generated by the client. The server stores it alongside the message row. If the client retries (because the first POST timed out and the client doesn’t know whether it succeeded), the second POST carries the same client_msg_id. The server sees a uniqueness conflict and returns the existing message’s metadata (the ts assigned the first time) instead of inserting a duplicate.

CREATE TABLE messages (
  ts             TEXT PRIMARY KEY,           -- server-assigned
  channel        TEXT NOT NULL,
  user_id        TEXT NOT NULL,
  text           TEXT NOT NULL,
  client_msg_id  TEXT NOT NULL,
  UNIQUE (channel, client_msg_id)            -- dedup key
);
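The server-side behaviour that constraint enables can be sketched with an in-memory stand-in for the table. The createMessageStore and insertOrGet names and the ts format are assumptions for illustration; in a SQL store the same effect would typically come from handling the uniqueness conflict on (channel, client_msg_id).

```javascript
// In-memory stand-in for the messages table, for illustration only.
function createMessageStore() {
  const byDedupKey = new Map(); // mirrors UNIQUE (channel, client_msg_id)
  let counter = 0;

  function insertOrGet({ channel, user_id, text, client_msg_id }) {
    const key = JSON.stringify([channel, client_msg_id]);
    const existing = byDedupKey.get(key);
    if (existing) {
      // A retry of a message we already saved: return the original row.
      return { row: existing, duplicate: true };
    }
    counter += 1;
    const seconds = Math.floor(Date.now() / 1000);
    // Server-assigned timestamp; the exact format is an assumption.
    const ts = `${seconds}.${String(counter).padStart(6, "0")}`;
    const row = { ts, channel, user_id, text, client_msg_id };
    byDedupKey.set(key, row);
    return { row, duplicate: false };
  }

  return { insertOrGet, size: () => byDedupKey.size };
}
```

Because the retry gets back the ts assigned the first time, the client can reconcile its pending message with the stored one without knowing which attempt actually landed.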

The retry is safe. The UI behaviour is: “Sending…” → either “Sent” (200, possibly after one transparent retry) or “Failed, tap to retry” (4xx/5xx after backoff). The user never sees a phantom message.
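On the client, that retry behaviour might look like the sketch below. It assumes a hypothetical post(message) function that resolves to { status, body } or throws on a network error or timeout; sendWithRetry, its options, and the choice to stop retrying on a 4xx are illustrative decisions, not something the source prescribes.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical retry loop. The same `message` object, and therefore the
// same client_msg_id, is sent on every attempt.
async function sendWithRetry(post, message, { maxAttempts = 3, baseDelayMs = 200 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const res = await post(message);
      if (res.status >= 200 && res.status < 300) {
        return { state: "sent", ts: res.body.ts, attempts: attempt };
      }
      if (res.status >= 400 && res.status < 500) {
        // The server rejected the request; repeating it will not help.
        return { state: "failed", attempts: attempt };
      }
      // 5xx: the server tried and failed, so fall through and retry.
    } catch (err) {
      // Network error or timeout: the outcome is unknown, so retry.
      // The idempotency key makes this safe even if the first try landed.
    }
    if (attempt < maxAttempts) {
      await sleep(baseDelayMs * 2 ** (attempt - 1)); // exponential backoff
    }
  }
  return { state: "failed", attempts: maxAttempts };
}
```

The UI would show “Sending…” while the promise is pending, then map state "sent" or "failed" to the final label.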

You could implement the same scheme over WebSocket. The framing differences don’t matter — JSON is JSON. But once you’ve added per-message IDs, an ack frame, retry-with-backoff, server-side dedup, and a state machine for in-flight messages, you have rebuilt a worse HTTP. With more bugs.

Load balancing is the other half

WebSockets are a load-balancing problem. The connection is sticky to one backend for its entire lifetime — minutes, hours, sometimes days. Deploying a new version means either gracefully draining millions of connections (slow), or just dropping them (chaos). A surge of new users hitting the same edge means rebalancing connections you can’t easily move.

HTTP requests are short-lived and stateless. Each chat.postMessage POST can be routed to whichever message-service instance is least loaded. A deploy is a rolling restart with zero user-visible impact. Capacity scales linearly with request rate, not connection count.

Putting the send path on HTTP means the most important operation in the product runs on the easiest-to-operate transport. The hard transport (WebSocket) is reserved for the use case that genuinely needs it: low-latency push of other people’s events.

Part 5 — The receive side: RTM → Events API, the same lesson in public

For Slack’s own clients, the receive side is still a WebSocket — and it’s fine, because the receive side is a hint, not a source of truth. (We’ll come back to why this is fine in Part 6.)

But for bots and third-party apps, Slack made the same architectural decision visible to the outside world. The history is worth tracing.

RTM API (legacy) — WebSocket events to your bot

The original way to write a Slack bot: call rtm.connect, get back a wss:// URL, hold the WebSocket open, and receive a stream of event JSON. Every message in every channel your bot was a member of came down that pipe.

The problems were the same ones that motivated the send-side decision:

If your bot was disconnected, events were lost. Slack maintained a small replay window after reconnect, but it was best-effort. A long disconnect = missing events.
No per-event acknowledgement. If your bot received an event but crashed before processing it, Slack had no way to know — no retry.
Hard to operate. A bot serving many workspaces had to hold many WebSocket connections, manage reconnect storms, handle backpressure manually.
One bug took down everything. A bug in your event loop could starve every event for every workspace, with no isolation per request.

Events API (2017) — HTTP webhooks to your bot

The replacement: Slack POSTs each event to an HTTPS endpoint you register. Your endpoint returns 200. If you return anything else, or time out (3 seconds), Slack retries up to 3 times with backoff: nearly immediately, then after about a minute, then after about five minutes. Every retry carries an X-Slack-Retry-Num header so you can tell. Every event has a unique event_id so you can deduplicate.
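A receiving endpoint built around those two hooks could be sketched as follows. This is a transport-agnostic illustration: createEventReceiver and the enqueue callback (imagined as a durable queue) are hypothetical, the headers object is assumed to use lowercase keys as Node's HTTP server does, and request signature verification is omitted for brevity.

```javascript
// Hypothetical receiver core. `enqueue` stands in for a durable queue;
// wiring this to an actual HTTP server is left out.
function createEventReceiver(enqueue) {
  // event_ids already accepted. Unbounded here; a real service would
  // expire old entries or keep them in shared storage.
  const seen = new Set();

  return function receive(headers, body) {
    const retryNum = Number(headers["x-slack-retry-num"] || 0);

    if (seen.has(body.event_id)) {
      // A redelivery of something we already accepted: ack it again.
      return { status: 200, duplicate: true, retryNum };
    }

    try {
      enqueue(body); // hand off quickly; slow work happens elsewhere
    } catch (err) {
      // A non-200 response asks the sender to try again later.
      return { status: 500, duplicate: false, retryNum };
    }

    seen.add(body.event_id); // only mark as seen once safely queued
    return { status: 200, duplicate: false, retryNum };
  };
}
```

Marking the event as seen only after enqueue succeeds keeps a failed hand-off retryable, which is the whole value of the per-event ack.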

This is the same trade Slack made on the send side, made public for app developers: important deliveries go over HTTP. The transport with the built-in ack semantics, the per-request scope, the standard middleware. WebSocket is reserved for the cases where push latency matters more than per-event durability.

Socket Mode (2020) — the nuance

The WebSocket came back in Socket Mode because some apps cannot expose a public HTTPS endpoint: a CLI tool running on a laptop, a corporate bot stuck behind a firewall. So your app opens an outbound WebSocket to Slack, and Slack tunnels Events API messages through it. The wire is a WebSocket, but the contract is still HTTP-shaped: each event has an envelope_id, your app must send back an explicit ack, and unacked events are retried.
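The ack contract can be sketched in a few lines. This is a simplified illustration, not a complete Socket Mode client: handleSocketModeFrame and the handleEvent callback are hypothetical, the socket is any object with a send(string) method, and the assumption is that an ack is a JSON frame echoing the envelope_id.

```javascript
// Hypothetical frame handler: process the event, then acknowledge it by
// echoing its envelope_id. Connection setup and reconnects are omitted.
function handleSocketModeFrame(raw, socket, handleEvent) {
  const envelope = JSON.parse(raw);
  if (!envelope.envelope_id) {
    return false; // housekeeping frame with nothing to acknowledge
  }
  // If handleEvent throws, no ack is sent and the event stays unacked,
  // which leaves it eligible for redelivery.
  handleEvent(envelope.payload);
  socket.send(JSON.stringify({ envelope_id: envelope.envelope_id }));
  return true;
}
```

In practice the ack has to go out quickly, so handleEvent here should do no more than hand the payload to a queue.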

That distinction is the whole point of this post. The transport is not the architecture. Slack didn’t go back to fire-and-forget WebSocket; it kept the per-message acknowledgement model and just changed how the bytes move. The bug it was guarding against (silent message loss) is a bug at the application protocol layer, not the transport layer, and it was fixed at the right layer.

Part 6 — “WebSocket as hint, REST as truth”

If sends are over HTTP, what is the WebSocket actually carrying in a Slack client?

Other people’s messages. The fanout push.
Typing indicators, presence changes, reactions, edits. Low-value, high-frequency hints where loss is acceptable.
Push wake-ups. "Something changed in channel C123 — go re-fetch if you care."

The key design pattern: the WebSocket message is a hint, the REST API is the truth. The client treats every event as “go check.” On reconnect, the client asks for state since its last known cursor:

GET /api/conversations.history?channel=C123&oldest=1717070400.000123

If the WebSocket missed three events while the laptop was suspended, the client doesn’t care — the catch-up REST call fills in the gap. The UI shows whatever the REST call returns. The WebSocket just shortens the window between “something happened” and “the UI knows.”
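A minimal sketch of that catch-up step, under stated assumptions: catchUp and fetchHistory are hypothetical names, fetchHistory(channel, oldest) is assumed to wrap the REST call above and return messages newer than the cursor in oldest-first order, and the client keeps its messages in a Map keyed by ts.

```javascript
// Hypothetical reconnect handler: ask the REST API for everything after
// the last known cursor and merge it into local state.
async function catchUp(channelState, fetchHistory) {
  const missed = await fetchHistory(channelState.channel, channelState.cursor);
  for (const msg of missed) {
    // Keyed by ts, so a message that also arrived over the WebSocket is
    // overwritten in place instead of being shown twice.
    channelState.messages.set(msg.ts, msg);
    channelState.cursor = msg.ts; // oldest-first, so the last one wins
  }
  return missed.length;
}
```

Running this on every reconnect makes missed WebSocket frames harmless: the worst case is a short delay before the UI converges on what the REST call returns.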

This is the pattern that survives a flaky network, a server restart, a deploy, a Wi-Fi-to-cellular handoff, and a closed laptop lid. It’s the pattern Slack converged on; it’s the pattern Discord, Linear, Notion, Figma, and Google Docs all use; and it’s the pattern that scales because the expensive operations (durable writes) run on the cheap-to-operate transport (HTTP), and the cheap operations (hints) run on the expensive-to-operate transport (WebSocket).

Part 7 — When you should (and shouldn’t) do this

The takeaway is not “WebSockets are bad” or “always use HTTP.” It’s that sends and receives have different reliability requirements, and one transport rarely satisfies both.

Operation	Transport	Why
User-initiated write that must persist	HTTP POST + idempotency key	Status code is the ack; retries are safe; LB is trivial
Server-to-client push (new events from others)	WebSocket	Low latency, fan-out friendly, loss tolerable if REST is truth
Server-to-server delivery of important events	HTTP webhook with retry + dedup	Per-event ack, standard middleware, no sticky connection
Presence, typing, cursors, ephemeral hints	WebSocket	High frequency, loss is fine, no durability required
Catch-up after disconnect	REST query with cursor	Authoritative source of truth, gaps fill themselves
Streaming bulk data (telemetry, video frames)	WebSocket or HTTP/2 streams	Continuous flow, individual loss often acceptable

If you find yourself reaching for a WebSocket for the send path because “the connection is already there,” stop and ask:

What happens if the connection drops between send() and the server’s persist?
How does the client know the message was actually saved?
How do you retry a send safely without duplicates?
What does the UI show in the ambiguous case where you sent and don’t know if it landed?

If your answers are vague, you have the same bug Slack’s early architecture had. The fix is not to make the WebSocket smarter. The fix is to send the message over a transport that already has an ack built into it.

The honest one-liner

Slack sends over HTTP not because HTTP is faster (it isn’t), and not because WebSockets are unreliable at the transport level (TCP is fine). It sends over HTTP because the HTTP request/response shape forces an application-layer acknowledgement, an idempotency model, and a stateless load-balancing story that a raw WebSocket leaves you to invent yourself. For the one operation in the product where losing a message is unacceptable — the user pressing Enter — the cheapest correct answer is to use the protocol that was already designed for it.

The WebSocket is still there. It is doing the job it’s good at: pushing other people’s activity to your client with low latency, so the UI feels alive. The REST API is doing the job it’s good at: being the source of truth that the client trusts when the WebSocket misses something.

Two transports. Two jobs. One product that doesn’t lose your messages.