There is a fact about Slack’s architecture that surprises almost everyone who learns it for the first time:
When you press Enter on a Slack message, that message is not sent over the WebSocket connection your client is holding open. It is sent over a separate HTTPS POST request to chat.postMessage.
Slack has a WebSocket. It uses it heavily. But for the one operation you would most expect a WebSocket to handle — the user actually sending a message — Slack reaches past it and goes to HTTP.
This isn’t a quirk. It’s a deliberate design decision that came out of a class of durability bug that bit early real-time products hard. This post explains exactly what that bug looks like, why WebSocket sends are fundamentally not the right tool for the job, what HTTP gives you that the WebSocket protocol does not, and the receive-side migration (RTM API → Events API) that made the same architectural rule public for every app developer.
Part 1 — What people think Slack does vs. what it actually does
The intuitive mental model of a real-time chat app is something like this:
[Client] ──── WebSocket (everything) ────► [Server]
◄─── WebSocket (everything) ─────
One pipe. Bidirectional. Sends go up the pipe, events come back down the pipe. Simple, low-latency, “real-time.”
Slack’s actual model splits the two directions:
SEND PATH
[Client] ─── HTTPS POST /api/chat.postMessage ───► [Slack API]
◄── HTTP 200 { ok:true, ts, channel } ──
RECEIVE PATH
[Client] ◄── WebSocket frames (events) ──── [Slack edge/gateway]
(typing, new messages from others, reactions, presence…)
The send and receive paths use different transports because they have different requirements. The send path needs durability and acknowledgement. The receive path needs low-latency push. One transport is good at exactly one of those things.
This split shows up in two other places at Slack too:
- Bots and apps. Slack’s original Real Time Messaging (RTM) API delivered events to bots over a WebSocket. In 2017, Slack introduced the Events API, which delivers events to bot servers over HTTPS POST webhooks. RTM is now strongly discouraged for new apps. The official guidance: receive events via HTTP webhooks, take actions via HTTP API calls.
- Slash commands & interactivity. When a user runs a slash command or clicks a button in a Slack message, Slack invokes your app via an HTTPS POST — not over any open WebSocket your app might have. The pattern is consistent: actions that must not be lost go over HTTP.
Part 2 — Why the obvious answer (“send over the WebSocket”) is wrong
If you already have a WebSocket open, sending the message over it looks like the cheapest, fastest option. Zero new handshake, zero extra round trips, the connection is right there. So why doesn’t Slack do that?
Because WebSocket.send() does not mean what you think it means.
What WebSocket.send() actually returns
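In the browser, WebSocket.send(data) returns undefined. It throws only if the socket is still in the CONNECTING state; on a socket that is already closing or closed, the data is silently discarded. And if it returns normally on an open socket, here is what it has actually guaranteed: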
- The data has been queued into the WebSocket’s internal buffer in the browser.
That’s it. That’s the whole guarantee.
It has not guaranteed that the bytes left your machine. The browser hands them down to the kernel’s TCP send buffer, which hands them down to the NIC, which puts them on the wire — eventually. You can inspect WebSocket.bufferedAmount to see how many bytes are still sitting in the WebSocket’s queue waiting to be handed to TCP, but even bufferedAmount === 0 only tells you the browser is done with them. The kernel might still be holding them.
And even when the bytes have reached the server’s kernel, the server process may not have read them yet. And even when the server process has read them, it may not have parsed the frame, validated it, persisted it, or fanned it out to other clients.
The RFC 6455 gap
The WebSocket protocol (RFC 6455) defines six frame types: continuation, text, binary, close, ping, and pong. None of them is an application-layer acknowledgement.
TCP underneath provides byte-level reliability — bytes arrive in order, with retransmission. But TCP’s ACKs are between kernels, not between applications. When the client kernel’s TCP stack sees an ACK come back, all it learned is that the receiving kernel got the bytes. The receiving application process might not have called recv() yet. It might never call recv() because it just crashed.
HTTP’s genius is that the response is the application-layer ack. A 200 OK comes back only after the server’s handler ran. A 5xx means the server tried and failed. A timeout means you do not know. Each request is a small, scoped contract.
WebSocket has no equivalent. You can build one on top, and most production systems do, but every team builds it differently, and getting it right takes at least as much work as using HTTP in the first place.
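To make "build one on top" concrete, here is a minimal sketch of the bookkeeping a hand-rolled ack layer needs. The `{ type: "ack", id }` reply frame, the `AckTracker` class, and the injected `now` timestamps are all assumptions made for illustration; nothing here is defined by RFC 6455 or taken from Slack's protocol.

```javascript
// Hypothetical ack layer over a WebSocket-like channel.
// The "ack" frame shape is an assumption, not part of RFC 6455.
class AckTracker {
  constructor(timeoutMs) {
    this.timeoutMs = timeoutMs;
    this.pending = new Map(); // id -> { payload, sentAt }
  }

  // Record a message as in flight and return the frame to send.
  track(id, payload, now) {
    this.pending.set(id, { payload, sentAt: now });
    return JSON.stringify({ type: "message", id, ...payload });
  }

  // Handle an incoming frame; returns true if it acked a pending message.
  onFrame(raw) {
    const frame = JSON.parse(raw);
    if (frame.type !== "ack") return false;
    return this.pending.delete(frame.id);
  }

  // Ids whose ack has not arrived within the timeout: outcome unknown.
  expired(now) {
    const ids = [];
    for (const [id, entry] of this.pending) {
      if (now - entry.sentAt >= this.timeoutMs) ids.push(id);
    }
    return ids;
  }
}

const tracker = new AckTracker(5000);
tracker.track("m1", { channel: "C123", text: "hello" }, 0);
tracker.track("m2", { channel: "C123", text: "world" }, 0);
tracker.onFrame(JSON.stringify({ type: "ack", id: "m1" }));
console.log(tracker.expired(6000)); // only "m2" is still unacknowledged
```

Note what is still missing: retries, server-side dedup, and a decision about what the UI shows for the expired ids. Those are the pieces HTTP clients and servers already have conventions for.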
Part 3 — The durability bug, walked through carefully
Let’s play out exactly what goes wrong when you send a chat message over a WebSocket without an application-layer ack.
The setup
A naive design:
// Client
ws.send(JSON.stringify({
  type: "message",
  channel: "C123",
  text: "Shipping the release in 5 min",
}));
// UI immediately shows the message as "sent"

// Server (Node-ish pseudocode)
ws.on("message", async (raw) => {
  const msg = JSON.parse(raw);
  await db.insert("messages", msg);
  await fanout.broadcast(msg.channel, msg);
});
It works the majority of the time. The bug lives in the unhappy paths.
Failure mode 1: the connection dies between send and persist
[Client]
ws.send(msg) ✓ queued in browser buffer
TCP hands bytes to kernel ✓
bytes leave NIC ✓
── packet in flight ──
💥 server process OOM-killed
between recv() and db.insert
TCP eventually RSTs the socket
Client gets a "close" event
msg was never written to the message store
From the user’s perspective, the message looks sent — the UI showed it instantly. From the server’s perspective, the message never existed. Refresh the channel: it’s gone. Talk to the recipient: they never got it.
This is the bug. The client thought it shipped a durable message. The server made no such promise. There was no ack to wait for, so the client never knew the difference.
Failure mode 2: persist succeeded, the “you’re sent” signal was lost
[Client]
ws.send(msg)
server recv() ✓
db.insert ✓
fanout.broadcast ✓
server sends ack frame
── ack packet drops ──
Network blips, socket closes
Client never sees the ack
Client reconnects, retries the send
server recv() ✓
db.insert → duplicate row (or duplicate fanout)
This is the same physics as the “two generals” problem: the client cannot tell a lost message from a lost ack. Unless the client attaches an idempotency key to every attempt, the server has no way to tell “this is the retry of the message you already saved” apart from “this is a brand new message that happens to look similar.” The result is two copies in the channel, two notifications on the recipient’s phone, and probably a support ticket.
Failure mode 3: half-open socket, send disappears into the void
WebSocket connections sit on top of TCP, and TCP’s ESTABLISHED state lies. A NAT or load balancer can drop the flow mapping after an idle period (AWS ALB defaults to 60s of idle). Neither end finds out until something tries to send.
[Client] [LB] [Server]
WebSocket ESTABLISHED ───────── flow mapping ───────── ESTABLISHED
💥 evicted after 60s idle
ws.send("important msg")
Bytes go into TCP send buffer
Retransmit, retransmit, retransmit…
Eventually the connection times out, ~15 min later with Linux defaults
Client app may not surface that for minutes
The message looks like it was sent. There is no error event. The user moves on. Fifteen minutes later the client surfaces a connection error — but the message that was “sent” is gone, because it was buffered into a socket whose other end no longer existed.
You can layer application heartbeats on top to catch this faster (and you should — see TCP Keep-Alive vs Application Heartbeat). But none of that gets you per-message durability. It just gets you faster detection that the pipe is broken.
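As a rough illustration of what an application heartbeat does, the sketch below decides on each timer tick whether to ping, stay idle, or give up and reconnect. The `HeartbeatMonitor` class, its thresholds, and the injected `now` values are hypothetical; a real client would wire `tick` to a timer and `onFrame` to the socket's message handler.

```javascript
// Hypothetical application heartbeat: the peer is presumed dead if no frame
// (pong or otherwise) has arrived within `deadAfterMs`.
class HeartbeatMonitor {
  constructor(intervalMs, deadAfterMs, now) {
    this.intervalMs = intervalMs;
    this.deadAfterMs = deadAfterMs;
    this.lastSeen = now; // last time any frame arrived from the peer
    this.lastPing = now; // last time we asked the caller to send a ping
  }

  // Call whenever a frame (including a pong) is received.
  onFrame(now) {
    this.lastSeen = now;
  }

  // Call on a timer; returns the action the caller should take.
  tick(now) {
    if (now - this.lastSeen >= this.deadAfterMs) return "reconnect";
    if (now - this.lastPing >= this.intervalMs) {
      this.lastPing = now;
      return "ping";
    }
    return "idle";
  }
}

const hb = new HeartbeatMonitor(10000, 30000, 0);
console.log(hb.tick(5000));  // "idle"
console.log(hb.tick(10000)); // "ping"
hb.onFrame(10050);
console.log(hb.tick(41000)); // "reconnect"
```

This shrinks detection from minutes to seconds, but it still says nothing about whether any individual message was persisted.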
Failure mode 4: server fanned out before persisting
To minimise latency, an early implementation might broadcast the message to other clients first, then write it to the database. If the database write fails after the fanout, other people in the channel saw a message that doesn’t exist. If they reply, their reply references a parent message ID that’s not in the store. If they leave and re-open the channel, the original message is gone but the reply is there.
HTTP’s request/response shape pushes you toward the right order naturally: you cannot return 200 ok until persistence succeeded, so the “message visible” signal is gated on durability. WebSocket has no such structural pressure — the temptation to fanout first is real, and several real-time apps have fallen into it.
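The ordering argument can be sketched with in-memory stand-ins. `makeStore`, `makeFanout`, `handlePostMessage`, and the `persist_failed` error string are all hypothetical names for illustration; the only point is that the fanout call sits after a successful write and the 200 is only reachable on that path.

```javascript
// Hypothetical in-memory stand-ins for a message store and a fanout service.
function makeStore({ failWrites = false } = {}) {
  const rows = [];
  return {
    rows,
    insert(msg) {
      if (failWrites) throw new Error("db unavailable");
      rows.push(msg);
    },
  };
}

function makeFanout() {
  const delivered = [];
  return {
    delivered,
    broadcast(channel, msg) {
      delivered.push({ channel, msg });
    },
  };
}

// Persist first; only a durable message is ever made visible to others.
function handlePostMessage(store, fanout, msg) {
  try {
    store.insert(msg);
  } catch (err) {
    // Nothing was broadcast, so no one saw a message that does not exist.
    return { status: 500, body: { ok: false, error: "persist_failed" } };
  }
  fanout.broadcast(msg.channel, msg);
  return { status: 200, body: { ok: true, channel: msg.channel } };
}

const okResult = handlePostMessage(makeStore(), makeFanout(), {
  channel: "C123",
  text: "Shipping the release in 5 min",
});
console.log(okResult.status); // 200
```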
Part 4 — What HTTP gives you that raw WebSocket does not
Now flip the design. The send path goes through HTTPS:
// Client
const res = await fetch("/api/chat.postMessage", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${token}`,
  },
  body: JSON.stringify({
    channel: "C123",
    text: "Shipping the release in 5 min",
    client_msg_id: "5b3c…-uuid", // idempotency key
  }),
});
const { ok, ts, channel } = await res.json();
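Look at what you get for free with this shape: an application-layer ack (the status code plus the { ok, ts, channel } body), a natural place to carry an idempotency key so retries are safe, and stateless requests that any backend instance can serve.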
The idempotency key, in detail
The Slack message payload includes a client_msg_id — a UUID generated by the client. The server stores it alongside the message row. If the client retries (because the first POST timed out and the client doesn’t know whether it succeeded), the second POST carries the same client_msg_id. The server sees a uniqueness conflict and returns the existing message’s metadata (the ts assigned the first time) instead of inserting a duplicate.
CREATE TABLE messages (
  ts            TEXT PRIMARY KEY,      -- server-assigned
  channel       TEXT NOT NULL,
  user_id       TEXT NOT NULL,
  text          TEXT NOT NULL,
  client_msg_id TEXT NOT NULL,
  UNIQUE (channel, client_msg_id)      -- dedup key
);
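The same dedup rule can be sketched in application code. This is an in-memory analogue of the UNIQUE (channel, client_msg_id) constraint, not how Slack implements it; `MessageStore`, `insertOrGet`, and the fake `ts` generator are hypothetical.

```javascript
// In-memory analogue of UNIQUE (channel, client_msg_id); names are illustrative.
class MessageStore {
  constructor() {
    this.byKey = new Map(); // "channel\u0000client_msg_id" -> row
    this.counter = 0;
  }

  // Insert once; a retry with the same key returns the original row.
  insertOrGet({ channel, user_id, text, client_msg_id }) {
    const key = `${channel}\u0000${client_msg_id}`;
    const existing = this.byKey.get(key);
    if (existing) return { row: existing, duplicate: true };

    this.counter += 1;
    // Fake server-assigned timestamp, only for the sketch.
    const ts = `1717070400.${String(this.counter).padStart(6, "0")}`;
    const row = { ts, channel, user_id, text, client_msg_id };
    this.byKey.set(key, row);
    return { row, duplicate: false };
  }
}

const messageStore = new MessageStore();
const attempt = { channel: "C123", user_id: "U1", text: "hi", client_msg_id: "abc" };
const first = messageStore.insertOrGet(attempt);
const retry = messageStore.insertOrGet(attempt); // same client_msg_id
console.log(first.row.ts === retry.row.ts, retry.duplicate); // true true
```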
The retry is safe. The UI behaviour is: “Sending…” → either “Sent” (200, possibly after one transparent retry) or “Failed, tap to retry” (4xx/5xx after backoff). The user never sees a phantom message.
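The UI behaviour described above can be written as a small state machine. The sketch below is one possible policy, assumed for illustration: timeouts and 5xx responses are retried with exponential backoff, 4xx responses fail immediately, and the constants `MAX_ATTEMPTS` and `BASE_DELAY_MS` are arbitrary.

```javascript
// Hypothetical client-side state machine for one outgoing message.
// Policy (an assumption): retry timeouts and 5xx, fail fast on 4xx.
const MAX_ATTEMPTS = 3;
const BASE_DELAY_MS = 500;

function initialState(clientMsgId) {
  return { clientMsgId, status: "sending", attempt: 1, retryInMs: null };
}

function nextState(state, event) {
  if (state.status !== "sending") return state; // terminal states ignore events

  const isResponse = event.type === "response";
  if (isResponse && event.httpStatus >= 200 && event.httpStatus < 300) {
    return { ...state, status: "sent", retryInMs: null };
  }

  const retryable = event.type === "timeout" || (isResponse && event.httpStatus >= 500);
  if (retryable && state.attempt < MAX_ATTEMPTS) {
    // The retry reuses the same clientMsgId, so the server can deduplicate.
    return {
      ...state,
      attempt: state.attempt + 1,
      retryInMs: BASE_DELAY_MS * 2 ** (state.attempt - 1),
    };
  }
  return { ...state, status: "failed", retryInMs: null };
}

let sendState = initialState("uuid-1");
sendState = nextState(sendState, { type: "timeout" });
console.log(sendState.status, sendState.attempt, sendState.retryInMs); // sending 2 500
sendState = nextState(sendState, { type: "response", httpStatus: 200 });
console.log(sendState.status); // sent
```

The ambiguous case (a timeout) never flips the UI to "Sent"; it stays in "Sending…" until a response settles it or the attempts run out.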
You could implement the same scheme over WebSocket. The framing differences don’t matter — JSON is JSON. But once you’ve added per-message IDs, an ack frame, retry-with-backoff, server-side dedup, and a state machine for in-flight messages, you have rebuilt a worse HTTP. With more bugs.
Load balancing is the other half
WebSockets are a load-balancing problem. The connection is sticky to one backend for its entire lifetime — minutes, hours, sometimes days. Deploying a new version means either gracefully draining millions of connections (slow), or just dropping them (chaos). A surge of new users hitting the same edge means rebalancing connections you can’t easily move.
HTTP requests are short-lived and stateless. Each chat.postMessage POST can be routed to whichever message-service instance is least loaded. A deploy is a rolling restart with zero user-visible impact. Capacity scales linearly with request rate, not connection count.
Putting the send path on HTTP means the most important operation in the product runs on the easiest-to-operate transport. The hard transport (WebSocket) is reserved for the use case that genuinely needs it: low-latency push of other people’s events.
Part 5 — The receive side: RTM → Events API, the same lesson in public
For Slack’s own clients, the receive side is still a WebSocket — and it’s fine, because the receive side is a hint, not a source of truth. (We’ll come back to why this is fine in Part 6.)
But for bots and third-party apps, Slack made the same architectural decision visible to the outside world. The history is worth tracing.
RTM API (legacy) — WebSocket events to your bot
The original way to write a Slack bot: call rtm.connect, get back a wss:// URL, hold the WebSocket open, and receive a stream of event JSON. Every message in every channel your bot was a member of came down that pipe.
The problems were the same ones that motivated the send-side decision:
- If your bot was disconnected, events were lost. Slack maintained a small replay window after reconnect, but it was best-effort. A long disconnect = missing events.
- No per-event acknowledgement. If your bot received an event but crashed before processing it, Slack had no way to know — no retry.
- Hard to operate. A bot serving many workspaces had to hold many WebSocket connections, manage reconnect storms, handle backpressure manually.
- One bug took down everything. A bug in your event loop could starve every event for every workspace, with no isolation per request.
Events API (2017) — HTTP webhooks to your bot
The replacement: Slack POSTs each event to an HTTPS endpoint you register. Your endpoint returns 200. If you return anything else, or fail to respond within 3 seconds, Slack retries up to 3 times with backoff (per Slack's documentation: almost immediately, then after about a minute, then after about five minutes). Every retry carries an X-Slack-Retry-Num header so you can tell. Every event has a unique event_id so you can deduplicate.
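A receiver for this contract can be sketched as a pure function. The header name and the `event_id` field come from the description above; the `url_verification` challenge echo is part of Slack's documented endpoint handshake. Everything else is an assumption for illustration: `handleSlackEvent`, the in-memory `seenEventIds` set, the `enqueue` callback, and lower-cased header keys (as Node's http module provides them). Request signature verification is omitted.

```javascript
// Sketch of an Events API receiver as a pure function.
// `seenEventIds` and `enqueue` are hypothetical; a real app would use a shared
// store and a job queue, and would verify the request signature first.
function handleSlackEvent(headers, body, seenEventIds, enqueue) {
  const retryNum = Number(headers["x-slack-retry-num"] || 0);

  if (body.type === "url_verification") {
    // One-time endpoint handshake: echo the challenge back.
    return { status: 200, body: body.challenge, handled: false, retryNum };
  }
  if (body.type !== "event_callback") {
    return { status: 200, handled: false, retryNum };
  }
  if (seenEventIds.has(body.event_id)) {
    // A retry of something already accepted: ack again, do no work.
    return { status: 200, handled: false, retryNum };
  }

  try {
    enqueue(body.event); // defer slow work so the 200 goes out quickly
  } catch (err) {
    return { status: 500, handled: false, retryNum }; // let Slack retry
  }
  seenEventIds.add(body.event_id);
  return { status: 200, handled: true, retryNum };
}

const seen = new Set();
const queue = [];
const payload = { type: "event_callback", event_id: "Ev123", event: { type: "message" } };
console.log(handleSlackEvent({}, payload, seen, (e) => queue.push(e)).handled); // true
console.log(
  handleSlackEvent({ "x-slack-retry-num": "1" }, payload, seen, (e) => queue.push(e)).handled
); // false
```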
This is the same trade Slack made on the send side, made public for app developers: important deliveries go over HTTP. The transport with the built-in ack semantics, the per-request scope, the standard middleware. WebSocket is reserved for the cases where push latency matters more than per-event durability.
Socket Mode (2020) — the nuance
Socket Mode came back because some apps cannot expose a public HTTPS endpoint — a CLI tool running on a laptop, a corporate bot stuck behind a firewall. So Slack opens a WebSocket back to your app and tunnels Events API messages through it. The wire is a WebSocket, but the contract is still HTTP-shaped: each event has an envelope_id, your app must send back an explicit ack, and unacked events are retried.
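The ack contract can be sketched in a few lines. Only `envelope_id` is taken from the description above; the rest of the envelope shape (`type: "events_api"`, `payload.event`), the assumption that frames without an `envelope_id` need no ack, and the `handleEnvelope` / `processEvent` names are assumptions for illustration.

```javascript
// Sketch of Socket Mode handling: every envelope gets an explicit ack frame.
// `processEvent` is a hypothetical callback supplied by the app.
function handleEnvelope(raw, processEvent) {
  const envelope = JSON.parse(raw);

  // Frames without an envelope_id (assumed: connection-level messages)
  // carry nothing to acknowledge.
  if (!envelope.envelope_id) return null;

  // Process before acking: if processEvent throws, no ack is produced
  // and the event remains eligible for retry.
  if (envelope.type === "events_api") processEvent(envelope.payload.event);

  return JSON.stringify({ envelope_id: envelope.envelope_id });
}

const received = [];
const ackFrame = handleEnvelope(
  JSON.stringify({
    envelope_id: "env-1",
    type: "events_api",
    payload: { event: { type: "message", text: "hi" } },
  }),
  (event) => received.push(event)
);
console.log(ackFrame); // {"envelope_id":"env-1"}
```

The returned string is what the app would write back on the socket, which is the WebSocket equivalent of returning 200 from a webhook.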
That distinction is the whole point of this post. The transport is not the architecture. Slack didn’t go back to fire-and-forget WebSocket; they kept the per-message acknowledgement model and just changed how the bytes move. The bug they were guarding against (silent message loss) is a bug at the application protocol layer, not the transport layer, and they fixed it at the right layer.
Part 6 — “WebSocket as hint, REST as truth”
If sends are over HTTP, what is the WebSocket actually carrying in a Slack client?
- Other people’s messages. The fanout push.
- Typing indicators, presence changes, reactions, edits. Low-value, high-frequency hints where loss is acceptable.
- Push wake-ups. "Something changed in channel C123 — go re-fetch if you care."
The key design pattern: the WebSocket message is a hint, the REST API is the truth. The client treats every event as “go check.” On reconnect, the client asks for state since its last known cursor:
GET /api/conversations.history?channel=C123&oldest=1717070400.000123
If the WebSocket missed three events while the laptop was suspended, the client doesn’t care — the catch-up REST call fills in the gap. The UI shows whatever the REST call returns. The WebSocket just shortens the window between “something happened” and “the UI knows.”
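Here is a minimal sketch of that catch-up step on the client, assuming messages are kept in ascending `ts` order and that `ts` values are "seconds.micros" strings as in the request above. The helper names (`compareTs`, `catchUpPath`, `mergeHistory`) are hypothetical, and the network call itself is left out.

```javascript
// Sketch of "WebSocket as hint, REST as truth" on the client.
// `ts` values are "seconds.micros" strings; helper names are illustrative.
function compareTs(a, b) {
  const [aSec, aSub = "0"] = a.split(".");
  const [bSec, bSub = "0"] = b.split(".");
  return Number(aSec) - Number(bSec) || Number(aSub) - Number(bSub);
}

// Build the catch-up request for everything after the last message we hold.
function catchUpPath(channel, messages) {
  const params = new URLSearchParams({ channel });
  if (messages.length > 0) params.set("oldest", messages[messages.length - 1].ts);
  return `/api/conversations.history?${params}`;
}

// Merge fetched history into local state, dropping anything already present.
function mergeHistory(local, fetched) {
  const byTs = new Map(local.map((m) => [m.ts, m]));
  for (const m of fetched) byTs.set(m.ts, m); // the REST copy wins on conflict
  return [...byTs.values()].sort((x, y) => compareTs(x.ts, y.ts));
}

const localMessages = [{ ts: "1717070400.000123", text: "before suspend" }];
console.log(catchUpPath("C123", localMessages));
// /api/conversations.history?channel=C123&oldest=1717070400.000123

const merged = mergeHistory(localMessages, [
  { ts: "1717070460.000200", text: "missed #2" },
  { ts: "1717070400.000123", text: "before suspend" },
  { ts: "1717070430.000100", text: "missed #1" },
]);
console.log(merged.map((m) => m.text)); // before suspend, missed #1, missed #2
```

Because the merge is keyed on `ts`, it does not matter whether a given message first arrived as a WebSocket event or in the catch-up response.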
This is the pattern that survives a flaky network, a server restart, a deploy, a Wi-Fi-to-cellular handoff, and a closed laptop lid. It’s the pattern Slack converged on; it’s the pattern Discord, Linear, Notion, Figma, and Google Docs all use; and it’s the pattern that scales because the expensive operations (durable writes) run on the cheap-to-operate transport (HTTP), and the cheap operations (hints) run on the expensive-to-operate transport (WebSocket).
Part 7 — When you should (and shouldn’t) do this
The takeaway is not “WebSockets are bad” or “always use HTTP.” It’s that sends and receives have different reliability requirements, and one transport rarely satisfies both.
| Operation | Transport | Why |
|---|---|---|
| User-initiated write that must persist | HTTP POST + idempotency key | Status code is the ack; retries are safe; LB is trivial |
| Server-to-client push (new events from others) | WebSocket | Low latency, fan-out friendly, loss tolerable if REST is truth |
| Server-to-server delivery of important events | HTTP webhook with retry + dedup | Per-event ack, standard middleware, no sticky connection |
| Presence, typing, cursors, ephemeral hints | WebSocket | High frequency, loss is fine, no durability required |
| Catch-up after disconnect | REST query with cursor | Authoritative source of truth, gaps fill themselves |
| Streaming bulk data (telemetry, video frames) | WebSocket or HTTP/2 streams | Continuous flow, individual loss often acceptable |
If you find yourself reaching for a WebSocket for the send path because “the connection is already there,” stop and ask:
- What happens if the connection drops between send() and the server’s persist?
- How does the client know the message was actually saved?
- How do you retry a send safely without duplicates?
- What does the UI show in the ambiguous case where you sent and don’t know if it landed?
If your answers are vague, you have the same bug Slack’s early architecture had. The fix is not to make the WebSocket smarter. The fix is to send the message over a transport that already has an ack built into it.
The honest one-liner
Slack sends over HTTP not because HTTP is faster (it isn’t), and not because WebSockets are unreliable at the transport level (TCP is fine). It sends over HTTP because the HTTP request/response shape forces an application-layer acknowledgement, an idempotency model, and a stateless load-balancing story that a raw WebSocket leaves you to invent yourself. For the one operation in the product where losing a message is unacceptable — the user pressing Enter — the cheapest correct answer is to use the protocol that was already designed for it.
The WebSocket is still there. It is doing the job it’s good at: pushing other people’s activity to your client with low latency, so the UI feels alive. The REST API is doing the job it’s good at: being the source of truth that the client trusts when the WebSocket misses something.
Two transports. Two jobs. One product that doesn’t lose your messages.