Slack Sends Messages Over HTTP, Not Its WebSocket

The durability bug that forced the split, and the design rule that came out of it

[Figure: Slack client (desktop · mobile · web) and Slack backend (edge · API · message store). SEND: HTTPS POST chat.postMessage, with status code, idempotency key, and retry. RECEIVE: WebSocket frames (RTM-style), a real-time hint that is ephemeral and has no app-layer ack. Why the split? Sends must be durable: HTTP gives you status codes, idempotency, retry, load balancing. Receives can be lossy: WebSocket gives you low-latency push, with REST as the source of truth.]

There is a fact about Slack’s architecture that surprises almost everyone who learns it for the first time:

When you press Enter on a Slack message, that message is not sent over the WebSocket connection your client is holding open. It is sent over a separate HTTPS POST request to chat.postMessage.

Slack has a WebSocket. It uses it heavily. But for the one operation you would most expect a WebSocket to handle — the user actually sending a message — Slack reaches past it and goes to HTTP.

This isn’t a quirk. It’s a deliberate design decision that came out of a class of durability bug that bit early real-time products hard. This post explains exactly what that bug looks like, why WebSocket sends are fundamentally not the right tool for the job, what HTTP gives you that the WebSocket protocol does not, and the receive-side migration (RTM API → Events API) that made the same architectural rule public for every app developer.

Part 1 — What people think Slack does vs. what it actually does

The intuitive mental model of a real-time chat app is something like this:

[Client] ──── WebSocket (everything) ────► [Server]
        ◄─── WebSocket (everything) ─────

One pipe. Bidirectional. Sends go up the pipe, events come back down the pipe. Simple, low-latency, “real-time.”

Slack’s actual model splits the two directions:

                       SEND PATH
[Client] ─── HTTPS POST /api/chat.postMessage ───► [Slack API]
         ◄── HTTP 200 { ok:true, ts, channel } ──

                      RECEIVE PATH
[Client] ◄── WebSocket frames (events) ──── [Slack edge/gateway]
          (typing, new messages from others, reactions, presence…)

The send and receive paths use different transports because they have different requirements. The send path needs durability and acknowledgement. The receive path needs low-latency push. Each transport is good at exactly one of those things.

[Figure: Two transports, two jobs. SEND · HTTPS POST (chat.postMessage, reactions.add, etc.): POST /api/chat.postMessage returns 200 { ok:true, ts:"…", channel:"…" }; the status code is an explicit ack; idempotency key, retryable, load-balanced per request. RECEIVE · WebSocket (other people’s messages, typing, presence): a long-lived wss://… connection carrying frames such as {"type":"message", "text":"hi", …}; low-latency push, fire-and-forget; REST is the source of truth on reconnect.]

This split shows up in two other places at Slack too:

  • Bots and apps. Slack’s original Real Time Messaging (RTM) API delivered events to bots over a WebSocket. In 2017, Slack introduced the Events API, which delivers events to bot servers over HTTPS POST webhooks. RTM is now strongly discouraged for new apps. The official guidance: receive events via HTTP webhooks, take actions via HTTP API calls.
  • Slash commands & interactivity. When a user runs a slash command or clicks a button in a Slack message, Slack invokes your app via an HTTPS POST — not over any open WebSocket your app might have. The pattern is consistent: actions that must not be lost go over HTTP.

One nuance up front. Slack also offers Socket Mode (2020), which lets a bot tunnel the Events API over a WebSocket. That is useful when your server cannot accept public webhooks (corporate firewalls, laptops). Socket Mode is a transport workaround, not a return to fire-and-forget delivery: Slack still treats each event as an HTTP-style request and expects an acknowledgement. We’ll come back to it.

Part 2 — Why the obvious answer (“send over the WebSocket”) is wrong

If you already have a WebSocket open, sending the message over it looks like the cheapest, fastest option. Zero new handshake, zero extra round trips, the connection is right there. So why doesn’t Slack do that?

Because WebSocket.send() does not mean what you think it means.

What WebSocket.send() actually returns

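In the browser, WebSocket.send(data) returns undefined. It throws only if the socket is still in the CONNECTING state; if the socket is already closing or closed, the browser silently discards the data without raising an error. And if it returns normally on an open socket, here is what it has actually guaranteed: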

  1. The data has been queued into the WebSocket’s internal buffer in the browser.

That’s it. That’s the whole guarantee.

It has not guaranteed that the bytes left your machine. The browser hands them down to the kernel’s TCP send buffer, which hands them down to the NIC, which puts them on the wire — eventually. You can inspect WebSocket.bufferedAmount to see how many bytes are still sitting in the WebSocket’s queue waiting to be handed to TCP, but even bufferedAmount === 0 only tells you the browser is done with them. The kernel might still be holding them.

And even when the bytes have reached the server’s kernel, the server process may not have read them yet. And even when the server process has read them, it may not have parsed the frame, validated it, persisted it, or fanned it out to other clients.

[Figure: Between WebSocket.send() and "the message is saved". Five buffers, any of which can drop the message if the connection dies: Browser WS (bufferedAmount) → Client TCP (SO_SNDBUF) → Network (in-flight packets) → Server TCP (SO_RCVBUF) → App process (parse · validate · persist). RFC 6455 has no application-layer "data delivered" frame: WebSocket gives you BINARY/TEXT frames, PING/PONG (liveness), and CLOSE (handshake). There is no protocol-level ack that says "message #42 was received and processed."]

The RFC 6455 gap

The WebSocket protocol (RFC 6455) defines six frame types: continuation, text, binary, close, ping, and pong. None of them is an application-layer acknowledgement.
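To make that list concrete, here is a small lookup of the opcodes RFC 6455 defines. The table values come from the RFC; the `frameType` helper name is just an illustration, not part of any WebSocket library.

```javascript
// Opcodes defined by RFC 6455, section 5.2. Values 0x3-0x7 and 0xB-0xF are reserved.
const WS_OPCODES = Object.freeze({
  0x0: "continuation",
  0x1: "text",
  0x2: "binary",
  0x8: "close",
  0x9: "ping",
  0xa: "pong",
});

// Illustrative helper: name the frame type for an opcode, or "reserved" if the RFC defines none.
function frameType(opcode) {
  return WS_OPCODES[opcode] ?? "reserved";
}
```

A pong is the closest thing to an acknowledgement in that table, and it only confirms that the peer's WebSocket stack answered a ping. It says nothing about whether any particular text or binary frame was parsed or persisted.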

TCP underneath provides byte-level reliability — bytes arrive in order, with retransmission. But TCP’s ACKs are between kernels, not between applications. When the client kernel’s TCP stack sees an ACK come back, all it learned is that the receiving kernel got the bytes. The receiving application process might not have called recv() yet. It might never call recv() because it just crashed.

HTTP’s genius is that the response is the application-layer ack. A 200 OK comes back only after the server’s handler ran. A 5xx means the server tried and failed. A timeout means you do not know. Each request is a small, scoped contract.

WebSocket has no equivalent. You can build one on top — most production systems do — but every team builds it differently, and getting it right is the same amount of work as just using HTTP.
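For a sense of what "building one on top" involves, here is a minimal sketch of a client-side ack layer. Everything in it is hypothetical: the `{ id, payload }` envelope, the `{ type: "ack", id }` reply, and the `createAckingSender` name are invented for illustration and are not Slack's protocol. It assumes `socket` is any object with a `send(string)` method.

```javascript
// Hypothetical ack layer over a socket-like object. Not a Slack or browser API.
function createAckingSender(socket, { timeoutMs = 5000 } = {}) {
  const pending = new Map(); // message id -> { resolve, reject, timer }
  let nextId = 1;

  // Sends a payload and returns a promise that settles only when an ack arrives,
  // the timeout fires, or the socket closes.
  function send(payload) {
    const id = String(nextId++);
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        pending.delete(id);
        reject(new Error(`no ack for message ${id}: outcome unknown`));
      }, timeoutMs);
      pending.set(id, { resolve, reject, timer });
      socket.send(JSON.stringify({ id, payload }));
    });
  }

  // Call from the socket's message handler. Returns true if the frame was an ack we were waiting for.
  function onFrame(raw) {
    const frame = JSON.parse(raw);
    if (frame.type !== "ack") return false;
    const entry = pending.get(frame.id);
    if (!entry) return false; // late or duplicate ack
    clearTimeout(entry.timer);
    pending.delete(frame.id);
    entry.resolve(frame);
    return true;
  }

  // Call when the socket closes: every in-flight send becomes ambiguous.
  function onClose() {
    for (const [id, entry] of pending) {
      clearTimeout(entry.timer);
      entry.reject(new Error(`socket closed before ack for message ${id}`));
    }
    pending.clear();
  }

  return { send, onFrame, onClose, pendingCount: () => pending.size };
}
```

Even this sketch leaves out retries, server-side dedup, and persistence of the pending map across page reloads, which is roughly the list of things an HTTP request with an idempotency key already covers.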

Part 3 — The durability bug, walked through carefully

Let’s play out exactly what goes wrong when you send a chat message over a WebSocket without an application-layer ack.

The setup

A naive design:

// Client
ws.send(JSON.stringify({
  type: "message",
  channel: "C123",
  text: "Shipping the release in 5 min",
}));
// UI immediately shows the message as "sent"

// Server (Node-ish pseudocode)
ws.on("message", async (raw) => {
  const msg = JSON.parse(raw);
  await db.insert("messages", msg);
  await fanout.broadcast(msg.channel, msg);
});

It works the majority of the time. The bug lives in the unhappy paths.

Failure mode 1: the connection dies between send and persist

[Client]
  ws.send(msg)                              ✓ queued in browser buffer
  TCP hands bytes to kernel                 ✓
  bytes leave NIC                           ✓
  ── packet in flight ──
                                  💥 server process OOM-killed
                                     between recv() and db.insert
  TCP eventually RSTs the socket
  Client gets a "close" event
  msg was never written to the message store

From the user’s perspective, the message looks sent — the UI showed it instantly. From the server’s perspective, the message never existed. Refresh the channel: it’s gone. Talk to the recipient: they never got it.

This is the bug. The client thought it shipped a durable message. The server made no such promise. There was no ack to wait for, so the client never knew the difference.

Failure mode 2: persist succeeded, the “you’re sent” signal was lost

[Client]
  ws.send(msg)
                                  server recv() ✓
                                  db.insert ✓
                                  fanout.broadcast ✓
                                  server sends ack frame
  ── ack packet drops ──
  Network blips, socket closes
  Client never sees the ack
  Client reconnects, retries the send
                                  server recv() ✓
                                  db.insert  → duplicate row (or duplicate fanout)

Same physics as the “two generals” problem. Unless the client attaches the same idempotency key to both attempts, the server has no way to tell “this is the retry of the message you already saved” apart from “this is a brand new message that happens to look similar.” Two copies in the channel. Two notifications on the recipient’s phone. Probably a support ticket.

Failure mode 3: half-open socket, send disappears into the void

WebSocket connections sit on top of TCP, and TCP’s ESTABLISHED state lies. A NAT or load balancer can drop the flow mapping after an idle period (AWS ALB defaults to 60s of idle). Neither end finds out until something tries to send.

[Client]                              [LB]                       [Server]
  WebSocket ESTABLISHED  ─────────  flow mapping  ─────────  ESTABLISHED
                                    💥 evicted after 60s idle
  ws.send("important msg")
  Bytes go into TCP send buffer
  Retransmit, retransmit, retransmit…
  Eventually RST, ~15 min later by default
  Client app may not surface that for minutes

The message looks like it was sent. There is no error event. The user moves on. Fifteen minutes later the client surfaces a connection error — but the message that was “sent” is gone, because it was buffered into a socket whose other end no longer existed.

You can layer application heartbeats on top to catch this faster (and you should — see TCP Keep-Alive vs Application Heartbeat). But none of that gets you per-message durability. It just gets you faster detection that the pipe is broken.
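As a rough illustration of such a heartbeat, the sketch below tracks when the peer last answered and tells the caller what to do on each timer tick. The names (`createHeartbeat`, `tick`, `onPong`) and the default intervals are assumptions made for this example; the clock is injectable so the logic can be exercised without real timers.

```javascript
// Hypothetical liveness tracker for an application-level ping/pong.
function createHeartbeat({ intervalMs = 15000, timeoutMs = 45000, now = Date.now } = {}) {
  let lastPong = now();      // treat connection setup as the first sign of life
  let lastPing = -Infinity;  // no ping sent yet

  return {
    // Call periodically. Returns "reconnect", "send_ping", or "idle".
    tick() {
      const t = now();
      if (t - lastPong >= timeoutMs) return "reconnect"; // peer silent too long: assume the pipe is dead
      if (t - lastPing >= intervalMs) {
        lastPing = t;
        return "send_ping";
      }
      return "idle";
    },
    // Call whenever the peer answers an application-level ping.
    onPong() {
      lastPong = now();
    },
  };
}
```

Note what the return values cover: they say the pipe is alive or dead. Nothing here can say which of the messages written before the pipe died were actually persisted.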

Failure mode 4: server fanned out before persisting

To minimise latency, an early implementation might broadcast the message to other clients first, then write it to the database. If the database write fails after the fanout, other people in the channel saw a message that doesn’t exist. If they reply, their reply references a parent message ID that’s not in the store. If they leave and re-open the channel, the original message is gone but the reply is there.

HTTP’s request/response shape pushes you toward the right order naturally: you cannot return 200 ok until persistence succeeded, so the “message visible” signal is gated on durability. WebSocket has no such structural pressure — the temptation to fanout first is real, and several real-time apps have fallen into it.
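The ordering that a request/response handler nudges you toward might look like the sketch below. `store` and `fanout` are hypothetical interfaces standing in for a database and a pub/sub layer, and the response shape is only modelled on the `{ ok, ts, channel }` body shown earlier.

```javascript
// Persist first, then broadcast. `store.insert` and `fanout.broadcast` are assumed async interfaces.
async function handleSend(store, fanout, msg) {
  let saved;
  try {
    saved = await store.insert(msg); // durability comes first
  } catch (err) {
    // Nothing has been broadcast, so no one saw a message that does not exist.
    return { status: 500, body: { ok: false, error: "persist_failed" } };
  }
  try {
    await fanout.broadcast(saved.channel, saved);
  } catch (err) {
    // The push is only a hint; the row exists, so clients can still fetch it over REST.
  }
  return { status: 200, body: { ok: true, ts: saved.ts, channel: saved.channel } };
}
```

A failed broadcast is deliberately not an error for the sender: the message is durable, and the catch-up path is what makes a missed push recoverable.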

Part 4 — What HTTP gives you that raw WebSocket does not

Now flip the design. The send path goes through HTTPS:

// Client
const res = await fetch("/api/chat.postMessage", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${token}`,
  },
  body: JSON.stringify({
    channel: "C123",
    text: "Shipping the release in 5 min",
    client_msg_id: "5b3c…-uuid",  // idempotency key
  }),
});
const { ok, ts, channel } = await res.json();

Look at what you get for free with this shape:

HTTP request/response = a small durability contract:

  • Status codes. 2xx = persisted & visible. 4xx = client bug, don’t retry. 5xx = retry with backoff. Timeout = ambiguous, so retry with the idempotency key.
  • Idempotency key. client_msg_id in the payload; a retried POST carries the same key; the server dedups in the message store; at-least-once becomes effectively once.
  • L7 load balancing. Each POST is routed independently; no sticky session per user; a deploy or restart doesn’t kill sends; scales horizontally trivially.
  • Standard middleware. CDN edge auth, rate-limit, WAF; retry-after and deprecation headers; tracing and structured logs per request; no custom framing to maintain.
  • Independent timeouts. Per-request deadline; a slow send doesn’t hang others; retry budget per call; circuit breakers compose naturally.
  • Explicit failure surface. No "silent fire-and-forget"; the UI can show "sending…" honestly; a red badge on real failure; no "looks sent but isn’t".
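On the client, that contract can be sketched as a small retry loop: 2xx means done, 4xx means stop, 5xx or a timeout means retry the identical body so the same client_msg_id goes out again. This is an illustration under assumptions, not Slack's client code: `post` is a hypothetical function that resolves to `{ status, json }` and rejects on timeout, and the attempt count and delays are arbitrary.

```javascript
// Maps an HTTP status to the next client action.
function classify(status) {
  if (status >= 200 && status < 300) return "done";
  if (status >= 500) return "retry";
  return "fail"; // 4xx and anything unexpected: resending the same request will not help
}

function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// `post` is a hypothetical (body) => Promise<{ status, json }>; a rejection models a timeout.
async function sendWithRetry(post, body, { maxAttempts = 3, baseDelayMs = 200, sleep = defaultSleep } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    let action = "retry";
    let res;
    try {
      res = await post(body); // same body, so the same client_msg_id, on every attempt
      action = classify(res.status);
    } catch (err) {
      // Timeout or network error: the outcome is unknown, so retry with the same key.
    }
    if (action === "done") return { state: "sent", response: res.json };
    if (action === "fail") return { state: "failed", status: res.status };
    if (attempt < maxAttempts) await sleep(baseDelayMs * 2 ** (attempt - 1)); // exponential backoff
  }
  return { state: "failed", status: null }; // retries exhausted: surface a real failure in the UI
}
```

The returned `state` is what drives the UI: "sending…" while the loop runs, then either "sent" or an honest failure.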

The idempotency key, in detail

The Slack message payload includes a client_msg_id — a UUID generated by the client. The server stores it alongside the message row. If the client retries (because the first POST timed out and the client doesn’t know whether it succeeded), the second POST carries the same client_msg_id. The server sees a uniqueness conflict and returns the existing message’s metadata (the ts assigned the first time) instead of inserting a duplicate.

CREATE TABLE messages (
  ts             TEXT PRIMARY KEY,           -- server-assigned
  channel        TEXT NOT NULL,
  user_id        TEXT NOT NULL,
  text           TEXT NOT NULL,
  client_msg_id  TEXT NOT NULL,
  UNIQUE (channel, client_msg_id)            -- dedup key
);

The retry is safe. The UI behaviour is: “Sending…” → either “Sent” (200, possibly after one transparent retry) or “Failed, tap to retry” (4xx/5xx after backoff). The user never sees a phantom message.
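The server side of that dedup can be sketched with an in-memory map standing in for the table and its UNIQUE (channel, client_msg_id) constraint. The `createMessageStore` name and the way `ts` is generated are placeholders for illustration; a real store would rely on the database constraint itself.

```javascript
// In-memory stand-in for the messages table. Not how Slack stores messages.
function createMessageStore() {
  const byKey = new Map(); // JSON [channel, client_msg_id] -> stored row
  let counter = 0;

  return {
    // Returns { row, created }. `created` is false when this is a retry of a stored message.
    insert({ channel, user_id, text, client_msg_id }) {
      const key = JSON.stringify([channel, client_msg_id]);
      const existing = byKey.get(key);
      if (existing) return { row: existing, created: false }; // retry: hand back the original ts
      counter += 1;
      const ts = `1717070400.${String(counter).padStart(6, "0")}`; // placeholder server-assigned ts
      const row = { ts, channel, user_id, text, client_msg_id };
      byKey.set(key, row);
      return { row, created: true };
    },
    count: () => byKey.size,
  };
}
```

Either way the handler can answer 200 with the same `ts`, so the client cannot tell a first attempt from a deduplicated retry, which is exactly the property that makes retrying safe.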

You could implement the same scheme over WebSocket. The framing differences don’t matter — JSON is JSON. But once you’ve added per-message IDs, an ack frame, retry-with-backoff, server-side dedup, and a state machine for in-flight messages, you have rebuilt a worse HTTP. With more bugs.

Load balancing is the other half

WebSockets are a load-balancing problem. The connection is sticky to one backend for its entire lifetime — minutes, hours, sometimes days. Deploying a new version means either gracefully draining millions of connections (slow), or just dropping them (chaos). A surge of new users hitting the same edge means rebalancing connections you can’t easily move.

HTTP requests are short-lived and stateless. Each chat.postMessage POST can be routed to whichever message-service instance is least loaded. A deploy is a rolling restart with zero user-visible impact. Capacity scales linearly with request rate, not connection count.

Putting the send path on HTTP means the most important operation in the product runs on the easiest-to-operate transport. The hard transport (WebSocket) is reserved for the use case that genuinely needs it: low-latency push of other people’s events.

Part 5 — The receive side: RTM → Events API, the same lesson in public

For Slack’s own clients, the receive side is still a WebSocket — and it’s fine, because the receive side is a hint, not a source of truth. (We’ll come back to why this is fine in Part 6.)

But for bots and third-party apps, Slack made the same architectural decision visible to the outside world. The history is worth tracing.

RTM API (legacy) — WebSocket events to your bot

The original way to write a Slack bot: call rtm.connect, get back a wss:// URL, hold the WebSocket open, and receive a stream of event JSON. Every message in every channel your bot was a member of came down that pipe.

The problems were the same ones that motivated the send-side decision:

  • If your bot was disconnected, events were lost. Slack maintained a small replay window after reconnect, but it was best-effort. A long disconnect = missing events.
  • No per-event acknowledgement. If your bot received an event but crashed before processing it, Slack had no way to know — no retry.
  • Hard to operate. A bot serving many workspaces had to hold many WebSocket connections, manage reconnect storms, handle backpressure manually.
  • One bug took down everything. A bug in your event loop could starve every event for every workspace, with no isolation per request.

Events API (2017) — HTTP webhooks to your bot

The replacement: Slack POSTs each event to an HTTPS endpoint you register. Your endpoint returns 200. If you return an error, or fail to respond within 3 seconds, Slack retries up to 3 times with backoff (per Slack’s documentation: almost immediately, then after about 1 minute, then after about 5 minutes). Every retry carries an X-Slack-Retry-Num header so you can tell. Every event has a unique event_id so you can deduplicate.
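A receiving endpoint built on those two fields might be sketched like this. The function is framework-agnostic and hypothetical: `seen` is assumed to be a Set of processed event_ids (a real service would use a shared store with expiry), `enqueue` is an assumed hand-off to a background worker, and header names are assumed to be lowercased as Node's HTTP server does. Request signature verification is omitted for brevity.

```javascript
// Hypothetical Events API handler core: dedupe on event_id, ack fast, do the work elsewhere.
function handleSlackEvent(headers, body, seen, enqueue) {
  const retryNum = Number(headers["x-slack-retry-num"] ?? 0); // header is absent on the first delivery
  if (seen.has(body.event_id)) {
    return { status: 200, duplicate: true, retryNum }; // already handled: ack again, do no work
  }
  enqueue(body.event);      // if this throws, the event is not marked seen and the retry will be processed
  seen.add(body.event_id);
  return { status: 200, duplicate: false, retryNum };
}
```

Handing the event to a queue before answering keeps the response inside the 3-second window even when the real processing is slow.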

RTM (WebSocket) vs Events API (HTTP): delivery semantics

  • RTM API · WebSocket ("Hold this socket forever and listen"). At-most-once delivery; no per-event ack; disconnect = events lost; reconnect storms on deploy; N connections per N workspaces; now strongly discouraged.
  • Events API · HTTPS POST ("We’ll POST to your URL, you ack with 200"). At-least-once delivery; 200 OK = explicit ack; up to 3 retries with backoff; event_id for client-side dedup; X-Slack-Retry-Num on retries; scales with HTTP infra you already have.

This is the same trade Slack made on the send side, made public for app developers: important deliveries go over HTTP. The transport with the built-in ack semantics, the per-request scope, the standard middleware. WebSocket is reserved for the cases where push latency matters more than per-event durability.

Socket Mode (2020) — the nuance

WebSocket delivery came back, in the form of Socket Mode, because some apps cannot expose a public HTTPS endpoint: a CLI tool running on a laptop, a corporate bot stuck behind a firewall. Instead of Slack calling in, your app opens an outbound WebSocket to Slack, and Slack tunnels Events API messages through it. The wire is a WebSocket, but the contract is still HTTP-shaped: each event has an envelope_id, your app must send back an explicit ack, and unacked events are retried.

That distinction is the whole point of this post. The transport is not the architecture. Slack didn’t go back to fire-and-forget WebSocket; they kept the per-message acknowledgement model and just changed how the bytes move. The bug they were guarding against (silent message loss) is a bug at the application protocol layer, not the transport layer, and they fixed it at the right layer.
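In code, the per-envelope acknowledgement is small. The sketch below assumes `ws` is any object with a `send(string)` method and `handle` is the app's own event handler; both names are illustrative. The ack shape, a JSON message echoing the envelope_id, follows Slack's Socket Mode documentation as best I understand it.

```javascript
// Hypothetical Socket Mode frame handler: process the payload, then ack the envelope by id.
function onSocketModeFrame(ws, raw, handle) {
  const envelope = JSON.parse(raw);
  if (!envelope.envelope_id) return false; // connection-level frame: nothing to acknowledge
  handle(envelope.payload);                // if this throws, no ack is sent and the event can be redelivered
  ws.send(JSON.stringify({ envelope_id: envelope.envelope_id }));
  return true;
}
```

The transport is a WebSocket, but the shape is request and response: one envelope in, one acknowledgement out, with redelivery when the acknowledgement never arrives.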

Part 6 — “WebSocket as hint, REST as truth”

If sends are over HTTP, what is the WebSocket actually carrying in a Slack client?

  • Other people’s messages. The fanout push.
  • Typing indicators, presence changes, reactions, edits. Low-value, high-frequency hints where loss is acceptable.
  • Push wake-ups. "Something changed in channel C123 — go re-fetch if you care."

The key design pattern: the WebSocket message is a hint, the REST API is the truth. The client treats every event as “go check.” On reconnect, the client asks for state since its last known cursor:

GET /api/conversations.history?channel=C123&oldest=1717070400.000123

If the WebSocket missed three events while the laptop was suspended, the client doesn’t care — the catch-up REST call fills in the gap. The UI shows whatever the REST call returns. The WebSocket just shortens the window between “something happened” and “the UI knows.”
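A minimal sketch of that catch-up step follows. `fetchHistory` is a hypothetical wrapper around the conversations.history call above that resolves to `{ messages: [{ ts, ... }] }`, and `state` is an assumed client-side object holding the cursor and a Map of messages keyed by ts. Timestamps are compared as strings, which assumes the fixed-width "seconds.micros" format shown in the example URL.

```javascript
// Reconcile local state with the REST API after a reconnect. Returns how many messages were fetched.
async function catchUp(fetchHistory, channel, state) {
  const { messages } = await fetchHistory({ channel, oldest: state.cursor });
  for (const m of messages) {
    state.byTs.set(m.ts, m);                      // keyed by ts, so a message also seen over the socket is not duplicated
    if (m.ts > state.cursor) state.cursor = m.ts; // string comparison; relies on fixed-width ts values
  }
  return messages.length;
}
```

Because messages are keyed by ts, it does not matter whether a given message first arrived as a WebSocket hint or through this call; the REST response simply overwrites whatever the client had.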

This is the pattern that survives a flaky network, a server restart, a deploy, a Wi-Fi-to-cellular handoff, and a closed laptop lid. It’s the pattern Slack converged on; it’s the pattern Discord, Linear, Notion, Figma, and Google Docs all use; and it’s the pattern that scales because the expensive operations (durable writes) run on the cheap-to-operate transport (HTTP), and the cheap operations (hints) run on the expensive-to-operate transport (WebSocket).

[Figure: WebSocket as hint · REST as truth. The client talks to the WS edge (hint) for connect and event push, and to the REST API (truth) via GET /conversations.history?oldest=cursor.]

On WS reconnect:

  1. Client reconnects WebSocket, ignores whatever it missed
  2. Client calls REST API with its last-known cursor
  3. UI reflects what REST returned; the WS gap is invisible to the user

Part 7 — When you should (and shouldn’t) do this

The takeaway is not “WebSockets are bad” or “always use HTTP.” It’s that sends and receives have different reliability requirements, and one transport rarely satisfies both.

Operation | Transport | Why
User-initiated write that must persist | HTTP POST + idempotency key | Status code is the ack; retries are safe; LB is trivial
Server-to-client push (new events from others) | WebSocket | Low latency, fan-out friendly, loss tolerable if REST is truth
Server-to-server delivery of important events | HTTP webhook with retry + dedup | Per-event ack, standard middleware, no sticky connection
Presence, typing, cursors, ephemeral hints | WebSocket | High frequency, loss is fine, no durability required
Catch-up after disconnect | REST query with cursor | Authoritative source of truth, gaps fill themselves
Streaming bulk data (telemetry, video frames) | WebSocket or HTTP/2 streams | Continuous flow, individual loss often acceptable

If you find yourself reaching for a WebSocket for the send path because “the connection is already there,” stop and ask:

  • What happens if the connection drops between send() and the server’s persist?
  • How does the client know the message was actually saved?
  • How do you retry a send safely without duplicates?
  • What does the UI show in the ambiguous case where you sent and don’t know if it landed?

If your answers are vague, you have the same bug Slack’s early architecture had. The fix is not to make the WebSocket smarter. The fix is to send the message over a transport that already has an ack built into it.

The honest one-liner

Slack sends over HTTP not because HTTP is faster (it isn’t), and not because WebSockets are unreliable at the transport level (TCP is fine). It sends over HTTP because the HTTP request/response shape forces an application-layer acknowledgement, an idempotency model, and a stateless load-balancing story that a raw WebSocket leaves you to invent yourself. For the one operation in the product where losing a message is unacceptable — the user pressing Enter — the cheapest correct answer is to use the protocol that was already designed for it.

The WebSocket is still there. It is doing the job it’s good at: pushing other people’s activity to your client with low latency, so the UI feels alive. The REST API is doing the job it’s good at: being the source of truth that the client trusts when the WebSocket misses something.

Two transports. Two jobs. One product that doesn’t lose your messages.