TCP Keep-Alive vs Application Heartbeat — Three Different Things Called “Keep-Alive”

The naming catastrophe — three things called "keep-alive"

Before anything else: the word "keep-alive" gets used for three completely different things: the HTTP Connection: keep-alive header, the TCP SO_KEEPALIVE socket option, and the application-level heartbeat. Most arguments about which is "better" are actually people talking past each other.

For the rest of this post, "keep-alive" means the TCP socket option. The HTTP header is mentioned only to get it out of the way:

About the HTTP header. When a browser sends Connection: keep-alive, it is asking the server not to close the TCP socket after this response, so the next request can reuse it. That avoids a fresh 3-way handshake (and TLS handshake) for every request. In HTTP/1.1 this is the default; in HTTP/2 the connection is multiplexed and reused by design. None of this has anything to do with detecting whether the peer is alive.

How TCP actually tracks "connected"

To understand why long-lived sockets die silently, you need to know one uncomfortable truth about TCP: it has no built-in liveness check by default.

The 3-way handshake establishes a connection — SYN, SYN-ACK, ACK. After that, both sides have a record of the connection, identified by the 4-tuple (src IP, src port, dst IP, dst port). They each track sequence numbers so data isn't reordered or duplicated. The OS marks the socket as ESTABLISHED.

And that's it. The socket can sit in ESTABLISHED forever, even if the other machine has been unplugged for a week. TCP only learns about a problem when something tries to send — then either the peer's TCP stack replies with RST (if it's still around but doesn't recognize the connection), or the local TCP retransmits and eventually times out.

This is the gap that both TCP keep-alive and application heartbeats are trying to close, in different ways and at different layers.

The half-open socket — how connections silently die

"Half-open" is the term for a TCP connection where one side believes the connection is alive and the other does not (or no longer exists). Four common ways this happens in production:

Peer machine power-cut or kernel panic. No FIN is ever sent. Your side never finds out unless it tries to send.
NAT or stateful firewall drops the flow mapping. Home routers often expire idle TCP flows after roughly 5 minutes. AWS NLB defaults to 350 seconds. AWS ALB defaults to 60 seconds. Once the mapping is dropped, later packets are either silently discarded or answered with an RST, depending on the device.
Network path change. Mobile client hands off from Wi-Fi to cellular. The 4-tuple's source IP changes, the old socket is orphaned, and the server keeps a zombie.
Middlebox idle-flow eviction. Stateful firewalls and load balancers cap the number of concurrent flows they track. Idle ones get evicted first — silently.

If you SSH into a server running a long-lived socket service and run ss -tan, you'll see something like:

$ ss -tan state established | wc -l
83214

$ ss -tan state established '( dport = :443 )' | head -5
ESTAB  0  0  10.0.1.4:43221  10.0.2.7:443
ESTAB  0  0  10.0.1.4:43227  10.0.2.7:443
ESTAB  0  0  10.0.1.4:43231  10.0.2.7:443

83,214 ESTABLISHED sockets. The kernel is happy. How many of those have a peer that will ever speak again? The kernel has no idea. It has not tried to send anything, so it has not noticed.

TCP keep-alive — what the kernel actually does

When you set SO_KEEPALIVE on a socket, the kernel periodically sends a probe packet on idle connections. The probe is a strange little thing: a TCP segment with no payload and a sequence number set to current_seq - 1. Because that sequence number has already been acknowledged, the peer's stack treats the segment as a duplicate and answers with an ACK carrying its current acknowledgment number. If the peer is gone, no answer comes; after enough silence, the kernel declares the connection dead and surfaces an error to your app on the next read or write.

Three knobs control this on Linux:

# Defaults on most Linux distros
$ sysctl net.ipv4.tcp_keepalive_time      # 7200    (idle seconds before first probe)
$ sysctl net.ipv4.tcp_keepalive_intvl     # 75      (seconds between probes)
$ sysctl net.ipv4.tcp_keepalive_probes    # 9       (failed probes before giving up)

Do the math: a freshly broken connection takes 7200 + (9 × 75) = 7875 seconds, or about 2 hours and 11 minutes, to be detected. That is the default. For anything that matters, you must override it per socket:

// Node.js — first probe after 30s idle
socket.setKeepAlive(true, 30_000);

// Go — first probe after 30s idle
tcpConn.SetKeepAlive(true)
tcpConn.SetKeepAlivePeriod(30 * time.Second)

(Note: setKeepAlive in most high-level runtimes only exposes the idle time, not the probe interval or count. To tune those you use setsockopt with TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT directly.)
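
To make the detection-time arithmetic above concrete, here is a minimal sketch of the worst case as a function of the three knobs. The function name and the "tuned" values (30 s idle, 10 s interval, 3 probes) are illustrative assumptions, not settings taken from any library or recommendation.

```javascript
// Worst-case time for kernel keep-alive to declare a silent peer dead:
// idle time before the first probe, plus one interval per unanswered probe.
function keepaliveDetectionSeconds({ idle, interval, probes }) {
  return idle + interval * probes;
}

// Linux defaults quoted above.
const linuxDefaults = { idle: 7200, interval: 75, probes: 9 };

// Hypothetical per-socket tuning (TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT).
const tuned = { idle: 30, interval: 10, probes: 3 };

console.log(keepaliveDetectionSeconds(linuxDefaults)); // 7875
console.log(keepaliveDetectionSeconds(tuned));         // 60
```

Setting only the idle time shortens the first term; the probe interval and count still contribute whatever the runtime or the system-wide sysctls dictate.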

What TCP keep-alive does NOT detect: application deadlock, GC pause, a worker thread stuck in a slow query, an event loop blocked on a CPU-bound task, or any application-layer protocol that's wedged. The kernel's TCP stack is alive and answering probes — but your code might be hung. Kernel responding ≠ app responding.

Application heartbeat — what your code does

An application heartbeat is just a message your protocol defines — sent on a timer, expecting a reply on a timer. The crucial difference from TCP keep-alive is that the heartbeat traverses your application code. To answer it, the peer's event loop must spin, the message must be parsed, and a reply must be written. If the peer process is hung, deadlocked, or mid-GC for too long, the heartbeat goes unanswered — and that's exactly what you wanted to detect.

Three patterns cover almost every case:

Ping/pong. Built into the WebSocket protocol (RFC 6455 §5.5.2). The server sends a ping frame; the client's WebSocket library auto-replies with a pong frame. If a pong doesn't arrive in time, the server closes the socket.
Periodic empty message. MQTT's PINGREQ/PINGRESP; Kafka's consumer group heartbeat thread. The protocol defines a no-op message specifically for liveness.
Read-deadline reset. Every successful read pushes a deadline forward. If the deadline expires with no data, kill the socket. Common in Go (conn.SetReadDeadline) and gRPC (which has its own keepalive subsystem layered on top of HTTP/2).
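
The read-deadline pattern can be sketched in Node.js terms as follows. This is a minimal sketch under stated assumptions: ReadDeadline and guardSocket are hypothetical names, the check period is arbitrary, and the deadline is tracked by hand so that only incoming data moves it forward.

```javascript
// Tracks a deadline that moves forward on every successful read.
class ReadDeadline {
  constructor(timeoutMs, now = Date.now()) {
    this.timeoutMs = timeoutMs;
    this.expiresAt = now + timeoutMs;
  }

  // Call on every successful read to push the deadline forward.
  touch(now = Date.now()) {
    this.expiresAt = now + this.timeoutMs;
  }

  // True once the deadline has passed with no read in between.
  expired(now = Date.now()) {
    return now >= this.expiresAt;
  }
}

// Hypothetical helper: wires the tracker to a net.Socket-like object.
// Returns a function that stops the periodic check.
function guardSocket(socket, timeoutMs, checkEveryMs = 1000) {
  const deadline = new ReadDeadline(timeoutMs);
  socket.on('data', () => deadline.touch());
  const timer = setInterval(() => {
    if (deadline.expired()) {
      clearInterval(timer);
      socket.destroy(new Error(`no data for ${timeoutMs} ms`));
    }
  }, checkEveryMs);
  socket.once('close', () => clearInterval(timer));
  return () => clearInterval(timer);
}
```

The deadline logic takes the clock as a parameter so it can be exercised without real timers; guardSocket is only the glue.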

The canonical Node.js ws library pattern looks like this:

// Server-side heartbeat: the pattern from the ws README
import { WebSocketServer } from 'ws';

const wss = new WebSocketServer({ port: 8080 });

function heartbeat() {
  this.isAlive = true;
}

wss.on('connection', (ws) => {
  ws.isAlive = true;
  ws.on('error', console.error);
  ws.on('pong', heartbeat);   // client replied, so mark it alive
});

const interval = setInterval(() => {
  wss.clients.forEach((ws) => {
    if (ws.isAlive === false) return ws.terminate(); // no pong since last ping
    ws.isAlive = false;
    ws.ping();                  // send ping; pong handler resets the flag
  });
}, 30_000);

// Stop the timer when the server shuts down
wss.on('close', () => clearInterval(interval));

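Two things to notice. First, terminate(), not close(): the latter waits for a graceful close handshake the peer can no longer participate in. Second, each client gets one full interval (30 seconds here) to answer a ping before the socket is killed. A single dropped packet is retransmitted by TCP well within that window, so it doesn't trigger a disconnect, but a client that leaves one ping unanswered is terminated on the next round.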
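
If one unanswered ping is too strict for your workload, the boolean flag can be generalized to a miss counter. A minimal sketch, assuming a hypothetical LivenessTracker helper that is not part of the ws library:

```javascript
// Counts consecutive unanswered pings for one connection.
class LivenessTracker {
  constructor(maxMisses) {
    this.maxMisses = maxMisses;
    this.misses = 0;
  }

  // Call from the 'pong' handler: the peer answered, so reset the count.
  onPong() {
    this.misses = 0;
  }

  // Call once per heartbeat round, before sending the next ping.
  // Returns true when the connection should be terminated.
  onTick() {
    if (this.misses >= this.maxMisses) return true;
    this.misses += 1; // count this round's ping as missed until a pong arrives
    return false;
  }
}

// Usage inside the interval loop (sketch):
//   if (ws.tracker.onTick()) return ws.terminate();
//   ws.ping();
```

With maxMisses set to 1 this behaves like the isAlive flag; larger values trade slower detection for more tolerance of stalls.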

The decision framework

This is the load-bearing section. The right answer almost always depends on what's between you and the other side — and what kind of failure you actually need to catch.

Scenario	TCP keep-alive	App heartbeat	Why
Internal service-to-service, fast LAN	Sometimes	Rarely	Connections are short-lived; a failed write surfaces RST quickly
HTTP/1.1 keep-alive reuse over LB	No	No	LB idle-timeout governs; tune the connection pool's max-idle and reaping
Long-lived gRPC streams	Yes (~10s)	Yes (gRPC keepalive)	gRPC has its own keepalive layer over HTTP/2; tune both
WebSockets through CDN / NAT	Optional	Required	CDN/NAT silently drops idle flows; ping interval must be < their idle timeout
MQTT IoT fleet	No	Required	Spec mandates `PINGREQ`; keep-alive value is negotiated at `CONNECT`
DB connection pool	Yes (30–60s)	Sometimes (`SELECT 1`)	Cheap detection of stale pool entries before a real query hits one
Behind a strict corporate firewall	Required	Required	Firewalls may evict idle flows or filter either kind of probe; run both so that whichever one gets through keeps the flow alive

Four rules of thumb:

"Is the route alive?" → TCP keep-alive.
"Is the peer process alive and processing?" → application heartbeat.
"Is there a NAT, LB, or firewall in the middle with an idle timeout?" → application heartbeat at an interval comfortably below that timeout.
"Could my app GC-pause for 30s under load?" → tune heartbeat tolerance (how many misses before close), not just frequency. Otherwise a stop-the-world pause kills every healthy connection at once.

The belt-and-suspenders move. For long-lived sockets through middleboxes, configure both: an application heartbeat at, say, 25 seconds (well under typical NAT/LB idle timeouts), and TCP keep-alive at 30–60 seconds as a safety net. The heartbeat catches app-level failures and keeps the flow mapping warm; the kernel's keep-alive catches things that crashed the heartbeat thread itself.
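
The interval and tolerance rules above can be turned into a small calculation. This is a sketch under stated assumptions: both function names are hypothetical, the 0.5 safety factor is an arbitrary choice rather than a standard, and the example timeouts are the ALB, NLB, and home-router figures quoted earlier.

```javascript
// Picks a heartbeat interval comfortably below the shortest idle timeout
// on the path. The 0.5 safety factor is an assumption, not a standard.
function pickHeartbeatIntervalSec(idleTimeoutsSec, safetyFactor = 0.5) {
  if (idleTimeoutsSec.length === 0) {
    throw new RangeError('need at least one idle timeout');
  }
  const shortest = Math.min(...idleTimeoutsSec);
  return Math.floor(shortest * safetyFactor);
}

// Roughly how long a stalled peer (for example during a GC pause) can stay
// silent before being closed, if it is allowed `maxMisses` missed rounds.
function approxToleratedPauseSec(intervalSec, maxMisses) {
  return intervalSec * maxMisses;
}

const heartbeatSec = pickHeartbeatIntervalSec([60, 350, 300]);
console.log(heartbeatSec);                             // 30
console.log(approxToleratedPauseSec(heartbeatSec, 2)); // 60
```

The interval is driven by the middleboxes; the tolerance is driven by how long your own process can plausibly stall. Tuning them separately keeps flow mappings warm without closing healthy connections during a pause.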

Cost — why you can't just heartbeat every second

Heartbeats look cheap and they mostly are — until they aren't.

Bandwidth math. 100,000 connections × one ~60-byte heartbeat every 30 seconds = ~200 KB/s on the wire. Trivial. Drop the interval to 1 second: 6 MB/s. Still fine for a single host on a 10G NIC.
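
The same arithmetic as a small function, for plugging in your own numbers. The function name is hypothetical, and the figure covers one direction only: it counts the heartbeat itself, not the reply.

```javascript
// One-way heartbeat traffic in bytes per second.
function heartbeatBytesPerSecond(connections, bytesPerHeartbeat, intervalSec) {
  return (connections * bytesPerHeartbeat) / intervalSec;
}

console.log(heartbeatBytesPerSecond(100_000, 60, 30)); // 200000  (~200 KB/s)
console.log(heartbeatBytesPerSecond(100_000, 60, 1));  // 6000000 (6 MB/s)
```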

The real cost isn't bandwidth. It's wakeups. Every heartbeat is a timer firing, an event loop iteration, a syscall to write a few bytes, plus the syscall on the read side when the reply arrives. 100,000 connections with a 1 Hz heartbeat = 100,000 timer wakeups per second on each side, plus the inverse storm of replies. CPU goes up, latency-sensitive work suffers.

If you need sub-second heartbeats at scale, batch them into a timer wheel (Netty's HashedWheelTimer is the canonical implementation) so a single timer tick wakes up many connections at once. Otherwise, keep the interval as long as the slowest middlebox in your path will tolerate.
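
To illustrate the batching idea, here is a deliberately simplified wheel in which one tick services a whole bucket of connections. It is a sketch of the concept only, not Netty's HashedWheelTimer: the HeartbeatWheel class and its methods are hypothetical, and it handles a single fixed period rather than arbitrary timeouts.

```javascript
// Minimal timer wheel: one periodic tick services a whole bucket of
// connections, instead of one timer per connection.
class HeartbeatWheel {
  constructor(slots) {
    this.buckets = Array.from({ length: slots }, () => new Set());
    this.cursor = 0;
  }

  // Spread connections across slots; `id` is any non-negative integer key.
  add(id, conn) {
    const slot = id % this.buckets.length;
    this.buckets[slot].add(conn);
    return slot;
  }

  remove(id, conn) {
    this.buckets[id % this.buckets.length].delete(conn);
  }

  // Called by a single timer. Runs `fn` for every connection in the current
  // slot, then advances. Each connection is visited once per full rotation.
  tick(fn) {
    const bucket = this.buckets[this.cursor];
    for (const conn of bucket) fn(conn);
    this.cursor = (this.cursor + 1) % this.buckets.length;
    return bucket.size;
  }
}

// Usage sketch: 10 slots ticked every 50 ms gives each connection a
// heartbeat every 500 ms while using a single timer.
//   const wheel = new HeartbeatWheel(10);
//   setInterval(() => wheel.tick((conn) => conn.ping()), 50);
```

The number of timer wakeups now depends on the slot count and tick rate, not on the number of connections.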

Reference cheatsheet

Where	Knob	Default	What it does
Linux kernel	`net.ipv4.tcp_keepalive_time`	7200s	Idle seconds before first probe
Linux kernel	`net.ipv4.tcp_keepalive_intvl`	75s	Seconds between probes
Linux kernel	`net.ipv4.tcp_keepalive_probes`	9	Failed probes before drop
Node.js	`socket.setKeepAlive(true, ms)`	off	Per-socket idle time
Go	`conn.SetKeepAlivePeriod(d)`	15s on dialer	Per-socket idle time
Java/Netty	`ChannelOption.SO_KEEPALIVE`	off	Enables kernel keep-alive on channel
nginx upstream	`keepalive_time`, `keepalive_timeout`	1h / 60s	Idle pool reuse window
AWS ALB	idle timeout	60s	Drops idle TCP flows; need heartbeat < 60s
AWS NLB	idle timeout	350s	Same, but at L4; the flow entry is removed without notifying either side
WebSocket (RFC 6455)	ping/pong frames	off	Application-layer heartbeat at the protocol level
MQTT	`Keep Alive` in CONNECT	0 (off)	Negotiated PINGREQ interval

The lesson

"Connected" is a lie your kernel tells you by default.

TCP doesn't probe. NAT boxes evict idle flows. Load balancers drop idle sockets after as little as 60 seconds. Your peer process can hang while its kernel cheerfully answers probes. Pick the layer that answers the question you actually care about (route liveness or peer liveness), and for anything long-lived, configure both. The cheapest debugging session is the one you avoid by setting two timers correctly the first time.