If a tree falls in a forest and no one hears it, did it make a sound? Distributed systems face a crueler version every second: a server stops answering you — did it die, or did your message just get lost on the way? You can’t tell. Not “it’s hard” — you provably cannot tell from where you’re standing.
So the one truth this whole post hangs on: “I can’t reach it” is not the same as “it’s dead.” One is about the network; the other is about the node. Treating them as the same is how a healthy cluster talks itself into a failover at 3 AM. Every system you’ve shipped has a tiny subsystem that makes this call — and the naive version, “send a ping, no reply in a few seconds, mark it dead,” is wrong in a few interesting ways. Two words fix it: suspect and corroborate.
The lie in “no reply means dead”
Here’s the detector almost everyone writes first:
async function isAlive(peer) {
  try {
    await ping(peer, { timeoutMs: 3000 });
    return true;
  } catch {
    return false; // <-- "no reply in 3s, therefore dead"
  }
}
The bug isn’t the code — it’s the word therefore. A missing reply has at least four explanations, and you can’t tell them apart:
- Dead — the process crashed and will never answer.
- Slow — it’s alive but mid-GC-pause or CPU-starved; the reply is coming.
- Unreachable from you — a partition ate your packet; other nodes still reach it fine.
- Ack lost — it answered; the answer never made it back.
That’s why the literature is picky about words: a process that truly stopped is dead; one you only think stopped is suspected. A failure detector never gives you the first — only ever the second.
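To see why the four cases collapse into one, here is a small sketch that models each as a made-up scenario and asks what the caller can actually observe. The field names (`requestDelivered`, `replyAfterMs`, `replyDelivered`) and the numbers are illustrative assumptions, not part of any real API.

```javascript
// Four hypothetical scenarios behind a missing reply.
const scenarios = {
  dead:        { requestDelivered: true,  replyAfterMs: Infinity, replyDelivered: false },
  slow:        { requestDelivered: true,  replyAfterMs: 8000,     replyDelivered: true  },
  unreachable: { requestDelivered: false, replyAfterMs: 10,       replyDelivered: false },
  ackLost:     { requestDelivered: true,  replyAfterMs: 10,       replyDelivered: false },
};

// The only thing the caller can observe: did an ack arrive before the deadline?
function observe({ requestDelivered, replyAfterMs, replyDelivered }, timeoutMs) {
  const ackArrived = requestDelivered && replyDelivered && replyAfterMs <= timeoutMs;
  return ackArrived ? "ack" : "timeout";
}

const observations = Object.fromEntries(
  Object.entries(scenarios).map(([name, s]) => [name, observe(s, 3000)])
);
console.log(observations); // every scenario yields "timeout"
```

Four different causes, one identical observation. Whatever the detector returns, it is a statement about what you saw, not about what happened.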
This is just FLP impossibility in work clothes: in an asynchronous system you can’t reliably tell a crashed node from a slow one, so no deterministic protocol can guarantee consensus if even one process may crash. The practical fix is to accept a detector that’s allowed to be wrong, and that turns out to be enough to build real systems on.
The trade-off you can’t escape
Judge any detector on two things in tension: completeness (every dead node is eventually suspected) and accuracy (live nodes aren’t falsely accused). The provable catch: you can’t be both perfectly fast and perfectly accurate. Tight timeouts catch crashes quickly but accuse nodes that merely paused; loose timeouts are accurate but slow to react.
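A quick way to feel the tension is to replay a trace of heartbeat gaps against a few timeouts. The sketch below uses an invented trace from a node that never actually died; `liveGapsMs` and `evaluateTimeout` are hypothetical names for illustration.

```javascript
// Hypothetical trace: gaps (ms) between consecutive heartbeats from a LIVE node.
// Most arrive about 1s apart; two are stretched by pauses.
const liveGapsMs = [1000, 1020, 980, 2500, 1010, 990, 4200, 1000];

// For a given timeout, count how often the live node would be wrongly suspected,
// and report how long a real crash could go unnoticed.
function evaluateTimeout(timeoutMs, gapsMs) {
  const falseSuspicions = gapsMs.filter((gap) => gap > timeoutMs).length;
  return {
    timeoutMs,
    falseSuspicions,
    // A node that crashes right after a heartbeat stays unsuspected this long.
    worstCaseDetectionMs: timeoutMs,
  };
}

const results = [1500, 3000, 5000].map((t) => evaluateTimeout(t, liveGapsMs));
console.table(results);
```

On this trace the 1.5 s timeout accuses the live node twice, the 3 s timeout once, and the 5 s timeout never, but each step toward accuracy adds to the time a real crash goes unnoticed.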
Pings, heartbeats, and their blind spot
Two ways to check a peer, differing only in direction. A ping: you ask “are you there?” and wait for an ack. A heartbeat: the peer keeps sending “still here,” and you watch for its absence. Both land in the same place: a table of peers with a last-seen time, where anyone too quiet gets suspected. Akka’s deadline detector is the textbook example.
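The last-seen table is small enough to sketch in full. This is a minimal version under simple assumptions (a single fixed timeout, timestamps passed in explicitly so it is easy to test); the class and method names are made up for illustration.

```javascript
// Minimal heartbeat table: peer id -> timestamp of the last heartbeat seen.
class HeartbeatTable {
  constructor(timeoutMs) {
    this.timeoutMs = timeoutMs;
    this.lastSeen = new Map();
  }

  // Call whenever a heartbeat (or a ping ack) from `peer` arrives.
  recordHeartbeat(peer, nowMs = Date.now()) {
    this.lastSeen.set(peer, nowMs);
  }

  // Peers silent for longer than the timeout. These are suspects, not confirmed dead.
  suspects(nowMs = Date.now()) {
    const out = [];
    for (const [peer, seenAt] of this.lastSeen) {
      if (nowMs - seenAt > this.timeoutMs) out.push(peer);
    }
    return out;
  }
}

const table = new HeartbeatTable(3000);
table.recordHeartbeat("p2", 0);
table.recordHeartbeat("p3", 2500);
console.log(table.suspects(4000)); // p2 has been silent 4000ms, p3 only 1500ms
```

Note the method is called `suspects`, not `dead`: the table records silence as seen from this node and nothing more.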
The catch with this whole family: it captures one node’s point of view. P1 deciding P2 is dead says nothing about whether P3 can still reach P2. Fixing that blind spot is what everything below is about.
SWIM: don’t take your own word for it
The single best idea to steal. When your direct ping to a peer goes unanswered, don’t conclude anything — outsource the question:
- P1 pings P2. No reply.
- P1 picks a few random members (P3, P4) and asks each: “ping P2 for me.”
- If any of them gets a reply, it relays the ack back to P1.
- Only if every indirect probe also fails does P1 suspect P2.
Now P2 is judged from several vantage points, so a broken P1→P2 link stops masquerading as a dead P2, and the probes fan out in parallel, so the answer is fast. Each node probes only one peer per round and enlists only a handful of helpers, so per-node load stays roughly constant as the cluster grows, which is why it scales. Serf and Consul’s membership are built on this.
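The steps above can be sketched as a single probe round. This is a simplified illustration, not SWIM as specified: `ping(from, to)` is a hypothetical transport function injected by the caller, `makeNetwork` is a fake network for the demo, and the helper count `k` is an assumed default.

```javascript
// One SWIM-style probe round.
// `ping(from, to)` is a hypothetical transport: resolves true on ack, false on timeout.
async function probe(self, target, members, ping, k = 2) {
  // Step 1: direct ping.
  if (await ping(self, target)) return "alive";

  // Step 2: pick up to k other members to ping the target on our behalf.
  const helpers = members
    .filter((m) => m !== self && m !== target)
    .sort(() => Math.random() - 0.5) // crude shuffle, good enough for a sketch
    .slice(0, k);

  // Indirect probes run in parallel; each needs self->helper and helper->target to work.
  const acks = await Promise.all(
    helpers.map(async (h) => (await ping(self, h)) && (await ping(h, target)))
  );

  // Step 3: any relayed ack clears the target; only total silence raises suspicion.
  return acks.some(Boolean) ? "alive" : "suspected";
}

// Fake network for the demo: broken directed links ("from->to") and crashed nodes.
function makeNetwork(brokenLinks, deadNodes = []) {
  const broken = new Set(brokenLinks);
  const dead = new Set(deadNodes);
  return async (from, to) => !dead.has(to) && !broken.has(`${from}->${to}`);
}

const members = ["P1", "P2", "P3", "P4"];
probe("P1", "P2", members, makeNetwork(["P1->P2"])).then((v) => console.log("broken link:", v));
probe("P1", "P2", members, makeNetwork([], ["P2"])).then((v) => console.log("crashed P2:", v));
```

With only the P1→P2 link broken, a helper relays the ack and P2 stays alive in P1’s eyes. With P2 actually crashed, every path fails and P1 ends up at “suspected”, which is still the strongest word it is entitled to.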
Phi-accrual: “alive” isn’t a boolean
Every detector so far ends in up/down. The phi-accrual (φ) detector (used by Cassandra and Akka) outputs a number instead: a suspicion level that rises the longer a node is silent. It keeps a sliding window of recent heartbeat gaps, learns what “normal” spacing looks like, and asks how surprising the current silence is given that history. When φ crosses a threshold, the node is suspected.
The win is self-tuning. On a tight datacenter link, a 2-second gap is shocking and φ spikes at once. On a flaky cross-region link where arrivals already scatter, the same gap is unremarkable and φ barely moves. You stop hand-picking one timeout for all weather and instead pick a confidence level, letting the math adapt the effective timeout to conditions.
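Here is a minimal sketch of the idea, under stated assumptions: heartbeat gaps are modeled as normally distributed, the tail probability uses a common logistic approximation of the normal CDF (the constants 1.5976 and 0.070566), and `minStdDevMs` is an assumed floor so a perfectly regular history cannot divide by zero. Real implementations differ in the distribution and the details; the class name and parameters here are invented for illustration.

```javascript
class PhiAccrualDetector {
  constructor({ windowSize = 100, minStdDevMs = 50 } = {}) {
    this.windowSize = windowSize;
    this.minStdDevMs = minStdDevMs;
    this.gaps = []; // sliding window of recent heartbeat gaps (ms)
    this.lastHeartbeatMs = null;
  }

  heartbeat(nowMs) {
    if (this.lastHeartbeatMs !== null) {
      this.gaps.push(nowMs - this.lastHeartbeatMs);
      if (this.gaps.length > this.windowSize) this.gaps.shift();
    }
    this.lastHeartbeatMs = nowMs;
  }

  // phi = -log10(P(a heartbeat arrives later than the current silence)).
  phi(nowMs) {
    if (this.gaps.length === 0) return 0; // no history yet, no basis for suspicion
    const n = this.gaps.length;
    const mean = this.gaps.reduce((a, b) => a + b, 0) / n;
    const variance = this.gaps.reduce((a, g) => a + (g - mean) ** 2, 0) / n;
    const stdDev = Math.max(Math.sqrt(variance), this.minStdDevMs);
    const y = (nowMs - this.lastHeartbeatMs - mean) / stdDev;
    // Logistic approximation of the normal tail probability P(gap > elapsed).
    const pLater = 1 / (1 + Math.exp(y * (1.5976 + 0.070566 * y * y)));
    return -Math.log10(pLater);
  }
}

// Helper for the demo: feed a detector a sequence of gaps starting at t=0.
function feed(gapsMs) {
  const detector = new PhiAccrualDetector();
  let t = 0;
  detector.heartbeat(t);
  for (const gap of gapsMs) detector.heartbeat((t += gap));
  return { detector, lastAt: t };
}

const steady = feed([1000, 1000, 1000, 1000, 1000, 1000]);
const jittery = feed([400, 1900, 700, 2300, 1200, 1500]);

// Same 2-second silence, very different levels of surprise.
const steadyPhi = steady.detector.phi(steady.lastAt + 2000);
const jitteryPhi = jittery.detector.phi(jittery.lastAt + 2000);
console.log({ steadyPhi, jitteryPhi });
```

Against an example threshold of 8, the steady link’s φ is far past it after two seconds of silence, while the jittery link’s φ is nowhere near. The threshold is the knob you pick; the effective timeout falls out of each link’s own history.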
Gossip: aggregate the whole cluster’s view
SWIM gets a few extra opinions; gossip-style detection makes everyone’s view of everyone available without anyone broadcasting to all. Each node keeps a table of heartbeat counters. Periodically it bumps its own counter and sends the table to one random peer, who merges in the higher counters. Counters spread hop by hop like an infection; a node whose counter hasn’t advanced anywhere for long enough is declared failed.
The payoff is reliability through redundancy: one broken link can’t fake a death, because the counter still arrives by another route, and the cluster’s view is an aggregate that’s hard to fool. The cost is more messages, but each node sends only a fixed number per round; what grows, linearly with cluster size, is the table each message carries.
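The merge rule is the heart of it, so here is a small sketch. It assumes each node stamps an entry with its own local time whenever that entry’s counter advances, and declares a peer failed once the stamp is older than `failAfterMs`. The class, method names, and timings are illustrative assumptions.

```javascript
class GossipNode {
  constructor(id, failAfterMs) {
    this.id = id;
    this.failAfterMs = failAfterMs;
    // peer id -> { counter, updatedAtMs }; updatedAtMs is LOCAL time of the last advance.
    this.table = new Map([[id, { counter: 0, updatedAtMs: 0 }]]);
  }

  // Bump our own counter; called once per gossip round.
  tick(nowMs) {
    const own = this.table.get(this.id);
    this.table.set(this.id, { counter: own.counter + 1, updatedAtMs: nowMs });
  }

  // The counters we send to one random peer.
  digest() {
    return [...this.table].map(([id, entry]) => [id, entry.counter]);
  }

  // Merge a received digest, keeping the higher counter for each peer.
  merge(digest, nowMs) {
    for (const [id, counter] of digest) {
      const known = this.table.get(id);
      if (!known || counter > known.counter) {
        this.table.set(id, { counter, updatedAtMs: nowMs });
      }
    }
  }

  // Peers whose counters have not advanced for too long.
  failed(nowMs) {
    return [...this.table]
      .filter(([id, entry]) => id !== this.id && nowMs - entry.updatedAtMs > this.failAfterMs)
      .map(([id]) => id);
  }
}

const a = new GossipNode("A", 3000);
const b = new GossipNode("B", 3000);
const c = new GossipNode("C", 3000);

// A and C never exchange a message, yet C's counter reaches A by way of B.
c.tick(1000);
b.merge(c.digest(), 1000);
b.tick(1000);
a.merge(b.digest(), 1000);

console.log(a.table.get("C").counter); // 1
console.log(a.failed(2000)); // []
console.log(a.failed(5000)); // [ 'B', 'C' ] since no counter has advanced since t=1000
```

A never heard from C directly, but it still has fresh evidence that C is alive. A broken A–C link alone would not be enough to get C declared failed.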
FUSE: propagate failure with silence
One neat inversion. Instead of sending news of a failure, FUSE spreads it by stopping. Group the processes; everyone pings everyone. The rule: the moment a process notices a peer go silent, it stops responding too. Silence is contagious — non-response cascades like dominoes until the whole group has gone quiet, and by then every member has learned the group failed. It’s guaranteed and cheap, since silence can’t itself be partitioned away. The trade-off is bluntness: one cut-off node can escalate to a whole-group failure — sometimes a bug, sometimes exactly the fan-out you want.
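The contagion rule fits in a few lines. This is a toy model of the rule as described here, not the FUSE protocol itself: members are in-process objects, “pinging” is a method call, and the names (`FuseMember`, `makeGroup`) are hypothetical.

```javascript
class FuseMember {
  constructor(id) {
    this.id = id;
    this.responding = true;
    this.peers = [];
  }

  // Answer pings only while the group still looks healthy from here.
  respondsToPing() {
    return this.responding;
  }

  // One monitoring round: returns true if this member just went silent.
  checkPeers() {
    if (this.responding && this.peers.some((p) => !p.respondsToPing())) {
      this.responding = false; // propagate the failure by going quiet
      return true;
    }
    return false;
  }
}

// Everyone pings everyone.
function makeGroup(ids) {
  const members = ids.map((id) => new FuseMember(id));
  for (const m of members) m.peers = members.filter((p) => p !== m);
  return members;
}

const group = makeGroup(["A", "B", "C", "D"]);
group[0].responding = false; // A crashes, or is merely cut off; the group cannot tell which

// Keep running monitoring rounds until nothing changes.
let changed = true;
while (changed) {
  changed = false;
  for (const m of group) {
    if (m.checkPeers()) changed = true;
  }
}

const summary = group.map((m) => `${m.id}:${m.responding ? "up" : "silent"}`).join(" ");
console.log(summary); // A:silent B:silent C:silent D:silent
```

No failure message is ever sent; each member learns the group has failed by noticing that someone stopped answering, and teaches the rest the same way.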
Mapping it to systems you run
| Approach | Idea | Where you’ve met it |
|---|---|---|
| Ping / heartbeat + timeout | One node’s up/down verdict | Akka deadline detector · k8s liveness probes · LB health checks |
| SWIM (indirect probes) | Get a second opinion before accusing | Serf · Consul membership |
| Phi-accrual (φ) | Continuous, self-tuning suspicion level | Cassandra · Akka cluster |
| Gossip heartbeat table | Aggregate cluster-wide view | Dynamo-style rings |
| FUSE / quiescence | Propagate failure by going silent | Group-failure escalation |
The takeaway
“Send a ping, no reply, mark it dead” isn’t the failure detector; it’s just the cheapest one, pinned to the most trigger-happy corner of the curve with exactly one opinion behind it. It works until the day a GC pause or a one-sided partition talks your cluster into a failover. Two truths replace it:
- You can only suspect, never know. So aim for an honest, tunable confidence level — which is exactly what phi-accrual makes explicit.
- One node’s view is the weakest evidence. So corroborate — outsource the probe (SWIM), aggregate the cluster (gossip), or let silence carry it (FUSE).
Then pick your spot deliberately. Cheap false positive (drop a cache replica, retry elsewhere)? Lean fast. Expensive one (fence a primary, trigger an election)? Lean accurate and demand corroboration. The subsystem that quietly decides who’s alive deserves as much care as the consensus protocol on top of it — that protocol is only ever as good as the suspicions you feed it.