Is It Dead, Or Just Slow?

[Figure: "A failure detector never knows. It only ever suspects." Naive: one ping, one timeout (P1 pings P2, hears silence, and the timeout says "DEAD", maybe wrongly). Better: ask the neighbours (P1, P2, P3; many views, fewer false alarms). Caption: Failure detection: trading how fast you accuse against how often you're wrong.]

If a tree falls in a forest and no one hears it, did it make a sound? Distributed systems face a crueler version every second: a server stops answering you — did it die, or did your message just get lost on the way? You can’t tell. Not “it’s hard” — you provably cannot tell from where you’re standing.

So the one truth this whole post hangs on: “I can’t reach it” is not the same as “it’s dead.” One is about the network; the other is about the node. Treating them as the same is how a healthy cluster talks itself into a failover at 3 AM. Every system you’ve shipped has a tiny subsystem that makes this call — and the naive version, “send a ping, no reply in a few seconds, mark it dead,” is wrong in a few interesting ways. Two words fix it: suspect and corroborate.

The lie in “no reply means dead”

Here’s the detector almost everyone writes first:

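// `ping` stands in for whatever transport call you use; it is assumed
// to reject if no ack arrives within `timeoutMs`.
async function isAlive(peer) {
  try {
    await ping(peer, { timeoutMs: 3000 });
    return true;
  } catch {
    return false; // <-- "no reply in 3s, therefore dead"
  }
}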

The bug isn’t the code — it’s the word therefore. A missing reply has at least four explanations, and you can’t tell them apart:

  • Dead — the process crashed and will never answer.
  • Slow — it’s alive but mid-GC-pause or CPU-starved; the reply is coming.
  • Unreachable from you — a partition ate your packet; other nodes still reach it fine.
  • Ack lost — it answered; the answer never made it back.

That’s why the literature is picky about words: a process that truly stopped is dead; one you only think stopped is suspected. A failure detector never gives you the first — only ever the second.

Why this matters. Act on “suspected” as if it were “dead” — fence the node, reassign its shards, promote a new leader — and if you were wrong you’ve triggered a needless failover and maybe a split brain. False positives aren’t free; they’re how a healthy cluster takes itself down.

This is the FLP impossibility result in work clothes: in a fully asynchronous system you can’t reliably tell a crashed node from a slow one, so no deterministic protocol can guarantee that consensus terminates if even one node may crash. The practical fix, formalized by Chandra and Toueg as unreliable failure detectors, is to accept a detector that’s allowed to be wrong. That turns out to be enough to build real systems on.

The trade-off you can’t escape

Judge any detector on two things in tension: completeness (every dead node is eventually suspected) and accuracy (live nodes aren’t falsely accused). The provable catch: you can’t be both perfectly fast and perfectly accurate. Tight timeouts catch crashes quickly but accuse nodes that merely paused; loose timeouts are accurate but slow to react.

Mental model. It’s one slider: fast ⟷ accurate. The naive detector nails that slider to a single timeout. Everything clever below just makes the slider adaptive — to network conditions and to what other nodes are seeing.
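A rough way to see the slider is to replay a live node's heartbeat gaps against two fixed timeouts. This is only a sketch: the gap values are made up, and `falseSuspicions` is a hypothetical helper, not part of any library.

```javascript
// Gaps (ms) between consecutive heartbeats from a node that never crashes.
// Hypothetical data: mostly about 1s apart, with one 4s GC pause.
const gapsMs = [1000, 1020, 980, 4000, 1010, 990];

// Count how often a fixed timeout would have wrongly accused this live node.
function falseSuspicions(gaps, timeoutMs) {
  return gaps.filter((gap) => gap > timeoutMs).length;
}

for (const timeoutMs of [1500, 5000]) {
  // After a real crash the next heartbeat never arrives, so the node is
  // suspected roughly one full timeout after its last heartbeat.
  console.log(
    `timeout=${timeoutMs}ms`,
    `false suspicions=${falseSuspicions(gapsMs, timeoutMs)}`,
    `crash noticed after ~${timeoutMs}ms`
  );
}
```

The tight timeout reacts sooner but accuses the node during its pause; the loose one stays quiet and pays for that with a slower reaction to a real crash.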

Pings, heartbeats, and their blind spot

Two ways to check a peer, differing only in direction. A ping: you ask “are you there?” and wait for an ack. A heartbeat: the peer keeps sending “still here,” and you watch for its absence. Both land in the same place — a table of peers with a last-seen time, anyone too quiet gets suspected. Akka’s deadline detector is the textbook example.
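The last-seen table can be sketched in a few lines. The class name `HeartbeatTable`, its methods, and the injectable clock are assumptions made for illustration; real detectors add eviction, jitter handling, and more.

```javascript
class HeartbeatTable {
  // `now` is injectable so the example can run with a fake clock.
  constructor(timeoutMs, now = () => Date.now()) {
    this.timeoutMs = timeoutMs;
    this.now = now;
    this.lastSeen = new Map(); // peer id -> time of last heartbeat (ms)
  }

  recordHeartbeat(peer) {
    this.lastSeen.set(peer, this.now());
  }

  // Peers that have been quiet for longer than the timeout.
  // Note the wording: they are suspected, not known to be dead.
  suspects() {
    const cutoff = this.now() - this.timeoutMs;
    return [...this.lastSeen]
      .filter(([, seenAt]) => seenAt < cutoff)
      .map(([peer]) => peer);
  }
}

let clock = 0;
const table = new HeartbeatTable(3000, () => clock);
table.recordHeartbeat("P2");
table.recordHeartbeat("P3");
clock = 2000;
table.recordHeartbeat("P3"); // P3 keeps beating, P2 goes quiet
clock = 4000;
console.log(table.suspects()); // P2 has been silent for 4000ms, P3 for 2000ms
```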

The catch with this whole family: it captures one node’s point of view. P1 deciding P2 is dead says nothing about whether P3 can still reach P2. Fixing that blind spot is what everything below is about.

SWIM: don’t take your own word for it

The single best idea to steal. When your direct ping to a peer goes unanswered, don’t conclude anything — outsource the question:

  1. P1 pings P2. No reply.
  2. P1 picks a few random members (P3, P4) and asks each: “ping P2 for me.”
  3. If any of them gets a reply, it relays the ack back to P1.
  4. Only if every indirect probe also fails does P1 suspect P2.

[Figure: SWIM: one failed ping is not a verdict. (1) P1's direct ping to P2 gets no reply; (2) P1 asks P3 and P4 to "ping P2 for me"; (3) they send indirect probes. If any indirect probe replies, P2 was never dead, just unreachable from P1.]

Now P2 is judged from several vantage points, so a broken P1→P2 link stops masquerading as a dead P2, and the probes fan out in parallel, so the answer is fast. Each node probes only a handful of peers per round, so per-node load stays roughly constant as the cluster grows, which is why it scales. Serf and Consul’s membership are built on this.
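The four probe steps can be sketched as follows. Everything here is illustrative: `makePing` simulates a transport with a set of broken one-way links, `pickRandom` and `probe` are hypothetical names, and the relay of the ack back to P1 is folded into a single simulated round trip. It is not how Serf or Consul implement the protocol.

```javascript
// Hypothetical transport: ping(from, to) resolves if an ack comes back
// and rejects otherwise. Simulated with broken links and dead nodes.
function makePing(brokenLinks, deadNodes = new Set()) {
  return async function ping(from, to) {
    if (deadNodes.has(to) || brokenLinks.has(`${from}->${to}`)) {
      throw new Error(`${from} got no ack from ${to}`);
    }
  };
}

// Partial Fisher-Yates shuffle: the first k slots become a random sample.
function pickRandom(items, k) {
  const pool = [...items];
  const count = Math.min(k, pool.length);
  for (let i = 0; i < count; i++) {
    const j = i + Math.floor(Math.random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, count);
}

async function probe(self, target, members, ping, k = 2) {
  try {
    await ping(self, target); // step 1: direct ping
    return "alive";
  } catch {
    // no direct ack: fall through and ask for second opinions
  }
  const others = members.filter((m) => m !== self && m !== target);
  const helpers = pickRandom(others, k);
  try {
    // Steps 2-3: a helper counts only if we can reach it AND it reaches the target.
    // Promise.any resolves on the first success and rejects if all fail
    // (including the case where there are no helpers at all).
    await Promise.any(
      helpers.map(async (h) => {
        await ping(self, h);
        await ping(h, target);
      })
    );
    return "alive";
  } catch {
    return "suspected"; // step 4: every indirect probe failed too
  }
}

const members = ["P1", "P2", "P3", "P4"];
const ping = makePing(new Set(["P1->P2"])); // only the P1->P2 link is broken
probe("P1", "P2", members, ping).then((verdict) => console.log(verdict));
```

Note that the worst outcome is "suspected", never "dead": even unanimous silence is still only evidence.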

The portable lesson. Before acting on a suspicion, get a second opinion from a node that isn’t you. One failed health check is a question; three independent ones agreeing is an answer.

Phi-accrual: “alive” isn’t a boolean

Every detector so far ends in up/down. The phi-accrual (φ) detector (used by Cassandra and Akka) outputs a number instead: a suspicion level that rises the longer a node is silent. It keeps a sliding window of recent heartbeat gaps, learns what “normal” spacing looks like, and asks how surprising the current silence is given that history. When φ crosses a threshold, the node is suspected.

The win is self-tuning. On a tight datacenter link, a 2-second gap is shocking and φ spikes at once. On a flaky cross-region link where arrivals already scatter, the same gap is unremarkable and φ barely moves. You stop hand-picking one timeout for all weather and instead pick a confidence level, letting the math adapt the effective timeout to conditions.
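A minimal sketch of the idea, under stated assumptions: heartbeat gaps are modelled as normally distributed, the normal tail is computed with a common logistic approximation, and a floor on the standard deviation (`minStdDevMs`, an illustrative value) keeps very regular heartbeats from producing an infinite φ. The class and helper names are hypothetical, and production detectors differ in their details.

```javascript
class PhiAccrualDetector {
  // windowSize and minStdDevMs are illustrative defaults, not tuned values.
  constructor({ windowSize = 100, minStdDevMs = 100 } = {}) {
    this.windowSize = windowSize;
    this.minStdDevMs = minStdDevMs;
    this.gapsMs = []; // sliding window of recent heartbeat gaps
    this.lastArrivalMs = null;
  }

  heartbeat(nowMs) {
    if (this.lastArrivalMs !== null) {
      this.gapsMs.push(nowMs - this.lastArrivalMs);
      if (this.gapsMs.length > this.windowSize) this.gapsMs.shift();
    }
    this.lastArrivalMs = nowMs;
  }

  phi(nowMs) {
    if (this.gapsMs.length === 0) return 0; // no history, no basis for suspicion
    const n = this.gapsMs.length;
    const mean = this.gapsMs.reduce((sum, g) => sum + g, 0) / n;
    const variance = this.gapsMs.reduce((sum, g) => sum + (g - mean) ** 2, 0) / n;
    const stdDev = Math.max(Math.sqrt(variance), this.minStdDevMs);

    // y: how many standard deviations the current silence sits past the mean gap.
    const y = (nowMs - this.lastArrivalMs - mean) / stdDev;
    // Logistic approximation of the normal CDF; pLater approximates P(gap > silence).
    const e = Math.exp(-y * (1.5976 + 0.070566 * y * y));
    const pLater = y > 0 ? e / (1 + e) : 1 - 1 / (1 + e);
    return -Math.log10(pLater);
  }
}

// Feed a detector a series of gaps; returns the time of the last heartbeat.
function feed(detector, gapsMs) {
  let t = 0;
  detector.heartbeat(t);
  for (const gap of gapsMs) detector.heartbeat((t += gap));
  return t;
}

const tight = new PhiAccrualDetector();
const tightLast = feed(tight, [1000, 1010, 990, 1005, 995]);
const flaky = new PhiAccrualDetector();
const flakyLast = feed(flaky, [500, 2500, 800, 3000, 1200, 2000]);

console.log("tight link, 2s silence:", tight.phi(tightLast + 2000).toFixed(1));
console.log("flaky link, 2s silence:", flaky.phi(flakyLast + 2000).toFixed(1));
```

The same 2-second silence yields a large φ on the regular link and a small one on the scattered link, so a single threshold behaves like a different timeout on each.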

Gossip: aggregate the whole cluster’s view

SWIM gets a few extra opinions; gossip-style detection makes everyone’s view of everyone available without anyone broadcasting to all. Each node keeps a table of heartbeat counters. Periodically it bumps its own counter and sends the table to one random peer, who merges in the higher counters. Counters spread hop by hop like an infection; a node whose counter hasn’t advanced anywhere for long enough is declared failed.

[Figure: Broken P1–P3 link? P3's pulse still routes through P2. Gossip flows P1–P2 and P2–P3 while the direct P1–P3 link is down.]

The payoff is reliability through redundancy: one broken link can’t fake a death, because the counter still arrives by another route, and the cluster’s view is an aggregate that’s hard to fool. The cost is more bytes on the wire, but it stays predictable: each node sends to a fixed number of peers per round, and the table it sends grows only linearly with cluster size.
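The bump-and-merge cycle might look like the sketch below. The table shape (`counter` plus a local `updatedAt` timestamp) and the function names `bumpOwn`, `merge`, and `failed` are assumptions for illustration; choosing the random gossip partner and sending the table over the network are left out.

```javascript
// A table maps node id -> { counter, updatedAt }, where updatedAt is the
// local time at which this node last saw that counter advance.

function bumpOwn(table, self, nowMs) {
  const counter = (table.get(self)?.counter ?? 0) + 1;
  table.set(self, { counter, updatedAt: nowMs });
}

// Keep the higher counter for every node; only an advance refreshes updatedAt.
function merge(local, remote, nowMs) {
  for (const [node, { counter }] of remote) {
    const known = local.get(node);
    if (!known || counter > known.counter) {
      local.set(node, { counter, updatedAt: nowMs });
    }
  }
}

// Nodes whose counter has not advanced (as far as we have heard) for too long.
function failed(table, self, nowMs, failAfterMs) {
  return [...table]
    .filter(([node, entry]) => node !== self && nowMs - entry.updatedAt > failAfterMs)
    .map(([node]) => node);
}

// The direct P1-P3 link is down, so P3's counter reaches P1 through P2.
const p1 = new Map();
const p2 = new Map();
const p3 = new Map();
bumpOwn(p3, "P3", 0);
merge(p2, p3, 0); // P3 gossips to P2
bumpOwn(p2, "P2", 0);
merge(p1, p2, 0); // P2 gossips to P1
console.log(p1.get("P3")); // P1 has P3's heartbeat without ever talking to P3
console.log(failed(p1, "P1", 1000, 5000)); // nothing has been quiet long enough yet
```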

FUSE: propagate failure with silence

One neat inversion. Instead of sending news of a failure, FUSE spreads it by stopping. Group the processes; everyone pings everyone. The rule: the moment a process notices a peer go silent, it stops responding too. Silence is contagious — non-response cascades like dominoes until the whole group has gone quiet, and by then every member has learned the group failed. It’s guaranteed and cheap, since silence can’t itself be partitioned away. The trade-off is bluntness: one cut-off node can escalate to a whole-group failure — sometimes a bug, sometimes exactly the fan-out you want.
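The domino effect is easy to simulate. In this sketch, `simulateFuse` is a hypothetical function and the group is wired as a chain (an assumption chosen to make the cascade visible round by round); with everyone pinging everyone, the whole group goes quiet in a single round.

```javascript
// links: undirected [a, b] pairs over which members ping each other.
// Returns, round by round, which members stop responding.
function simulateFuse(links, firstSilent) {
  // Build the neighbour map from the ping links.
  const neighbours = new Map();
  for (const [a, b] of links) {
    for (const [x, y] of [[a, b], [b, a]]) {
      if (!neighbours.has(x)) neighbours.set(x, []);
      neighbours.get(x).push(y);
    }
  }

  const silent = new Set([firstSilent]);
  const rounds = [[firstSilent]];
  while (true) {
    // The rule: a member goes silent as soon as any peer it pings is silent.
    const next = [...neighbours.keys()].filter(
      (m) => !silent.has(m) && neighbours.get(m).some((n) => silent.has(n))
    );
    if (next.length === 0) break;
    next.forEach((m) => silent.add(m));
    rounds.push(next);
  }
  return rounds;
}

const chain = [["P1", "P2"], ["P2", "P3"], ["P3", "P4"]];
console.log(simulateFuse(chain, "P1")); // one member falls silent per round
```

No failure message is ever sent in this model; each member learns of the failure purely from a missing reply.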

Mapping it to systems you run

| Approach | Idea | Where you’ve met it |
|---|---|---|
| Ping / heartbeat + timeout | One node’s up/down verdict | Akka deadline detector · k8s liveness probes · LB health checks |
| SWIM (indirect probes) | Get a second opinion before accusing | Serf · Consul membership |
| Phi-accrual (φ) | Continuous, self-tuning suspicion level | Cassandra · Akka cluster |
| Gossip heartbeat table | Aggregate cluster-wide view | Dynamo-style rings |
| FUSE / quiescence | Propagate failure by going silent | Group-failure escalation |

The takeaway

“Send a ping, no reply, mark it dead” isn’t the only failure detector; it’s just the cheapest one, pinned to the most trigger-happy corner of the curve with exactly one opinion behind it. It works until the day a GC pause or a one-sided partition talks your cluster into a failover. Two truths replace it:

  1. You can only suspect, never know. So aim for an honest, tunable confidence level — which is exactly what phi-accrual makes explicit.
  2. One node’s view is the weakest evidence. So corroborate — outsource the probe (SWIM), aggregate the cluster (gossip), or let silence carry it (FUSE).

Then pick your spot deliberately. Cheap false positive (drop a cache replica, retry elsewhere)? Lean fast. Expensive one (fence a primary, trigger an election)? Lean accurate and demand corroboration. The subsystem that quietly decides who’s alive deserves as much care as the consensus protocol on top of it — that protocol is only ever as good as the suspicions you feed it.

Further reading. Chandra & Toueg (1996); Gupta et al., “SWIM” (2002); Hayashibara et al., “φ Accrual Failure Detector” (2004); van Renesse et al. (1998); Dunagan et al., “FUSE” (2004). Framing follows the failure-detection chapter of Alex Petrov’s Database Internals.