Two Leaders, One Cluster Every simple election has the same failure mode: split brain. network partition subset A L 2 3 "I am the leader" subset B L 5 6 "No — I am the leader"

Ask most engineers what a leader election does and you’ll hear “it picks the best node.” That’s the wrong mental model, and it’s the source of the single nastiest bug in this whole area. An election isn’t about choosing — it’s about agreeing. The job isn’t “find the highest-ranked node”; it’s “get every node to believe in the same leader at the same time.” The moment you see it that way, one uncomfortable fact drops out: getting everyone to agree on one value is consensus — and every shortcut that skips real consensus ships with the same failure, two leaders who don’t know the other exists. We call it split brain.

So here’s the thread the whole post hangs on: a leader is a value the whole cluster has to agree on, and agreement under failure is hard. The classic textbook algorithms — bully, ring, invitation — are clever and clean, and every one of them breaks the instant the network splits. Two words explain both why they break and how the systems you actually run (Raft, ZooKeeper, etcd, Kubernetes) fix it: agreement and majority.

Why bother electing a leader at all?

Coordination is expensive, and the cost is combinatorial. In a leaderless protocol where every step is an all-to-all exchange, a single round is O(N²) messages; push that across a quorum protocol with multiple phases and it compounds — and it hurts most exactly where you can least afford it, in geographically spread clusters where each round-trip is tens to hundreds of milliseconds of inter-region latency. A leader collapses that fan-out: peers stop negotiating with each other and instead talk to one coordinator, turning an O(N²) conversation into O(N) star-shaped traffic with a single authoritative view of the world.

That single authoritative view buys you concrete things:

  • Total order. Want every replica to apply writes in the exact same sequence? Funnel them through one leader that stamps an order and broadcasts it. Ordering-by-committee is a nightmare; ordering-by-leader is trivial.
  • A single writer. One process owns the mutation path, so you sidestep a whole category of concurrent-update conflicts.
  • Coordinated reorganization. After a failure, during startup, or when the topology changes, someone has to drive the reshuffle. The leader is that someone.

The obvious objection is that the leader becomes a bottleneck — one process can’t serve the whole world. Real systems answer this by not having one system-wide leader. They partition the data into independent replica sets and give each set its own leader. Google’s Spanner does exactly this: every Paxos group has its own leader, so leadership scales horizontally with the data. The leader pattern doesn’t mean “one boss for everything” — it means “one boss per unit of work.”

The deal you’re making. A leader trades coordination cost for a dependency. You stop paying for constant all-to-all chatter, but now the system’s progress hinges on one process being alive and reachable. That trade is almost always worth it — as long as you can detect when the leader dies and elect a new one quickly.

A leader is not a lock

These look identical on a whiteboard — “exactly one process gets the special role” — but they differ in a way that shapes every algorithm below.

With a distributed lock, nobody else needs to know who holds it. You only care that it’ll eventually be released so you get your turn. The holder is anonymous, and fairness matters: a lock that always favors the same process starves everyone else, violating liveness.

A leader is the opposite on both counts. Its identity is public — the whole point is that everyone must know who the leader is, so a freshly elected leader has to announce itself. And leadership deliberately plays favorites: a good, stable leader should keep the job for as long as it lives. Long-lived leaders aren’t a bug, they’re the goal — re-electing constantly is pure overhead.

A lock says “someone is in the room, wait your turn.” A leader says “this specific node is in charge, and here’s its name — write it down.”

The two properties every election is judged on

Like failure detection, leader election lives on a tension between two guarantees:

  • Liveness — most of the time there is a leader, and an election always eventually finishes. The system must never get stuck in “electing…” forever.
  • Safety — there is at most one leader at any moment. No split brain, ever.

Here’s the punchline you should brace for now, because it reframes everything: most of the famous election algorithms guarantee liveness but quietly violate safety. They will, under the right partition, hand you two leaders. That’s not always a mistake — as we’ll see, some systems accept multiple leaders on purpose and clean up conflicts afterward — but if you assumed safety and built on it, that assumption is a landmine.

The four families of leader election

Strip away the individual papers and every leader election falls into one of four families. The first three are the classics you meet in a textbook — each clever, each with a distinct message pattern, and each sharing one fatal flaw. The fourth is what every production system actually runs. We’ll take them in order, and for each one lay out plainly how it works, what it costs, and where it breaks.

1. Rank-based — the bully algorithm

The most-taught election is the bully algorithm (Garcia-Molina, 1982). Every process has a unique numeric rank, and the rule is brutally simple: the highest-ranked live process wins. It’s named “bully” because the top node muscles everyone else into accepting it — sometimes called monarchial election, where the highest-ranked sibling takes the throne the moment the old monarch dies.

A process triggers an election when its failure detector marks the leader as down (or at boot, when there is no leader). Three message types are in play — Election (I’m starting a round), Alive/Answer (I outrank you, stand down), and Coordinator/Elected (I’m the new leader). The initiator runs three steps:

  1. Send Election to every process with a strictly higher ID.
  2. Start a timer T ≈ one round-trip. If no Alive arrives before T, conclude no higher node is up and go to step 3. If an Alive does arrive, abandon this attempt and instead arm a longer timer T′; if no Coordinator announcement lands before T′ (the higher node that answered then died mid-election), restart the whole round.
  3. Declare victory and broadcast Coordinator to all lower-ranked nodes.

The cost is the tell. In the worst case — the lowest-ranked node notices the failure first — every node sends Election to every higher node, giving O(N²) messages per election. The two timers T and T′ also mean the latency to a decision is bounded by the failure detector’s timeout, not by network speed: tune them too tight and a GC-paused higher node gets skipped; too loose and failover drags.

Bully: leader 6 crashed, process 3 starts the election 3 4 5 6 crashed 1. Election → higher ranks 2. "Alive" from 4 and 5 3. Highest responder (5) broadcasts "Elected"

It works, and its simplicity is genuinely attractive. But it has two ugly failure modes.

Split brain under partition. Cut the network in half and each half runs its own election. Each half has some “highest live node,” so each half crowns a leader — two monarchs, neither aware of the other. The bully algorithm has no mechanism to prevent this, because a step-3 “I’m the highest I can see” is a purely local judgment.

The reelection storm. The strong bias toward high ranks backfires when the top node is flaky. An unstable high-ranked process proposes itself, wins, crashes seconds later, comes back, wins again, crashes again — the cluster burns its time re-electing instead of doing work. (One fix: circulate host-quality metrics and factor stability into the ranking, not just the ID.)

The takeaway from bully. “Pick the highest ID I can currently see” is a local decision dressed up as a global one. Two nodes in two partitions can both run it honestly and both be “right” — which is exactly how you get two leaders.

Variants that trim the cost

Because the raw bully algorithm is chatty and re-elects from scratch every time, there are well-known refinements:

  • Next-in-line failover. Each elected leader publishes an ordered list of backups: {5, 4}. When the leader dies, whoever notices contacts the top backup directly. If it’s alive, it becomes leader immediately — no full election round. Fewer messages when the obvious successor is healthy.
  • Candidate / ordinary split. Divide nodes into a small candidate set (eligible to lead) and a larger ordinary set (can only trigger and observe). An ordinary node that spots the failure polls the candidates, picks the highest live one, and announces it. To stop several nodes kicking off elections at once, each node waits a distinct delay δ before starting — higher-priority nodes wait less — so one election usually gets going before the others fire.

These trim the message count. Neither of them fixes split brain, because none of them establishes a cluster-wide majority.

Pros
  • Trivial to reason about — highest live ID wins, full stop.
  • Deterministic outcome: same live set always elects the same leader.
  • Fast when a high-ranked node initiates (short path to a decision).
Cons
  • O(N²) messages in the worst case (lowest node initiates).
  • Split brain under partition — each side crowns its own leader.
  • Reelection storms when the top-ranked node is flaky.
  • No notion of a quorum, so no safety guarantee at all.

2. Topology-based — the ring algorithm

The ring algorithm (Chang & Roberts, 1979) trades all-to-all broadcast for a fixed topology, and in doing so drops the message count from O(N²) toward O(N). Every node knows only its successor. When a node detects the leader is gone, it emits an Election message carrying a set of live IDs (starting with its own) to its successor. Each recipient appends its own ID and forwards. A successor that doesn’t ACK within the failure-detector timeout is skipped — the sender walks to the next node clockwise until one responds — so no single dead node stalls the traversal.

The classic Chang–Roberts optimization is comparison-based and needs no accumulated set at all: a node forwards an incoming ID only if it is greater than its own, swallows it if smaller, and recognizes its own ID coming back as the signal that it is the maximum. Because max is associative and commutative, a single running maximum suffices — and suppressing smaller IDs is what keeps the common case linear rather than quadratic.

Ring: the election message walks the circle, skipping the dead node 6 crashed 5 1 3 4 3 → 4 → 5 → (skip 6) → 1 → back to 3 live set collected: {3,4,5,1} → max = 5 wins

When the message completes a lap back to the initiator, the winner is known; a second lap distributes the Coordinator announcement. Two things bite in practice. First, ring membership must be maintained — a node that dies takes its position with it, and rebuilding successor links after churn is its own coordination problem (this is why pure ring election shows up more in theory than in production). Second, concurrent initiators: if several nodes start rounds at once you get multiple messages circulating, and without the comparison-based suppression the count degrades toward O(N²) as every round makes a full lap.

And it has the same fatal flaw as bully. Partition the ring and each arc heals into its own smaller ring, completes its own traversal, and elects its own maximum. No safety — because, again, “the max I can reach” is a local predicate.

Pros
  • O(N) messages in the common case — far cheaper than bully.
  • Each node knows only its successor; no all-to-all fan-out.
  • Comparison-based variant needs O(1) state per message (a running max).
Cons
  • Split brain — a partitioned ring elects one leader per arc.
  • Ring membership must be repaired on churn — a coordination problem of its own.
  • Concurrent initiators degrade toward O(N²) without suppression.
  • Higher latency: a decision needs a full lap around the ring.

3. Group-merge — the invitation algorithm

The invitation algorithm takes the split-brain reality and leans into it. Instead of outranking rivals, processes invite each other into groups. Everyone starts as the leader of a group of one. Leaders invite outsiders to join; when two leaders meet, their groups merge and one of them steps down. To keep merges cheap, the leader of the larger group usually wins, so only the smaller group has to relabel its leader.

Invitation: two groups merge, the larger leader keeps the crown before L1 2 group A L3 4 group B "join me" after L1 2 3↓ 4 one group · L3 steps down to follower

This explicitly allows multiple leaders — one per group — and treats the whole thing as a merge problem rather than a winner-take-all fight. It’s a useful reframing: sometimes “several leaders that reconcile” is genuinely what you want, and merging groups is far cheaper than re-running a global election from scratch.

Pros
  • Handles partitions gracefully — sub-groups keep working, then merge.
  • No election-from-scratch: merging is cheaper than a fresh round.
  • Merging the smaller group into the larger minimizes relabeling.
Cons
  • Multiple leaders by design — you must reconcile divergent state.
  • Only useful when the workload tolerates several independent leaders.
  • More moving parts (group membership, merge protocol) to get right.

The bug they all share, stated plainly

Bully, ring, invitation — walk them through a network partition and every one produces multiple leaders. This isn’t a coincidence or a set of independent bugs. It’s the same bug, and it has a precise cause:

The root cause. Each of these algorithms lets a node declare victory based on what it can see. But in a partition, what a node can see is only its side of the split. “I’m the highest node I can reach” is true simultaneously in both halves. Any rule that can be satisfied locally can be satisfied twice.

So the fix can’t be “a better local rule.” It has to be a rule that can’t be true in two places at once. There is exactly one such rule.

The only real fix: demand a majority

To become leader, don’t ask “am I the highest I can see?” Ask “can I get votes from more than half of all nodes?” — a strict majority, ⌊N/2⌋ + 1, counted against the total cluster size, not just the reachable part.

Why this closes the door completely: two disjoint majorities of the same set can’t exist. If subset A has more than half the nodes, subset B has fewer than half and cannot assemble a majority. At most one partition can ever elect a leader; the minority side is stuck without one until the partition heals. You’ve traded away availability on the losing side to buy safety everywhere — and that’s usually the right trade for anything holding state.

5 nodes split 3 : 2 — only the majority side elects partition 3 nodes — majority ✓ L 3 of 5 votes → leader elected 2 nodes — minority ✗ ? ? only 2 of 5 → no leader, refuses to serve writes
Why clusters are odd-sized. With an even N, a perfectly balanced split (2–2, 3–3) leaves neither side with a majority — nobody can lead. An odd N guarantees exactly one side wins any split. That’s why you see 3, 5, and 7-node clusters everywhere and almost never 4 or 6: the 4th node adds cost and failure surface without improving your fault tolerance (both 3 and 4 nodes tolerate one failure).

So… leader election is consensus

Step back and look at what “get a majority to agree on one leader” actually is. It’s a group of processes agreeing on a single value under failures. That’s the literal definition of consensus. This is the deep punchline of the whole topic, and it cuts both ways:

To elect a leader, you must reach consensus on its identity. But if you can reach consensus on the leader’s identity, you can reach consensus on anything. Leader election and consensus are the same problem wearing different name tags.

This is exactly why the algorithms that do get safety right — Raft, Multi-Paxos, ZAB — don’t bolt a “leader election module” onto the side. Election is folded into the consensus protocol itself. It’s worth walking one all the way through, because we gave bully and ring a full step-by-step and the correct algorithm deserves at least as much.

4. Quorum-based — consensus (Raft, Multi-Paxos, ZAB)

This is the family that gets safety right, and the one your infrastructure actually runs. Instead of a local “highest I can see,” a candidate must win votes from a majority of the whole cluster before it may lead. We gave bully and ring a full step-by-step, so the correct algorithm deserves the same — let’s walk Raft all the way through.

Raft (Ongaro & Ousterhout, 2014) was designed to be understandable, and its election rests on three rules that, together, make split brain impossible:

  • Terms. Time is chopped into numbered terms — a monotonically increasing counter. Every message carries a term. A term has at most one leader, so “which leader?” collapses to “leader of which term?”, and a higher term always beats a lower one.
  • Vote once per term. A node grants its vote to at most one candidate per term. This single rule is what mathematically forbids two leaders in the same term: two candidates can’t both gather a majority if no node votes twice.
  • Randomized election timeouts. Each follower waits a random interval (say 150–300 ms) after last hearing from the leader before it stands for election. The randomness means one follower almost always times out first — the practical cure for the bully algorithm’s reelection storm.

Here’s a full round. A 5-node cluster loses its leader for term 4:

  1. Follower C’s random timer fires first. It bumps the term to 5, votes for itself, becomes a candidate, and sends RequestVote(term=5) to everyone.
  2. Each node that hasn’t voted in term 5 checks the candidate’s log is at least as up-to-date as its own, then replies granted — and records “I voted in term 5,” so it can’t vote again.
  3. The instant C has votes from a majority (3 of 5, counting itself), it becomes leader for term 5 and immediately sends heartbeats to shut down any other timer.
  4. If a straggler D had also timed out and asked for votes in term 5, the nodes that already voted for C reply denied. D can’t reach a majority, gives up, and becomes a follower the moment it hears C’s higher-or-equal term.
Raft: candidate C wins term 5 with 3 of 5 votes Ccandidate term 5 · self-vote (1) A B D E granted (2) granted (3) → majority! reachable or not, C already won

Multi-Paxos and ZAB use the same shape under other names — Paxos calls the counter a ballot/round number, ZAB calls it an epoch. The recurring trick is always a monotonic number plus a majority: agree on a leader-per-epoch, and let a higher epoch override a stale one.

Pros
  • No split brain — two majorities of one set can’t coexist.
  • Safety and liveness proven, not hoped for.
  • The elected leader is a genuine consensus leader — usable for ordering, replication, everything.
  • Randomized timeouts kill the reelection storm.
Cons
  • Minority partitions lose their leader and stop serving writes (CP over AP).
  • Needs an odd, majority-reachable quorum — more nodes, more cost.
  • More complex to implement correctly (log matching, term rules, persistence).
The reframe worth keeping. Bully and ring try to elect a leader and hope for agreement. Raft and friends agree on a leader and get the election for free. Starting from “agreement” instead of “selection” is the entire difference between an algorithm that split-brains and one that doesn’t.

The four families, side by side

#FamilyExampleMessagesSafe under partition?Where it fits
1Rank-basedBullyO(N²) worst caseNo — split brainTeaching; tiny trusted clusters
2Topology-basedRing (Chang–Roberts)O(N) typicalNo — one leader per arcMostly theoretical; token-ring lineage
3Group-mergeInvitationVaries (merge cost)Allows many leaders by designAP systems that reconcile later
4Quorum / consensusRaft · Multi-Paxos · ZABO(N) per round, needs majorityYes — majority forbids itEverything in production (etcd, ZK, …)

Keeping a leader: stable leader election

Winning a vote is half the story; you also want a leader to keep the job so you’re not paying for elections constantly. The stable leader election algorithm (Aguilera et al., 2001) pairs election rounds with timeout-based failure detection: a leader holds its position for as long as it doesn’t crash and stays reachable, and only a suspected failure triggers a fresh round. This is why it’s worth reading leader election and failure detection together — a leader’s identity can change without you noticing, so “is my local idea of the leader still valid?” is answered by the failure detector, not by the election alone. (This is the through-line back to suspect-and-corroborate: you elect on a suspicion, and you’d better be right.)

Why “two leaders” is sometimes on purpose

Here’s the subtlety that surprises people: the safety-first protocols don’t always forbid two leaders in flight — they let a brief overlap happen and then resolve it, because that’s faster than preventing it outright. Dropping strict safety for a moment is a performance optimization: a would-be leader can start replicating immediately instead of waiting for a global lock, and correctness is recovered by detecting and killing the conflict.

  • Multi-Paxos tolerates two competing proposers, but only one can complete: the second quorum a proposal must collect guarantees two different proposers’ values can’t both be accepted.
  • Raft lets an old leader keep running until it discovers, from any message carrying a higher term, that its term is out-of-date — at which point it steps down. The stale leader’s uncommitted entries never reached a majority, so they’re simply overwritten.

In both, having a leader is how you guarantee liveness; allowing a fleeting second one is how you avoid stalling — and safety is restored by conflict resolution rather than by never letting it happen. That’s the same instinct as the invitation algorithm, just with a majority underneath it so the resolution is decisive.

The zombie leader, and the fencing token that kills it

Majority voting stops two leaders from being elected at the same time. But there’s a subtler ghost, and it bites systems that treated election as a solved checkbox. Picture the old leader — node 6 — that didn’t actually crash. It just froze: a long GC pause, a swap storm, a VM paused by the hypervisor. The rest of the cluster gives up on it, elects node 5 with a fresh majority, and moves on. Then node 6 wakes up. From its point of view nothing happened. It’s still the leader. It happily resumes writing.

Now you have a zombie leader: a formerly-legitimate leader acting on stale authority. A pause of a few seconds is enough. No amount of “only a majority can be elected” helps here, because 6 was never re-elected — it never noticed it was deposed.

The fix is a fencing token (Martin Kleppmann’s framing): every time a new leader is elected, hand it a number that only ever goes up — conveniently, the term/epoch you already have. The leader must attach that token to every write it sends downstream. The resource on the receiving end — the storage node, the lock service, the database — remembers the highest token it has seen and rejects anything older.

// Every write carries the leader's fencing token (its term/epoch).
let highestSeen = 0;

function acceptWrite(request) {
  if (request.token < highestSeen) {
    // A newer leader has already been here. This is a zombie.
    throw new Error(`fenced: token ${request.token} < ${highestSeen}`);
  }
  highestSeen = request.token;   // ratchet forward, never back
  return commit(request);
}

When zombie 6 wakes up and writes with token 41, the storage layer has already accepted token 42 from the new leader 5. Token 41 is stale, so it’s bounced. The zombie can believe it’s the leader all it likes — it can’t do anything, because its writes get fenced at the door.

Why this matters. Leader election is never “done” at the moment of election. Correctness depends on the deposed leader being unable to act — and you can’t rely on it noticing it was deposed (it may be frozen or partitioned). A monotonic token enforced at the resource is how you make “there was only ever one effective leader” true, even when there were briefly two live ones.

Where you actually meet this

You rarely hand-roll an election — you inherit one from a system you run. Here’s the map from the ideas above to the tools on your stack:

SystemHow it electsThe core idea in play
ZooKeeperClients create ephemeral sequential znodes; the lowest sequence number is the leader; others watch the next-lower node.Majority (ZAB quorum) + ephemeral nodes auto-releasing on session loss
etcdRaft under the hood; campaign on a key backed by a lease.Terms + majority votes + lease expiry as failure detection
KubernetesControllers/scheduler race to hold a Lease object (holderIdentity, renewTime); the holder renews, others wait for it to expire.Lease + fencing on renew time — lightweight, single-writer control loops
Kafka (KRaft)A dedicated controller quorum runs Raft to elect the active controller (replacing the old ZooKeeper dependency).Raft quorum for the metadata log
Redis SentinelSentinels detect a down master, elect a Sentinel leader (Raft-like) to run the failover, and promote a replica.Majority of sentinels + epoch numbers to fence stale failovers
Consul / NomadRaft across server nodes.Terms + majority — same shape as etcd

Notice the pattern: the toy algorithms (bully, ring) show up in textbooks; the production systems all converge on majority + a monotonic epoch, because that’s the only combination that’s actually safe.

The takeaway

“Pick the highest-ranked node” is a fine mental model right up until the network splits — and then it hands you two leaders and a corrupted dataset. The reframe that fixes it is small but total:

  1. An election is agreement, not selection. The hard part was never choosing a node; it was getting everyone to choose the same one under failure. That’s consensus, and consensus needs a majority — the one rule that can’t be satisfied in two partitions at once.
  2. Election doesn’t end at the vote. A deposed leader may not know it lost. Use a monotonic fencing token (the term/epoch) so a zombie’s stale writes get rejected at the resource. One leader elected isn’t enough; you need one leader effective.

So when you’re choosing how a system handles leadership, ask the two questions that separate the toys from the real thing: does it require a cluster-wide majority to win? and does it fence the old leader’s writes? If the answer to either is no, you don’t have a leader election — you have a split brain waiting for its partition.

Further reading. Garcia-Molina, “Elections in a Distributed Computing System” (1982); Chang & Roberts, “An Improved Algorithm for Decentralized Extrema-Finding” (1979); Ongaro & Ousterhout, “In Search of an Understandable Consensus Algorithm (Raft)” (2014); Lamport, “Paxos Made Simple” (2001); Aguilera et al., “Stable Leader Election” (2001); Martin Kleppmann, “How to do distributed locking” (2016) for fencing tokens, and Designing Data-Intensive Applications ch. 8–9 for the consensus framing. Structure follows the leader-election chapter of Alex Petrov’s Database Internals.