Why your consistent hashing still fails in prod
Hash ring vs range-based partitioning — failure modes, case studies, and when to use which

[Diagram: hash ring (clockwise, elastic adds/removes) vs key ranges [a..b) [b..c) [c..d) (contiguous on the key line, you pick the cuts). Same traffic, different invariants: opaque placement vs sortable segments.]

You shipped a hash ring. The deck looked perfect. PagerDuty disagreed. The ring only stops you from re-mapping every key when N twitches; it does not hand you even load, agreement on who is alive, or range scans for free. Below: failure modes with diagrams, three case studies with before/after, a deeper comparison with range-based partitioning, and a table for choosing your pain.

Why your consistent hashing is failing anyway

The algorithm is a placement policy, not a luck charm. These three failure modes are what show up in real incidents when the whiteboard was “correct.”

Hotspots and uneven load

Uniform hash spread is a statistical story. In production, a viral ID, a default tenant, or a shared prefix maps so much traffic to one physical host that your dashboard looks like a binary star. What breaks: p99 latency, CPU pegged on one node, “fair” autoscaling that adds replicas where load is not the problem. What actually helps: more virtual nodes, per-tenant rings, admission limits, and admitting that a ring is not a load generator.

[Diagram: hotspot on the ring. Same algorithm, one physical node pegged at 94% CPU while the others sit at 18–28%. Virtual nodes help spread; they do not delete a runaway key prefix or one celebrity tenant.]
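
A minimal simulation of that skew, assuming six nodes, md5-based placement, and a request mix where roughly 30% of traffic hits one viral key; the node count and numbers are made up for illustration. The point it demonstrates: no re-cut of the ring splits a single key's traffic.

import hashlib
from collections import Counter

def owner(key: str, nodes: list[str]) -> str:
    # Stable placement: md5 instead of Python's per-process-salted hash().
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

nodes = [f"node-{i}" for i in range(6)]
requests = ["product:4421"] * 30_000                    # the viral key
requests += [f"session:{i}" for i in range(70_000)]     # everything else, well spread

load = Counter(owner(k, nodes) for k in requests)
print(load.most_common())
# One node carries the viral key's 30k requests on top of its fair share;
# adding nodes or virtual nodes re-cuts the quiet arcs, not this one key.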

Successor overload when a node dies

A failed node’s keys walk clockwise to the next live host. If that host was already full, the outage is not “one box” — the successor takes a double serving and may fall over, domino-style.

[Diagram: node D drops and every key that stopped at D re-homes clockwise to E, which inherits D's arc (ring simplified to A, B, C, D offline, E). If E was already at 70%, this is a bad day.]
Operations takeaway: size successors for “failure + handoff” load, not average steady state. Replicas and bounded queues are part of the routing story.
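
A back-of-the-envelope headroom check in that spirit. The utilization numbers are invented, and the single-successor assumption is a simplification: with virtual nodes the failed load spreads across several inheritors, so a real check would apportion by arc.

# Flag successors that would blow past capacity if their clockwise predecessor died.
steady_load = {"A": 0.35, "B": 0.40, "C": 0.55, "D": 0.30, "E": 0.70}   # steady-state CPU share
ring_order = ["A", "B", "C", "D", "E"]                                  # clockwise ownership order

for i, node in enumerate(ring_order):
    successor = ring_order[(i + 1) % len(ring_order)]
    after_failover = steady_load[successor] + steady_load[node]         # successor inherits the whole arc
    if after_failover > 0.85:                                           # keep headroom for the handoff spike
        print(f"if {node} dies, {successor} lands at {after_failover:.0%}: resize or re-spread")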

Stale ring views and “split” routing

Process A still routes with ring generation 41. Process B uses 42 (new node added). For minutes, the same key can land in two different places. That is a hash ring failure in the “people disagree on truth” sense — duplicate work, 409s, or silent divergence.

[Diagram: two clients, two membership snapshots. A cached client on gen 41 routes key K to host 10.0.1.4, which was removed 90 s ago; writes may 404 and retries amplify. A fresh client on gen 42 routes K to 10.0.1.8. Convergence needs a gossip TTL plus a version gate on deploys. Test membership updates the way you test payments, not “eventual, probably”.]
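
A minimal reproduction of that split, assuming md5 placement and one ring position per host; the host IPs and generation numbers mirror the diagram, and a plain successor-on-a-sorted-ring lookup stands in for whatever your real router does.

import bisect, hashlib

def ring_owner(key: str, hosts: list[str]) -> str:
    # One position per host; owner = first position >= hash(key), wrapping clockwise.
    positions = sorted((int(hashlib.md5(h.encode()).hexdigest(), 16), h) for h in hosts)
    hk = int(hashlib.md5(key.encode()).hexdigest(), 16)
    i = bisect.bisect_left([p for p, _ in positions], hk) % len(positions)
    return positions[i][1]

gen_41 = ["10.0.1.4", "10.0.1.8", "10.0.1.9"]   # stale map: still lists the removed host
gen_42 = ["10.0.1.8", "10.0.1.9"]               # fresh membership

for key in (f"session:{i}" for i in range(10_000)):
    stale, fresh = ring_owner(key, gen_41), ring_owner(key, gen_42)
    if stale != fresh:
        print(f"{key}: gen 41 says {stale}, gen 42 says {fresh}")
        break
# Until every client converges on one generation, two processes can write the
# same key to two different places.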

Hash ring 101 — enough to compare with ranges

Map hash output to a circle, put nodes on it (classically many virtual nodes per box), hash each key, walk clockwise to the first node — that is the owner. hash(key) % N re-homes most keys when N changes; a ring re-homes keys only near the add/remove — the waiter-quit analogy: you re-seat one section, not the whole restaurant.

[Diagram: one physical host, many virtual-node markers. Host B appears on the hash space as B1, B2, B3; three virtual nodes spread better than one dot, but if B dies all three adjacent arcs re-home, so neighbor effects remain. On the ring (0..max, clockwise) with nodes A, B (virtual x3), C, key K is owned by the first position >= h(K), wrapping.]
# Modulo: resize N → almost every key moves
node = abs(hash(key)) % N   # picks a node index; note Python's hash() is salted per process, so use a stable hash (md5, xxhash) for anything shared
# Consistent: sorted ring positions; owner = first position >= hash(key), wrapping around; virtual nodes = many positions per host
# Range: owner = the interval [lo, hi) that contains the key; split or move intervals to scale
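
A runnable sketch of that clockwise walk, under stated assumptions: md5 for stable positions, 64 virtual nodes per host, bisect for the successor lookup. The HashRing name and host labels are illustrative, not any particular library's API; the final print compares how many keys re-home when a seventh host joins, ring vs modulo.

import bisect, hashlib

def _pos(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)   # stable across processes

class HashRing:
    def __init__(self, hosts, vnodes=64):
        # Each host appears vnodes times on the ring for a smoother spread.
        self._ring = sorted((_pos(f"{h}#{i}"), h) for h in hosts for i in range(vnodes))
        self._keys = [p for p, _ in self._ring]

    def owner(self, key: str) -> str:
        i = bisect.bisect_left(self._keys, _pos(key)) % len(self._ring)   # first position >= h(key), wrap
        return self._ring[i][1]

hosts = [f"10.0.1.{i}" for i in range(1, 7)]
before, after = HashRing(hosts), HashRing(hosts + ["10.0.1.7"])

keys = [f"session:{i}" for i in range(100_000)]
moved_ring = sum(before.owner(k) != after.owner(k) for k in keys)
moved_mod = sum(_pos(k) % 6 != _pos(k) % 7 for k in keys)
print(f"ring: {moved_ring / len(keys):.1%} re-home; modulo: {moved_mod / len(keys):.1%}")
# Expect roughly 1/7 of keys to move on the ring vs ~6/7 with modulo.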

What the ring is not: a linearizability layer, a fix for SELECT … BETWEEN without a query plan, or a substitute for agreed membership. For ordered primary keys and big scans, read on.

Case studies — when the diagram matched reality (and when it did not)

Composite stories from production-style incidents — the numbers are illustrative and do not come from a single named company. They are useful because they show the metric shape of each failure class.

Case 1: Viral product key on a large session cache (hash ring, Redis-like)

Shape: 18-node cache, consistent hashing in front, JSON blobs keyed by session:{id} and shared read-through to product:{productId} for a flash sale. Trigger: one product:4421 went viral. Observed: one primary shard CPU 91%, others 18–30%; p99 get from 4 ms → 180 ms for unrelated keys co-located on that node’s responsibility arc. Root cause: not N or the hash — application key skew. The ring was fair; the business was not.

What worked: a tiny second cache pool with explicit ephemeral keys for the hot product, negative caching for thundering miss loops, and a 10k RPS per-key ceiling in the app. Reclustering the ring would have been theater.
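
A sketch of that per-key ceiling, assuming a simple in-process fixed-window counter. The 10k RPS figure comes from the case; the names, the one-second window, and where the check runs are illustrative, and a production limiter would more likely sit at the edge or in the shared cache client.

import time
from collections import defaultdict

CEILING_RPS = 10_000                    # per-key budget from the incident review
_window = defaultdict(lambda: (0, 0))   # key -> (second, count so far)

def admit(key: str) -> bool:
    # Fixed one-second window per key: cheap, approximate, good enough to stop a stampede.
    now = int(time.time())
    sec, count = _window[key]
    if sec != now:
        sec, count = now, 0
    _window[key] = (sec, count + 1)
    return count < CEILING_RPS

if admit("product:4421"):
    pass   # read through to the hot-product pool
else:
    pass   # serve the last cached value; do not pile onto the overloaded shard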

Case 2: Rolling deploy with two ring generations (7 minutes of “ghost” node)

Shape: 40 edge nodes each embedding a 2 MB cluster map. Trigger: canary on 5 nodes got map v412; rest still v411 with one host removed. Observed: 0.08% of writes duplicated or retried to wrong target; reconciliation job depth +3×. Root cause: clients not atomically switching maps at the same generation.

What worked: a server-side generation stamped on every response, clients refusing to route writes off a map more than 30 s stale, and a deploy gate of “wait for 99% gen match” before the next wave. Pager noise dropped to background.
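
A sketch of that write-side gate, assuming the server echoes its map generation in every response and the client tracks the freshest generation it has seen. Field names and mechanics are illustrative; only the 30-second cutoff and the generation-in-every-response idea come from the case.

import time

class RingMap:
    def __init__(self, generation: int, hosts: list[str]):
        self.generation = generation
        self.hosts = hosts
        self.fetched_at = time.monotonic()

class Router:
    MAX_STALE_S = 30   # writes refuse a map this stale once a newer generation is known

    def __init__(self, ring_map: RingMap):
        self.map = ring_map
        self.newest_seen = ring_map.generation

    def note_response(self, server_generation: int) -> None:
        # Every server response carries its map generation; remember the freshest.
        self.newest_seen = max(self.newest_seen, server_generation)

    def ok_for_write(self) -> bool:
        behind = self.map.generation < self.newest_seen
        stale = time.monotonic() - self.map.fetched_at > self.MAX_STALE_S
        return not (behind and stale)   # reads may tolerate this; writes should refetch the map first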

Case 3: Time-series + monthly ranges — backfill made January “the fat shard”

Shape: metrics DB sharded by month on (tenant_id, t) PK. Trigger: 6-day historical backfill for one tenant. Observed: January partition 4× the read QPS of February; p99 on that range 2.1s vs 120 ms elsewhere. Root cause: hot range — a range partition problem, not a ring problem.

[Diagram: range split, fat month into sub-ranges. Before: one range [Jan 1 … Feb 1) owns all January inserts and is hot from backfill plus live traffic. After: an admin split at Jan 15 leaves [Jan 1 … Jan 15) in place and moves [Jan 15 … Feb 1) to a new tablet server. Cost: a one-time data copy plus cutover, cheaper than rethinking the whole sharding function.]
What worked: split the month at a mid-month boundary, move the right sub-range to a fresh node, then throttle backfill fan-out. Range gave a nameable lever: “the January right-hand tablet.”
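
The split itself is just a routing-table change once the data copy is done. A sketch, assuming half-open [lo, hi) ranges over the time component of the key; the year and the tablet names are illustrative.

from datetime import date

# (lo, hi, node): contiguous half-open ranges over the time part of the PK
ranges = [
    (date(2024, 1, 1), date(2024, 2, 1), "tablet-3"),   # the fat January range
    (date(2024, 2, 1), date(2024, 3, 1), "tablet-3"),
]

def split(ranges, cut, new_node):
    out = []
    for lo, hi, node in ranges:
        if lo < cut < hi:
            out.append((lo, cut, node))          # left half stays put
            out.append((cut, hi, new_node))      # right half moves after copy + cutover
        else:
            out.append((lo, hi, node))
    return out

ranges = split(ranges, date(2024, 1, 15), "tablet-7")
# [Jan 1, Jan 15) stays on tablet-3; [Jan 15, Feb 1) now answers from tablet-7.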

Range-based partitioning — deeper (why teams still love it)

You carve the sortable keyspace into [start, end) intervals. A shard (tablet, region, “partition”) answers every key that sorts in that half-open range. Range-based partitioning is how Bigtable, HBase, Cockroach, Spanner-style systems, and many SQL shard routers (Vitess, Citus, etc.) think — because the storage engine already orders keys.

Lookup: binary search a sorted list of ranges (or a tree), often cached. Rebalance: split a range, move a subrange, or merge. Failure modes mirror the ring’s but on a line: hot range (one interval gets all the traffic), bad cut (split that does not split load), stale range map (two routers disagree on boundaries).

[Diagram: key order as a line, ranges as intervals on it from min to max: R1 and R2 on host A, R3 on host B. Split R3 if it grows; move [lo, mid) or [mid, hi) to a new host. The work is local to that segment.]
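
A minimal version of the lookup described above, assuming a sorted list of range start keys and a bisect for the binary search; keys and hosts are illustrative.

import bisect

# Contiguous half-open ranges: starts[i] owns keys in [starts[i], starts[i+1])
starts = ["a", "f", "m", "t"]                    # boundaries in key sort order
owners = ["host-A", "host-A", "host-B", "host-C"]

def range_owner(key: str) -> str:
    i = bisect.bisect_right(starts, key) - 1     # last start <= key
    return owners[max(i, 0)]                     # clamp keys that sort before the first start

print(range_owner("melon"))   # sorts into [m, t) -> host-B
print(range_owner("zebra"))   # past the last start -> host-C owns [t, max)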

What you gain vs a pure hash ring

  • Range and prefix scans that can stay on one (or a few) shards — time windows, WHERE pk BETWEEN, “all of tenant 7’s rows” with a well-chosen key.
  • Operational handles: “this interval is the fire” beats “this arc after 0x9f2…” for humans on call.
  • Alignment with time-ordered PKs so you can plan where new writes land.

What you pay

  • Hot ranges and bad boundaries — same class as hot spots, different axis (sort order).
  • Rebalancing is explicit work — copy, verify, cut over; automation helps but is not free.
  • Every client needs the range map — same discipline as a ring: versions, health, tests.
Neither model deletes skew. If 40% of traffic is one key prefix, you need product limits, dedicated resources, or manual cut points — the layout only decides how pain propagates when you add iron.

When to use a hash ring vs range-based partitions

Use the table when you are picking a default for a new system — not when you are cargo-culting the last project.

Factor     | Prefer consistent hashing                                  | Prefer range partitions
-----------|------------------------------------------------------------|------------------------------------------------------------
Key shape  | Opaque IDs, session keys — no business sort in the key.    | Sortable PK, time, tenant prefix you rely on in queries.
Queries    | Point gets/sets, cache semantics.                          | Scans, feeds, BETWEEN, time-series rollups.
Elasticity | Nodes in/out often; want small remaps.                     | More static cluster, or ops-driven splits.
Control    | Virtual nodes, many rings, app-level sharding of hot keys. | Named intervals, move a tenant or month by moving a range.
Stack      | Caches, Dynamo-style K/V, many pub/sub consumer maps.      | Wide-column, distributed SQL, tablet stores.

Combos that are not hypocrisy: ring at the cache (elastic, opaque) + range in the database (ordered, queryable). Hash to bucket then range inside is also common — two layers, two invariants, document both.
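
One way the hash-to-bucket-then-range combo can look, as a composite-key sketch; the bucket count, key layout, and names are illustrative, not a specific system's schema.

import hashlib

N_BUCKETS = 16   # layer 1: hash the tenant to a bucket, spreading tenants across shards

def composite_key(tenant: str, ts_iso: str, event_id: str) -> tuple:
    bucket = int(hashlib.md5(tenant.encode()).hexdigest(), 16) % N_BUCKETS
    # layer 2: inside a bucket, rows stay sorted by (tenant, time) so range scans still work
    return (bucket, tenant, ts_iso, event_id)

print(composite_key("tenant-7", "2024-01-15T00:00:00Z", "evt-1"))
# One tenant's time window is a contiguous range inside exactly one bucket;
# new tenants spread across buckets without a manual placement decision.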

Closing

Consistent hashing is still the right default for a lot of elastic, point-read-heavy, opaque-key infrastructure. Range-based partitioning is the right default when the product is about order and ranges. The case studies above share one theme: the incident report starts with metrics and ownership, not with “we need more vnodes” as the only knob. Map your failure modes, pick the pain you can run in operations, then draw the pretty picture on top — not the other way around.