Here's the thing about LSM trees: every senior engineer eventually meets them, usually because something is on fire. A compaction storm at 4 AM, an IOPS budget that just got eaten, a write-heavy service that's mysteriously slow on reads — and someone says "yeah, it's an LSM thing." This post is the version of LSM I wish someone had handed me before the pager went off. We'll go from MemTable to compaction strategy, from bloom filter math to the tuning knobs that matter, and we'll be honest about where the structure sucks.
Why LSM exists at all (the B-tree problem)
B-trees are beautiful. They're also random-write machines. Insert a row, you walk down to a leaf page, modify it in place, dirty the page, eventually fsync it back to disk. On spinning rust that's a seek tax per write. On SSDs the seek is gone, but you're still doing read-modify-write at page granularity, blowing through the page cache, and writing 16 KB to mutate 100 bytes. Multiply by ten secondary indexes and you've got a write amplification problem dressed in academic clothes.
LSM flipped the script in 1996 (O'Neil et al., the OG paper). The idea is almost insultingly simple: don't update anything in place. Treat the disk like a log. Buffer writes in memory, flush them as sorted runs, and merge those runs in the background when you get bored. You trade read complexity for write throughput. For workloads where writes outnumber reads — telemetry, logs, ledgers, time-series, KV stores backing services — that trade prints money.
The core idea, blunt version
An LSM tree is essentially a multi-level cache hierarchy for writes. Each level is sorted, immutable, and progressively bigger and colder. When you write, you write to the top (RAM). When you read, you check top-down until you find it. When the top fills up, it spills into the next level. The next level fills up, it spills further. Forever.
Reads are inherently more expensive — you might have to check the MemTable, plus L0, plus L1, plus L2, etc., before you find the value (or determine it doesn't exist). That's why every LSM has bloom filters on each SSTable: a fast probabilistic "definitely not here" check that lets you skip 99% of the files for any given key. We'll deep-dive that later.
Anatomy: every part, what it actually does
MemTable — the in-RAM write buffer
The MemTable is where every write lands first. It's a sorted in-memory structure — almost always a skiplist (RocksDB's InlineSkipList, LevelDB's skiplist, Cassandra's ConcurrentSkipListMap). RocksDB also ships pluggable hash-skiplist and vector implementations, but skiplist is the default everywhere — we'll get to why in a moment. Why sorted at all? Because when we eventually flush to disk, we want the SSTable to be sorted so reads can binary-search and merges (compaction) can run in linear time.
Skiplist gets picked because it's concurrency friendly — multiple writers can insert with cheap atomic CAS operations and no rebalancing tantrums. A balanced tree (red-black, AVL) rebalances on insert, which forces coarse locking across the structure. At 100k writes/sec, that's the difference between throughput and a thread-contention slideshow.
// MemTable — pseudocode
memtable.put(key, value):
    internal_key = key + monotonically_increasing_seq + op_type   // op_type: PUT / DEL / MERGE
    skiplist.insert_concurrent(internal_key, value)
    if memtable.size() > flush_threshold:
        freeze(memtable)                 // becomes immutable
        new_memtable = MemTable()        // writers swing to the new one
        background.flush_to_L0(frozen)   // sequential write of a sorted SSTable
When the MemTable hits its size limit, it gets frozen — it becomes immutable, and a fresh MemTable takes over. The frozen MemTable sits around briefly while a background thread flushes it to L0 as an SSTable. This is why you'll often see two or more MemTables in memory at once: one active, one or more pending flush.
WAL (Write-Ahead Log) — the only reason your data survives a crash
Here's the awkward bit: the MemTable is in RAM. Your process crashes, the kernel panics, somebody trips on the power cable — that data is gone. So before the write hits the MemTable, it gets appended to a WAL file on disk. The WAL is sequential, append-only, and (depending on your durability setting) fsync'd before the client gets an ack.
If you crash, on restart the engine replays the WAL: re-applies every write into a fresh MemTable. Once a MemTable successfully flushes to an SSTable, the corresponding WAL segment is deleted (or marked obsolete). This is why people sometimes call the WAL the "redo log" — it's literally re-doing the work that hadn't been persisted yet.
fsync after every write = bulletproof but slow. fsync every N ms (group commit) = the practical default. No fsync = you eat hot data on a crash; only acceptable for caches or replicated systems where another replica has the data. RocksDB's per-write sync flag and Cassandra's commitlog_sync are exactly this knob.
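In RocksDB terms that knob is per-write, not per-database. A minimal sketch — it assumes db is an open rocksdb::DB*, and the key/value are placeholders:
// Durability is chosen per write via WriteOptions.
rocksdb::WriteOptions synced;
synced.sync = true;                        // fsync the WAL before acking — bulletproof, slow
rocksdb::WriteOptions buffered;            // default: append to the WAL, leave fsync to the OS / periodic flush
rocksdb::WriteOptions yolo;
yolo.disableWAL = true;                    // skip the WAL entirely — a crash loses whatever is still in the MemTable
db->Put(synced, "payment:42", "captured"); // pick the durability level per write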
SSTable — Sorted String Table
SSTable is just a file. A sorted, immutable file. Inside, keys are stored in order, and there's a small index at the end pointing to "block X starts at byte offset Y." When the MemTable flushes, it walks its sorted contents and writes them into an SSTable in roughly this layout:
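(A generic sketch — block names and ordering vary by engine, and RocksDB's BlockBasedTable format carries more metadata sections than shown here.)
[data block 0][data block 1] ... [data block N]   // sorted key→value entries, a few KB per block
[filter block]                                    // bloom filter bits covering every key in this file
[index block]                                     // sparse index: one separator key + offset per data block
[footer]                                          // fixed-size: where the index and filter blocks live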
The sparse index is the small win that makes this whole thing work. You don't load every key into memory — you load one entry per block (a key separator, a few KB apart). To find key k, you binary-search the sparse index to find which block could contain it, read that one block (typically 4–64 KB), and binary-search inside it. That's one block read per SSTable for any key — and with the bloom filter and block cache in RAM, that often costs zero disk seeks.
Levels — L0, L1, L2, …, Ln
SSTables aren't all equal — they're organized into levels. L0 is the dumping ground for fresh flushes. Each subsequent level is roughly 10× the size of the previous one. That 10× ratio (configurable: max_bytes_for_level_multiplier in RocksDB) is the magic number that keeps the total number of levels small — typically 5–7 levels for terabytes of data.
The crucial property: at L1 and below, files have disjoint key ranges. A key can live in at most one file per level. That means a read at L1 hits exactly one SSTable (after the bloom filter), L2 one SSTable, and so on. L0 is the exception — because flushes happen independently, two L0 files might both contain key foo (the newer one wins). That's why L0 has staged thresholds in RocksDB: level0_file_num_compaction_trigger=4 kicks off L0→L1 compaction, level0_slowdown_writes_trigger=20 begins throttling writers, and level0_stop_writes_trigger=36 blocks writes outright until compaction catches up.
Compaction — the bill collector
Compaction is what makes LSM work. Without it, you'd have hundreds of L0 files and reads would die. Compaction is a background job that picks N sorted SSTables, merges them by key, and writes a new sorted SSTable. Old versions, deleted keys, and tombstones get dropped during the merge.
Compaction is also where tombstones get reckoned with. When you delete a key in an LSM, you don't actually go find and remove it (that would defeat the whole "no in-place writes" thing). Instead you write a tombstone — a record that says "k4 is dead as of seq 1234." Reads see the tombstone and treat the key as missing. The tombstone sticks around through compactions until it reaches the bottom level, at which point it can finally be dropped because there's nothing older underneath that could resurrect the key.
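A minimal sketch of that merge, under heavy simplification — real compactions do a streaming k-way merge instead of sorting everything in memory, and they have to respect open snapshots — but the version-dropping and tombstone logic is the part that matters. All names here are illustrative, not any engine's real types:
// One compaction step, minimal version: merge sorted runs, keep only the newest
// version of each key, and drop tombstones once we're writing the bottom level.
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct Entry { std::string key; uint64_t seq; bool tombstone; std::string value; };

std::vector<Entry> CompactMerge(std::vector<Entry> all, bool bottom_level) {
  // Sort by (key asc, seq desc) so the newest version of each key comes first.
  std::sort(all.begin(), all.end(), [](const Entry& a, const Entry& b) {
    return a.key != b.key ? a.key < b.key : a.seq > b.seq;
  });
  std::vector<Entry> out;
  std::string last_key;
  bool seen_key = false;
  for (const Entry& e : all) {
    if (seen_key && e.key == last_key) continue;   // older version of a key we've already handled — drop it
    last_key = e.key;
    seen_key = true;
    if (e.tombstone && bottom_level) continue;     // nothing older exists below — the tombstone finally dies here
    out.push_back(e);                              // keep the newest version (or a tombstone we must keep carrying)
  }
  return out;
}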
Cassandra even has a named failure mode for this: TombstoneOverwhelmingException, raised when a single read trips over too many tombstones. A read for a "missing" partition can scan thousands of tombstones before realizing nothing's there. This is why bulk deletes in Cassandra are a known footgun — you should TTL data instead, or use a separate keyspace that you can TRUNCATE.
Bloom filters — the read-path savior
Without bloom filters, an LSM read for a missing key has to check the MemTable + every L0 file + one file per level (L1, L2, L3, L4) — that's 5–10+ disk I/Os just to confirm "not here." That's catastrophic for any read-heavy mix. Bloom filters solve this with delicious math.
A bloom filter is a bit array plus k hash functions. To insert key x, you hash it k times and set those k bits. To check membership of y, you hash y with the same k functions and check those bits. If any bit is 0, y is definitely not in the set. If all bits are 1, y is probably in the set (could be a false positive — collision).
The math you actually care about: with 10 bits per key and 7 hash functions, the false positive rate (FPR) is about 1%. At 16 bits/key it drops to roughly 0.05%. Bigger filter = fewer wasted disk reads = more RAM. The conventional RocksDB setting is 10 bits/key — note that filters are off entirely until you set a filter_policy — and you can crank it up from there.
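To make that tangible, here's a toy bloom filter — purely illustrative, not RocksDB's implementation. It uses the double-hashing trick described right below this block, with std::hash standing in for a real mixing hash:
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

class ToyBloomFilter {
 public:
  // 10 bits/key with k = 7 hashes lands near the classic ~1% false-positive rate.
  ToyBloomFilter(size_t expected_keys, size_t bits_per_key = 10, int k = 7)
      : bits_(expected_keys * bits_per_key, false), k_(k) {}

  void Add(const std::string& key) {
    auto [h1, h2] = Hashes(key);
    for (int i = 0; i < k_; ++i)
      bits_[(h1 + i * h2) % bits_.size()] = true;      // set k derived bit positions
  }

  // false => key is definitely absent; true => key is probably present
  bool MayContain(const std::string& key) const {
    auto [h1, h2] = Hashes(key);
    for (int i = 0; i < k_; ++i)
      if (!bits_[(h1 + i * h2) % bits_.size()]) return false;
    return true;
  }

 private:
  // Derive all k probe positions from two base hashes as h1 + i·h2 (Kirsch–Mitzenmacher).
  // std::hash is a stand-in; real filters use stronger, explicitly seeded hashes.
  std::pair<uint64_t, uint64_t> Hashes(const std::string& key) const {
    uint64_t h1 = std::hash<std::string>{}(key);
    uint64_t h2 = (h1 * 0x9E3779B97F4A7C15ULL) | 1;    // keep the stride odd so it never degenerates
    return {h1, h2};
  }

  std::vector<bool> bits_;
  int k_;
};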
"7 hash functions" sounds expensive but isn't — the standard trick is double hashing (Kirsch–Mitzenmacher): compute two real hashes, then derive the rest as h_i = h1 + i·h2. Same theoretical FPR, a fraction of the CPU.
Read path: how a GET actually happens
A GET checks sources newest-first: the active MemTable, then any immutable MemTables waiting to flush, then every L0 file from newest to oldest (they can overlap), then at most one file per level from L1 down — bloom filter first, then sparse index, then a single block read. The read merges sequence numbers across those sources — the highest-seq-num version of the key wins. If the highest version is a tombstone, the result is "not found." That's it. Sounds simple; the implementation has decades of dragons.
The Meme Comparison Table™ — LSM vs B-Tree
| Vibe Check | 🌳 B-Tree | 🪵 LSM Tree |
|---|---|---|
| Write speed | OK on SSD, sad on HDD. In-place page mutations. | 🚀 Sequential appends. The chad of writes. |
| Read speed (hot data) | 🚀 ~log N, single tree walk. | Decent — bloom filters carry the team. |
| Read speed (cold data) | 1 disk seek, predictable. | 1–N seeks, depends on level + bloom luck. |
| Range scans | 🚀 Native, leaf pages are linked. | Multi-way merge across levels. Slower but works. |
| Space amplification | Low — some fragmentation, fillfactor headroom. | Higher — old versions wait for compaction. Strategy-dependent. |
| Write amplification | Moderate — page rewrites + WAL doubling. | High — every byte gets rewritten across levels during compaction. 😬 |
| SSD lifetime | Random small writes wear out cells faster. | Sequential big writes are SSD-friendly. |
| When your DBA cries | Fragmented index, page splits, REINDEX windows. | L0 stalls, compaction storms, 4 AM disk burst. |
| Your monitoring graph | Smooth-ish. | Sawtooth city. |
| Tuning knobs | Fillfactor, page size, vacuum. | 20+ knobs across compaction, bloom, levels, WAL. |
| Where you'll see it | Classic OLTP relational engines (InnoDB, Postgres, etc.). | Embedded KV libraries, distributed wide-column / KV stores, NewSQL local engines. |
| One-line take | "Pay at write, save at read." | "YOLO write, sort it out at 3 AM." |
Who uses LSM trees and why (overview level)
The list of systems built on LSM ideas is long, and any one of them deserves its own post. At the general level, here's the lay of the land — no specific config, no internals you'll regret quoting in a year:
- Embedded LSM engines — LevelDB (Google's original) and RocksDB (Meta's heavily-tuned successor). These are libraries, not databases. They get embedded inside other systems whenever someone needs a write-optimized KV store and doesn't want to reinvent one.
- Distributed wide-column / KV stores — Cassandra, HBase, ScyllaDB, BigTable. All chose LSM because their sweet spot is high write throughput across many nodes, with reads dominated by recent or partition-local data. SSTables on disk become the natural unit of replication, repair, and compaction.
- NewSQL / distributed SQL — systems like TiKV (under TiDB) and others use an embedded LSM engine as their per-node storage layer, with consensus (Raft/Paxos) and range-based sharding bolted on top. The LSM is the local commit/storage path; the distribution is a separate problem.
- Analytics engines, LSM-inspired — ClickHouse's MergeTree family isn't a pure LSM, but it shares the philosophy: write small immutable parts, merge them in the background. Column-store on top, LSM-style merging underneath.
- Append-only logs (related, not LSM) — Kafka's segment files are sequential, immutable, and background-compacted. Not an LSM (no levels, no MemTable), but the same family of trade-offs.
The pattern is the same everywhere: the workload is write-heavy, the storage is sequential-friendly, and reads can tolerate looking in more than one place. That's the LSM thesis. Each system layers its own concurrency, distribution, and indexing on top — and that's where they diverge enough that you should always read the docs of your version before quoting any specific number.
When to USE LSM (and when not to)
Don't @ me, but most people pick LSM because they read about it on a blog and forgot to ask if their workload matches.
| Workload | LSM verdict | Why |
|---|---|---|
| Write-heavy ingest (logs, telemetry, IoT) | ✅ Yes, gold standard | Sequential writes, append-shaped, reads are mostly recent. |
| Time-series | ✅ Yes | Recent data hot, old data cold — maps directly to level hierarchy. |
| KV cache or session store | ✅ Often | Bloom filters + page cache make point reads fast. |
| Append-only event store | ✅ Yes | Immutability is the whole point. |
| OLTP with hot point reads on warm data | ⚠ Maybe | Tunable, but B-tree predictability is hard to beat. |
| Heavy random reads on cold, multi-TB data | ❌ Risky | Read amplification stings; bloom filter RAM cost balloons. |
| Heavy in-place updates / counters | ⚠ Use Merge ops | Naïve PUT causes write amp explosion; use atomic merge operators. |
| Aggressive deletes / tombstone-heavy | ❌ Pain | Cassandra users have war stories. TTL > DELETE. |
| Single-writer in-process embedded DB | ✅ Yes (RocksDB/LevelDB) | Embedded, fast, configurable. Default choice. |
| Strict latency SLO on tail (p99.9) | ⚠ Watch out | Compaction storms tank the tail. Need universal/tiered tuning. |
The Three Amplifications
Write amplification (W-amp)
You write 1 GB of user data. By the time it's compacted from L0 → L1 → L2 → L3, that 1 GB has been re-written multiple times — once at flush, then again for every compaction it gets pulled into. For leveled compaction, total write amplification in the 10–30× range is routinely reported, and it grows with the level count and the level multiplier. This is the dirty secret of LSM. On cloud SSDs with provisioned IOPS, it's a real cost line.
Read amplification (R-amp)
For a single GET, how many SSTables and blocks did the engine actually touch? In the worst case (key not present), the read has to consult every level — that's where bloom filters earn their keep, eliminating most of the candidates. Range scans have higher R-amp because bloom filters don't help with iteration.
Space amplification (S-amp)
You have live data on disk plus all the old versions and tombstones that haven't been compacted away yet. Leveled keeps S-amp tight (most data lives in the bottom level, very little duplication elsewhere). Universal / tiered can spike during a major merge because the merge has to write the new run before deleting the inputs. FIFO is the cheapest by definition — it just drops files.
Compaction strategies — the deep dive
Leveled compaction (RocksDB default, Cassandra LCS)
Each level has a target size (10× the previous, by default). When a level overflows, pick one file from it, find overlapping files in the next level, merge them all into the next level. Read amp: low — at most one file per level. Space amp: low — most data sits in the bottom level, little duplication. Write amp: high — every byte gets rewritten roughly once per level on its way down. Use for read-heavy or space-constrained workloads.
Tiered / Size-tiered (Cassandra STCS, RocksDB universal)
Don't enforce strict level sizes. Instead, when N files of similar size accumulate, merge them into one bigger file at the next tier. Write amp: low — each byte gets rewritten far fewer times than in leveled. Read amp: higher — multiple sorted runs at each tier, all need bloom checks. Space amp: can spike during a major merge because the new run is written before the inputs are dropped. Use for write-heavy with predictable read patterns.
FIFO
No compaction. Drop the oldest SSTable when total size exceeds threshold. Only sane for time-series with TTL where old data is genuinely worthless. RocksDB supports it. Cassandra has a similar idea via TWCS + TTL.
TWCS (Cassandra Time-Window Compaction)
The "compromise" strategy. Group SSTables by time-window (say, 1 day). Compact within a window using STCS. Once a window is closed, never touch its files again. Perfect for telemetry — old days don't get rewritten 14 times. Read it, set it, sleep at night.
Bloom filter math (the senior version)
The false positive rate of a bloom filter with m bits, n inserted keys, and k hash functions is:
FPR ≈ (1 − e^(−kn/m))^k
Optimal k for a given m and n:
k* = (m/n) × ln(2)
Optimal m for a target FPR p with n keys:
m* = -(n × ln p) / (ln 2)^2
≈ -1.44 × n × log2(p)
Example: 1M keys, target 1% FPR:
m ≈ 9.6M bits ≈ 1.2 MB
k ≈ 7 hash functions
→ about 10 bits per key (the famous default)
The "10 bits per key, 7 hashes, 1% FPR" magic numbers come straight out of those formulas with p = 0.01. RocksDB's BloomFilterPolicy(10) is exactly this configuration. Crank to 16 bits/key for 0.05% FPR if your workload hits a lot of negative lookups.
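If you want to sanity-check those numbers yourself, the formulas transcribe directly into a few lines of C++ — this just reproduces the 1M-key / 1% example above:
// Bloom sizing helper — a direct transcription of the formulas above.
#include <cmath>
#include <cstdio>

int main() {
  const double n = 1e6;       // keys in the SSTable
  const double p = 0.01;      // target false-positive rate
  const double m = -n * std::log(p) / (std::log(2.0) * std::log(2.0));  // total bits
  const double k = (m / n) * std::log(2.0);                             // optimal hash count
  std::printf("bits: %.0f (~%.2f MB), bits/key: %.1f, k: %.1f\n",
              m, m / 8 / 1e6, m / n, k);
  // prints roughly: bits: 9585059 (~1.20 MB), bits/key: 9.6, k: 6.6 -> round up to 7
  return 0;
}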
Tuning, at the conceptual level
Specific config flags drift between RocksDB versions, Cassandra versions, and forks. Rather than quote a knob name that will be stale in a year, here's the conceptual checklist every LSM tuner walks through. Look up the exact option in your engine's docs:
- MemTable size. Bigger memtable = fewer flushes = fewer L0 files and less compaction churn — at the cost of RAM and longer flush pauses. This is the first dial to touch.
- Number of in-flight memtables. If flushes can't keep up with writes, you want a couple of memtables queued for flush rather than blocking writers. Too many wastes memory.
- Level multiplier. 10× per level is the standard. Higher multiplier = fewer levels = lower read amp but more work per compaction.
- L0 thresholds. Most engines have separate thresholds for "start compacting", "slow down writes", and "stop writes". If you ever hit "stop writes", your compactor is under-provisioned — that's the headline metric.
- Bloom filter bits per key. ~10 bits/key is the conventional default (≈1% FPR). Crank it for read-heavy / negative-lookup-heavy workloads, drop it only if you're starved for RAM.
- Block cache. Your read-side life raft. Size it so the working set of indexes + filters + hot blocks fits.
- Compaction parallelism / throughput cap. Compaction must keep up with ingest, but on a shared box it shouldn't starve the foreground. Most engines let you cap MB/s and parallel jobs separately.
- Compression. Light compressor (Snappy/LZ4) for hot levels, heavier (ZSTD) for the bottom level — colder data, more aggressive squeeze.
- Compaction strategy. Leveled for read/space-sensitive, tiered/universal for write-heavy, time-window for time-series. Pick once, monitor, revisit.
- Durability mode. Sync-on-every-write vs periodic group commit. Decide once, with the storage and replication topology in mind.
What to actually monitor (the short list)
| Metric to watch | Danger zone | What it means | Action |
|---|---|---|---|
| L0 file count | > 20 files | Compaction is falling behind ingest | Increase compaction threads or reduce write rate |
| Pending compaction bytes | > 50 GB | Write stall is incoming | Urgent: add compaction bandwidth |
| Write stall duration | > 0 ms | Already too late — writes are blocked | Emergency: reduce ingest, add resources |
| Bloom filter false positive rate | > 5% | Filter too small or bits/key too low | Increase bits_per_key in filter policy |
| Block cache hit rate | < 90% | Working set doesn't fit in cache | Increase block cache size |
| SSTable count at L0 | > slowdown_trigger | About to hit write slowdown | Monitor and pre-empt with compaction tuning |
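If your engine is RocksDB, most of these gauges are one GetProperty call away — a fragment in the spirit of the config block below (property names are correct for recent RocksDB releases, but verify against your version's headers):
// Polling the stall-related gauges from the table above.
std::string l0_files;
db->GetProperty("rocksdb.num-files-at-level0", &l0_files);               // L0 file count
uint64_t pending_bytes = 0;
db->GetIntProperty("rocksdb.estimate-pending-compaction-bytes", &pending_bytes);
uint64_t write_stopped = 0;
db->GetIntProperty("rocksdb.is-write-stopped", &write_stopped);          // 1 means writes are blocked right now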
A reasonable starting point for a write-heavy RocksDB workload — read the comments, then tune for your numbers:
// RocksDB starter config — write-heavy workload baseline
// All values are starting points. Profile before locking in.
Options options;
// MemTable — bigger = fewer flushes = less L0 pressure
options.write_buffer_size = 128 << 20; // 128 MB per memtable
options.max_write_buffer_number = 4; // up to 4 memtables in memory
options.min_write_buffer_number_to_merge = 2; // merge 2 memtables before flushing to L0
// Compaction — leveled, keep L0 tight
options.compaction_style = kCompactionStyleLevel;
options.level0_file_num_compaction_trigger = 4; // start compacting at 4 L0 files
options.level0_slowdown_writes_trigger = 12; // slow writes at 12 (give headroom)
options.level0_stop_writes_trigger = 20; // hard stop at 20
// Level sizes
options.max_bytes_for_level_base = 256 << 20; // L1 = 256 MB
options.max_bytes_for_level_multiplier = 10; // L2=2.5GB, L3=25GB, L4=250GB
// Bloom filter — Ribbon filter, 10 bits/key ≈ 1% FPR, ~30% less RAM than classic Bloom
BlockBasedTableOptions table_options;
table_options.filter_policy.reset(NewRibbonFilterPolicy(10));
table_options.cache_index_and_filter_blocks = true; // keep filters in block cache
// Block cache — size this to your working set (this is the right cache, not row_cache)
table_options.block_cache = NewLRUCache(2LL << 30); // 2 GB block cache
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
// Compaction throughput — don't let it starve or hog
options.max_background_compactions = 4;
options.max_background_flushes = 2;
options.rate_limiter.reset(NewGenericRateLimiter(200 << 20)); // 200 MB/s — bump to 500 MB/s+ on NVMe
// Compression — light on hot levels, heavy on cold
options.compression_per_level = {
kNoCompression, // L0 — hot, fast access
kNoCompression, // L1 — still warm
kLZ4Compression, // L2 — start compressing
kLZ4Compression, // L3
kZSTD, // L4+ — cold, compress hard
};
Operational realities (the stuff blogs don't mention)
- Compaction storms. When you ingest in bursts, compaction debt accumulates and discharges in spikes. Your p99.9 latency hates this. Mitigation: rate-limit ingest, increase compaction parallelism, use universal/TWCS for bursty workloads.
- Cold-cache reads. First query after a deploy or restart? Bloom filters, indexes, and data blocks aren't in RAM yet, so what should be a single block-cache hit becomes real disk I/O. Pin hot filter / index blocks in cache (most engines have an option for this), warm caches on startup, or accept it as a known shape of the first few seconds.
- Long-tail deletes. A tombstone written today might not be physically removed for hours or days. Plan capacity for "deleted but not gone."
- Bulk loads. Don't naïvely INSERT 100M rows into a fresh LSM — you'll generate 100M MemTable inserts, a stack of flushes, and a compaction nightmare. Most LSM engines expose a bulk-load / SSTable-ingest API that lets you write pre-sorted SSTables and link them directly into the tree, skipping the MemTable / WAL / compaction pipeline entirely (see the sketch after this list).
- Snapshotting. Because SSTables are immutable, snapshots are basically free — you just hard-link the live file set. That's why LSM-backed databases tend to have very cheap point-in-time snapshots. Restore is the harder question.
- Disk topology matters. NVMe vs SATA SSD vs spinning rust changes the math. NVMe makes write amp cheap. Spinning disk makes leveled compaction painful — universal is friendlier.
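The bulk-load escape hatch mentioned above, in RocksDB terms — a minimal sketch using SstFileWriter and IngestExternalFile. sorted_rows is a stand-in for your pre-sorted data, options and db are the same Options / DB* as in the config block earlier, and error handling is omitted:
// Bulk load: build a pre-sorted SSTable offline, then link it straight into the tree.
rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
writer.Open("/tmp/bulk_000.sst");                       // path is illustrative
for (const auto& [key, value] : sorted_rows)            // keys MUST arrive in sorted order
  writer.Put(key, value);
writer.Finish();                                        // seals the SSTable on disk
db->IngestExternalFile({"/tmp/bulk_000.sst"}, rocksdb::IngestExternalFileOptions());
// The file is linked into the LSM at an appropriate level — no MemTable, no WAL,
// no flush, no compaction storm.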
Closing take
LSM trees are the right answer for a specific kind of problem: write-dominated, append-shaped workloads on modern storage where sequential I/O is cheap and you can tolerate a more expensive read path. Telemetry pipelines, KV stores backing services, time-series, log ingestion, event stores — these are LSM's home turf. For those workloads, the write-amplification tax buys you significantly higher write throughput, SSD-friendly write patterns, and operational handles (snapshots, immutable files, predictable compaction) that B-trees simply don't offer.
For OLTP with mixed reads/writes on multi-TB cold data, B-trees still rule. For analytics, columnar engines (with LSM-inspired merge trees underneath, see ClickHouse) are even better. The mistake is picking LSM because it's "the cool one." The win is picking LSM because your workload matches its trade-offs, then tuning compaction strategy, bloom filters, and L0 thresholds like you mean it.
Read the docs, set monitoring on the three amplifications, and never — never — let compaction fall behind. That's the whole game.