UUID v4 vs UUID v7 vs Snowflake

The primary key decides what the database does on every insert.

UUID v4 — fully random, 128 bits: ~500 page splits/sec, every insert hits a random page, the index bloats to ~2× the data size, and writes miss the buffer pool cache.

Snowflake / UUID v7 — timestamp in the high bits: ~0 page splits/sec, every insert hits the last page, the index stays dense (≈100% fill), and one hot page sits in the buffer pool.

The result: 40× faster inserts, 50% less disk, 99.9% cache hit rate.

The one line that matters

A primary key works well when the next value is always bigger than the last one. That's it. Every problem in this post comes from breaking that rule.

The database stores your table as a B-tree sorted by primary key. "Sorted" means rows are kept in order, physically, on disk. When the keys arrive in order (1, 2, 3, 4...), every new row goes to the end — easy. When the keys arrive in random order, every new row goes somewhere different — hard.

UUID v4 is the random case. AUTO_INCREMENT, Snowflake, and UUID v7 are the in-order case.

How the database stores your rows

A B-tree is a tree of pages. Each page is 16 KB (the MySQL/InnoDB default). The leaf pages at the bottom hold your actual rows, sorted by primary key. At ~250 bytes per row, about 60 rows fit on a page, so a table with 10 million rows needs roughly 150,000 leaf pages.
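
The arithmetic checks out with a rough page-overhead allowance (the overhead constant below is an assumption; InnoDB reserves space for the page header, row directory, and free-space thresholds):

```python
PAGE_SIZE = 16 * 1024        # InnoDB default page size
ROW_SIZE = 250               # example row size from the text
OVERHEAD = 1200              # assumed per-page header/directory/reserved bytes

rows_per_page = (PAGE_SIZE - OVERHEAD) // ROW_SIZE
leaf_pages = 10_000_000 // rows_per_page

print(rows_per_page)   # 60
print(leaf_pages)      # 166666 — same order of magnitude as the ~150,000 above
```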

[Diagram: B-tree structure — rows live in the leaves. A root page holds pointers to branch pages; branches point to leaf pages (leaf L1: rows 1..60, 100% full; leaf L2: rows 61..120; leaf L3: rows 121..180; ...). New rows go to the rightmost, still-filling leaf. 10M rows ≈ 150,000 leaf pages; each leaf = 16 KB; rows are stored sorted by PK.]

When you run INSERT INTO orders VALUES (...), the database has to find which leaf page this new row belongs in, then write it there. Here are the six steps it runs every time:

Six steps in every INSERT:

1. Walk down the tree to find the right leaf page (root → branch → leaf). Fast if the pages are in memory; slow if they have to be read from disk.
2. Check whether the row fits on that page. Yes → drop it in, done. No → page split (expensive, see below).
3. Write a redo-log entry, so the insert survives a crash. More pages changed = more redo bytes.
4. Write an undo-log entry, so the insert can be rolled back and other transactions see the right snapshot.
5. Insert a pointer into every secondary index. Each secondary index is its own B-tree; each one gets walked, checked for room, maybe split.
6. Mark the modified pages dirty. They'll be flushed to disk later. More dirty pages = more background I/O pressure.

The PK choice decides which of steps 1, 2, and 5 are cheap — and which are a disaster.

With a sequential key, step 1 always lands on the same last page (already in memory, warm). Step 2 always has room until the page fills, and then the database just opens a fresh empty page — no data moves. With a random key, step 1 lands on a random page (usually not in memory), and step 2 usually finds the page already full (splits happen). That's the whole story.

What a page split actually costs

Here is what the database does when it has to split a page. This is the operation you want to avoid.

Page split — before and after.

Before: page P is 100% full (60 rows). The new row's key falls between rows 30 and 31, so it belongs on P — but P has no room. Split triggered.

After: half of P's rows move to a new page Q. P is left ~50% full (rows 1..30); Q is ~50% full (rows 31..60 plus the new row).

Plus all this behind the scenes:

  • Two page writes (P and Q)
  • The parent node gets a new entry pointing at Q
  • If the parent is full, the parent splits too — this can cascade up log₂(N) levels
  • Redo log entries for every page that changed
  • Fragmentation is permanent until the index is rebuilt

There's a shortcut the database takes when the new key is the biggest one so far: it doesn't move any rows. It just opens a fresh empty page for the new row. P stays 100% full, Q holds one row, and nothing had to be copied. This is why sequential inserts are so cheap.

This shortcut only works when the new key is bigger than everything already on the rightmost leaf. Random inserts never take it — they land in the middle of the index and force the mid-split.

Sequential vs random inserts, side by side

Where new rows land:

Sequential PK (Snowflake, v7, AUTO_INCREMENT) — earlier leaves are full; only the rightmost leaf is filling. All inserts go to that one page, it stays in the buffer pool, pages fill to ~100%, and there are no splits.

Random PK (UUID v4) — inserts scatter across all ~150,000 leaves. Every insert hits a different page, most pages aren't in memory → disk read, pages are often full → split. Long-run fill ≈ 69%, real-world 50–60%.

Why random keys bloat the index

After a random insert splits a page, the two resulting pages are about half full. More random inserts refill them over time — but new random inserts into other pages create more half-full pages. The math settles at about 69% fill (a known result, ln 2). In real production, with variable row sizes and deletes, tables sit at 50–60% fill.
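
A toy model reproduces the number. Treat the index as a list of fixed-capacity pages, insert uniformly random keys, and split any overfull page down the middle (a sketch only: the capacity and key space are made up, and the rightmost-leaf shortcut real engines use is omitted because random keys never take it):

```python
import bisect
import random

CAP = 60                      # keys per page, as in the earlier example
pages = [[]]                  # each "leaf" is a sorted list of keys
firsts = [0.0]                # first key of each page, used for routing

random.seed(1)
for _ in range(300_000):
    key = random.random()                        # stand-in for a UUID v4
    i = bisect.bisect_right(firsts, key) - 1     # leaf whose range holds key
    bisect.insort(pages[i], key)
    if len(pages[i]) > CAP:                      # page full: split down the middle
        mid = len(pages[i]) // 2
        new = pages[i][mid:]
        del pages[i][mid:]
        pages.insert(i + 1, new)
        firsts.insert(i + 1, new[0])

fill = sum(len(p) for p in pages) / (len(pages) * CAP)
print(round(fill, 2))         # ~0.69, the ln 2 steady state
```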

In plain numbers: a table that would be 6 GB with an ordered key becomes 10–13 GB with a random key. Same rows. More disk. More memory needed to keep it warm. More bytes to back up every night.

Why random keys break the buffer pool

The buffer pool is the database's in-memory cache of pages. Inserts stay fast when the page you're writing to is already cached. A miss means reading 16 KB from disk — tens of microseconds on NVMe, several milliseconds on cloud block storage. On a busy system, miss rates above 10% mean every write is waiting on disk.

What you need in the buffer pool to stay fast:

Sequential PK — the hot working set is just a few pages (~160 KB). It fits in any buffer pool, the hit rate stays ~100%, and there is no cliff: growth is safe.

Random PK — the hot working set is the entire index (~9–13 GB at 10M rows). It fits until the index outgrows RAM; then every insert waits on disk. The cliff hits when the index size crosses the buffer pool size.

This is why teams see the "it was fine at 8 million rows, broken at 12 million" behavior. Nothing changed in the code. The index just crossed the line where it no longer fit in memory.

Why secondary indexes suffer too

A secondary index stores, for each row, the indexed columns plus a copy of the primary key. The primary key is the pointer back to the full row. So the PK's size and randomness hit every secondary index too.

Size of a secondary index entry:

  • BIGINT PK (AUTO_INCREMENT, Snowflake): indexed column + 8-byte PK ≈ 20 bytes per row
  • UUID PK stored as BINARY(16): indexed column + 16-byte PK ≈ 28 bytes per row (+40%)
  • UUID PK stored as CHAR(36) — avoid: indexed column + 36–40-byte string PK ≈ 50 bytes per row (+150%)

Ten million rows, five secondary indexes, UUID PK as BINARY(16) instead of BIGINT: 400 MB of pure primary-key duplication before any fragmentation. With CHAR(36) it's 1.4 GB. This is disk you're paying for that holds no actual data.
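
Those totals are just the per-entry deltas from the layout above multiplied out (10M rows and five indexes are the article's example; the byte sizes compare each UUID encoding to an 8-byte BIGINT):

```python
rows = 10_000_000
secondary_indexes = 5

extra_bin16 = (16 - 8) * rows * secondary_indexes    # +8 bytes per entry
extra_char36 = (36 - 8) * rows * secondary_indexes   # +28 bytes per entry

print(extra_bin16 / 1e6)    # 400.0 — MB of duplicated key bytes
print(extra_char36 / 1e9)   # 1.4  — GB with CHAR(36)
```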

Now — the three IDs

Same B-tree, same engine, three different stories. For each: what it is, why people pick it, what breaks.

UUID v4 — the random one

UUID v4 — 128 bits: 122 random bits plus 6 version/variant bits. No ordering — the sort order is meaningless, and every insert lands on a random page.

What it is. A 128-bit identifier where almost every bit is random. Two UUIDs made a microsecond apart look completely unrelated.

Why people pick it. The library is everywhere. crypto.randomUUID() in Node, uuid.uuid4() in Python. No coordination needed — any process on any machine can generate one and it won't collide. It hides how many records you have from outsiders. It looks professional.

What breaks. Every B-tree property you just read about:

  • Every insert lands on a random leaf page. Most of the time that page isn't in memory → disk read.
  • Random pages are often full → page split.
  • Splits leave the index 50–60% full. Storage nearly doubles.
  • Eventually the index doesn't fit in the buffer pool. Writes start waiting on disk.
  • The 16-byte PK multiplies into every secondary index on the table.
  • Stored as CHAR(36), all of the above doubles.

When it's fine. Small tables (under a few million rows). Tables you rarely insert into. As a non-PK unique column alongside a BIGINT PK — the BIGINT stays as the clustered key, the UUID is just a secondary unique lookup.

Snowflake ID — the compact one

Snowflake — 64 bits, fits a BIGINT: 1 sign bit, 41 bits of timestamp (ms since a custom epoch, ≈69 years of range), 10 bits of machine ID (1024 machines), 12 bits of sequence (4096 IDs per millisecond per node).

What it is. A 64-bit integer built from three parts: the current millisecond, which machine made it, and a counter that resets every millisecond.

Why people pick it. It's small (8 bytes — same as a BIGINT). The timestamp is in the high bits, so IDs made later are always numerically larger than IDs made earlier — every insert goes to the last page. Each machine generates its own IDs without talking to any central service. Twitter invented it for exactly this reason.
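
A minimal single-threaded generator sketch. The epoch is an assumption (Twitter's original custom epoch); a production version also needs a lock around next_id and the machine-ID leasing described below:

```python
import time

EPOCH_MS = 1_288_834_974_657          # assumed custom epoch (Twitter's, 2010-11-04)

class Snowflake:
    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024  # must fit in 10 bits
        self.machine_id = machine_id
        self.last_ms = -1
        self.seq = 0

    def next_id(self) -> int:
        now = int(time.time() * 1000)
        while now < self.last_ms:      # clock went backward: block until it
            time.sleep(0.001)          # catches up, never silently continue
            now = int(time.time() * 1000)
        if now == self.last_ms:
            self.seq = (self.seq + 1) & 0xFFF   # 12-bit counter within the ms
            if self.seq == 0:                   # 4096 IDs used: spin to next ms
                while now <= self.last_ms:
                    now = int(time.time() * 1000)
        else:
            self.seq = 0
        self.last_ms = now
        # timestamp | machine | sequence, high bits to low
        return ((now - EPOCH_MS) << 22) | (self.machine_id << 12) | self.seq

gen = Snowflake(machine_id=42)
a, b = gen.next_id(), gen.next_id()
assert b > a              # later IDs are always numerically larger
assert a < 2**63          # fits a signed BIGINT
```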

What breaks. Three operational things:

  • Clock going backward. If your server's clock jumps back (NTP correction, VM pause, hypervisor glitch), the next ID has a smaller timestamp than the last. Order breaks, duplicates possible. The generator must notice this and block until the clock catches up. Never silently continue.
  • Machine ID management. 1024 machines total. If two pods think they're both machine 42, they can mint the same ID in the same millisecond. Use a short-lived lease from etcd or ZooKeeper; pods must stop generating if they lose the lease.
  • JavaScript can't hold it. JS Number is a float. Safe integers only go up to 2⁵³. Snowflake IDs are bigger than that. Returned as a JSON number, the last few digits get silently rounded on the client. Fix: always send Snowflake IDs as strings in JSON. Twitter's API returns both id and id_str for exactly this reason.
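
The rounding is easy to demonstrate from Python, since a JS Number is the same IEEE-754 double as a Python float (the ID below is a made-up 64-bit value, not a real Snowflake):

```python
snowflake_id = (1 << 62) + 1              # hypothetical ID, far above 2**53

assert float(2**53) == 2**53              # largest exactly-safe integer
assert float(2**53 + 1) == float(2**53)   # one past it already rounds

# What a JS client sees if the ID travels as a JSON number:
received = int(float(snowflake_id))
assert received != snowflake_id           # low bits silently rounded away
assert str(snowflake_id) == "4611686018427387905"   # the string survives intact
```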

When to pick it. Distributed systems where storage and index size matter. High-write tables. Anywhere you'd want AUTO_INCREMENT but have more than one primary.

UUID v7 — the best of both

UUID v7 — 128 bits, same wire format as v4: 48 bits of Unix-millisecond timestamp, 4 version bits, 12 rand/seq bits, 2 variant bits, 62 random bits. Timestamp in the high bits → sort order = time order; the random tail avoids collisions.

What it is. A UUID that carries a timestamp in the high bits. Standardized in RFC 9562, May 2024.

Why people pick it. It fixes UUID v4's B-tree problem without leaving the UUID format. Every UUID library, API, database column type, and URL you already have keeps working. Strings sort in time order. Every new ID lands on the rightmost leaf. No machine-ID management, no coordination — the 74 random tail bits are enough to avoid collisions anywhere.
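
The layout is simple enough to generate by hand where no library exists (a sketch following RFC 9562's field order; here both rand_a and rand_b are filled from os.urandom):

```python
import os
import time
import uuid

def uuidv7() -> uuid.UUID:
    ms = int(time.time() * 1000)
    rand = int.from_bytes(os.urandom(10), "big")   # 80 random bits; 74 used
    value = (ms & ((1 << 48) - 1)) << 80           # 48-bit timestamp, high bits
    value |= 0x7 << 76                             # 4-bit version = 7
    value |= ((rand >> 62) & 0xFFF) << 64          # 12-bit rand_a
    value |= 0b10 << 62                            # 2-bit variant = RFC 4122
    value |= rand & ((1 << 62) - 1)                # 62-bit rand_b
    return uuid.UUID(int=value)

a = uuidv7()
time.sleep(0.002)          # cross a millisecond boundary
b = uuidv7()
assert a.version == 7
assert str(a) < str(b)     # string sort order follows time order across ms
```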

What breaks. Much less than the alternatives:

  • Not strictly in order across generators. Two servers generating IDs in the exact same millisecond produce IDs whose high 48 bits match but whose tails are random. Within that one-millisecond window the order is random. The damage is bounded to one page per ms — nothing like the full-index chaos of v4.
  • Still 16 bytes. Twice the size of a Snowflake BIGINT. Every secondary index pays this.
  • Support varies. Postgres 18 has built-in uuidv7(). Older Postgres needs an extension. MySQL has no built-in — generate in the app, store as BINARY(16).

When to pick it. Any time you were about to pick UUID v4. Same format, same tooling, without the index damage.

AUTO_INCREMENT — still the fastest when you can use it

What it is. A counter inside the database. Each insert asks for the next number.

Why people pick it. Smallest possible ID. Always in order. Always the right choice on a single-node database.

What breaks. Only when you leave single-node:

  • The counter lives in one place — every insert serializes through it. Past tens of thousands of inserts per second, contention shows up.
  • Sharding leaks your scale. Shard 1 holds IDs 1–10M, shard 2 holds 10M–20M — customers can see which shard they're on and roughly how big you are.
  • Multi-master blows up. Two primaries issuing the same integer will hit a duplicate-key during replication.

When to pick it. Single-node databases. The moment a second writer joins, switch to Snowflake or UUID v7.

Benchmark — MySQL 8.0, 10 million rows

Single primary, c6i.2xlarge, 16 GB buffer pool. InnoDB, orders table with five secondary indexes, ~250 byte rows. 200 inserts/sec. Measured after the table crossed 10 million rows.

Insert p99 latency, 10M-row orders table: AUTO_INC 3 ms · Snowflake 4 ms · UUID v7 BINARY(16) 6 ms · UUID v4 BINARY(16) 140 ms · UUID v4 CHAR(36) 210 ms. UUID v4 is a different regime — roughly 40× slower, 500 page splits/sec, 2× the disk.
| Metric                 | AUTO_INC | Snowflake | UUID v7 BIN(16) | UUID v4 BIN(16) | UUID v4 CHAR(36) |
|------------------------|----------|-----------|-----------------|-----------------|------------------|
| Insert p50             | 0.9 ms   | 1.0 ms    | 1.3 ms          | 38 ms           | 62 ms            |
| Insert p99             | 3 ms     | 4 ms      | 6 ms            | 140 ms          | 210 ms           |
| Page splits per second | ~0       | ~0        | ~2              | 480             | 520              |
| Buffer pool hit rate   | 99.9%    | 99.9%     | 99.6%           | 86%             | 81%              |
| Total table size       | 5.8 GB   | 6.0 GB    | 8.4 GB          | 13.1 GB         | 22 GB            |
| SELECT by PK p99       | 0.4 ms   | 0.4 ms    | 0.5 ms          | 1.1 ms          | 1.4 ms           |

Read the "page splits per second" row first — UUID v4 splits pages 500 times every second; Snowflake and AUTO_INCREMENT almost never split. Every other number follows from that. Read the CHAR(36) column second: storing UUIDs as strings instead of raw bytes doubles the damage. If you have a CHAR(36) or VARCHAR(36) UUID column in production, change it to BINARY(16) or Postgres uuid. One-line fix, big payoff.
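
The string and binary encodings carry the same 128 bits, so the column change is purely a representation change. Python's uuid module shows the round-trip an application does against a BINARY(16) column (the literal UUID below is made up):

```python
import uuid

s = "0190d2f3-8a5b-7cc0-9f3e-1b2a3c4d5e6f"   # hypothetical v7 as CHAR(36)
u = uuid.UUID(s)

raw = u.bytes                                # 16 bytes for a BINARY(16) column
assert len(s) == 36 and len(raw) == 16
assert uuid.UUID(bytes=raw) == u             # round-trips losslessly

# MySQL can do the same conversion server-side (8.0+):
#   INSERT ... VALUES (UUID_TO_BIN('0190d2f3-8a5b-7cc0-9f3e-1b2a3c4d5e6f'))
#   SELECT BIN_TO_UUID(id) FROM orders
```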

Postgres is a bit different

Postgres stores rows in a heap (not a clustered index). The primary key is a separate B-tree pointing into the heap. That B-tree still splits pages under random inserts, still fragments, still bloats the WAL. Every secondary index is also a B-tree. Same story.

Two Postgres-specific things. Secondary indexes in Postgres don't store the PK — they store the row's physical location (TID), which is mostly ordered, so secondary index fragmentation is lighter than InnoDB's. And VACUUM has to clean up dead index entries; on a UUID-v4 PK table, autovacuum does more random I/O per cycle because the dead entries are scattered across the index.

Postgres 18 has native uuidv7(). Before 18, use the pg_uuidv7 extension. Store UUIDs in a uuid column, never text.

Migrating a live table off UUID v4

You cannot just change the PK with ALTER TABLE on a big table. It rebuilds the clustered index and every secondary index — hours of I/O, replica drift, long locks. Use a dual-write migration:

DUAL-WRITE MIGRATION

1. Add a column: new_id BIGINT NULL.
2. Deploy code that writes BOTH uuid AND new_id on every insert.
3. Backfill old rows in small batches by created_at range:
   UPDATE ... SET new_id = gen_snowflake() WHERE created_at BETWEEN ? AND ? AND new_id IS NULL LIMIT 10000
   Rate-limit: 10k rows every 200 ms. Pause if replica lag > 2 s.
4. Verify: COUNT(*) WHERE new_id IS NULL returns 0.
5. Add a UNIQUE INDEX on new_id. Repeat steps 1–4 for every child table whose foreign key references orders.uuid.
6. Swap the PK (pt-online-schema-change, or a blue/green table swap).
7. Drop the uuid column after a grace period (1–2 weeks).

The parts that actually bite: child tables (every FK column needs a parallel one, dual-written), replica lag during backfill (watch it, pause when it's too high), and anything outside your service that hardcoded a UUID — caches, analytics, partner APIs.
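
Step 3's loop can be sketched independently of any driver. The function below takes two callables so the batching and back-off logic stays testable: run_batch is assumed to execute the article's UPDATE and return the number of rows it touched, and replica_lag_s is a stand-in for however you measure lag (e.g. Seconds_Behind_Source on MySQL):

```python
import time

BATCH = 10_000        # rows per UPDATE ... LIMIT batch
PAUSE_S = 0.2         # sleep between batches: 10k rows every 200 ms
MAX_LAG_S = 2.0       # stop backfilling while replicas are this far behind

def backfill(run_batch, replica_lag_s, sleep=time.sleep):
    """Drive the batched backfill UPDATE until no unfilled rows remain."""
    total = 0
    while True:
        while replica_lag_s() > MAX_LAG_S:   # replicas behind: back off hard
            sleep(5)
        n = run_batch()                      # one UPDATE ... LIMIT 10000
        total += n
        if n == 0:                           # nothing left: backfill complete
            return total
        sleep(PAUSE_S)                       # rate-limit between batches
```

The sleep parameter exists only so the loop can be exercised without real waiting; a dry run with a fake batch function confirms it terminates once run_batch reports zero rows.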

How to pick

Decision tree:

  • Single-node database? → AUTO_INCREMENT BIGINT.
  • Distributed and you need the UUID format? → UUID v7.
  • Distributed and you don't? → Snowflake (BIGINT).
  • About to pick UUID v4? Pick v7 instead — same format, no B-tree damage.

The one-sentence summary

UUID v4 hides information from your customers at the cost of breaking your database. Snowflake and UUID v7 hide the same information without breaking anything. Pick one of the second two unless you have a specific reason not to.