The one line that matters
A primary key works well when the next value is always bigger than the last one. That's it. Every problem in this post comes from breaking that rule.
The database stores your table as a B-tree sorted by primary key. "Sorted" means rows are kept in order, physically, on disk. When the keys arrive in order (1, 2, 3, 4...), every new row goes to the end — easy. When the keys arrive in random order, every new row goes somewhere different — hard.
UUID v4 is the random case. AUTO_INCREMENT, Snowflake, and UUID v7 are the in-order case.
How the database stores your rows
A B-tree is a tree of pages. Each page is 16 KB (the MySQL/InnoDB default). The leaf pages at the bottom hold your actual rows, sorted by primary key. With rows of ~250 bytes, about 60 fit per page, so a table with 10 million rows needs roughly 165,000 leaf pages.
When you run INSERT INTO orders VALUES (...), the database has to find which leaf page the new row belongs in, then write it there. Two steps dominate the cost of every insert: (1) finding the leaf page, and (2) fitting the row into it.
With a sequential key, step 1 always lands on the same last page (already in memory, warm). Step 2 always has room until the page fills, and then the database just opens a fresh empty page — no data moves. With a random key, step 1 lands on a random page (usually not in memory), and step 2 usually finds the page already full (splits happen). That's the whole story.
What a page split actually costs
Here is what the database does when it has to split a full page P: allocate a new page Q, copy the upper half of P's rows into Q, and insert a new routing key into the parent page (which may itself split). Both P and Q end up about half full, and three pages were rewritten to insert one row. This is the operation you want to avoid.
There's a shortcut the database takes when the new key is the biggest one so far: it doesn't move any rows. It just opens a fresh empty page for the new row. P stays 100% full, Q holds one row, and nothing had to be copied. This is why sequential inserts are so cheap.
This shortcut only works when the new key is bigger than everything already on the rightmost leaf. Random inserts never take it — they land in the middle of the index and force the mid-split.
Sequential vs random inserts, side by side
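The contrast is easy to reproduce with a toy model of the leaf level: fixed-capacity pages, the rightmost-append shortcut for in-order keys, and mid-splits for everything else. The numbers here (60-row capacity, 100,000 keys) are illustrative, not InnoDB's actual code:

```python
import bisect
import random

CAP = 60  # rows per page, matching the ~250-byte-row example above

def insert_all(keys, cap=CAP):
    """Toy leaf level of a B-tree. Returns (mid_splits, fill_factor):
    mid_splits counts only the splits that had to copy half a page."""
    pages = [[]]                 # each page is a sorted list of keys
    firsts = [float("-inf")]     # routing key for each page
    mid_splits = 0
    for k in keys:
        i = bisect.bisect_right(firsts, k) - 1   # step 1: find the leaf page
        page = pages[i]
        bisect.insort(page, k)                   # step 2: fit the row in
        if len(page) > cap:
            if i == len(pages) - 1 and page[-1] == k:
                # Rightmost-append shortcut: open a fresh page, copy nothing.
                pages.append([page.pop()])
                firsts.append(k)
            else:
                # Mid-split: copy the upper half into a new page.
                mid = len(page) // 2
                new_page = page[mid:]
                del page[mid:]
                pages.insert(i + 1, new_page)
                firsts.insert(i + 1, new_page[0])
                mid_splits += 1
    fill = sum(len(p) for p in pages) / (cap * len(pages))
    return mid_splits, fill

n = 100_000
seq_splits, seq_fill = insert_all(range(1, n + 1))
rnd_splits, rnd_fill = insert_all(random.sample(range(10 * n), n))
print(f"sequential: {seq_splits} copying splits, {seq_fill:.0%} full")
print(f"random:     {rnd_splits} copying splits, {rnd_fill:.0%} full")
```

Run it and the two regimes should fall out directly: zero copying splits and near-full pages for sequential keys, thousands of copying splits and roughly ln 2 fill for random ones.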
Why random keys bloat the index
After a random insert splits a page, the two resulting pages are about half full. More random inserts refill them over time — but new random inserts into other pages create more half-full pages. The math settles at an average fill of about 69% (ln 2, a classic B-tree result). In real production systems, with variable row sizes and deletes, tables sit at 50–60% fill.
In plain numbers: a table that would be 6 GB with an ordered key becomes 10–13 GB with a random key. Same rows. More disk. More memory needed to keep it warm. More bytes to back up every night.
Why random keys break the buffer pool
The buffer pool is the database's in-memory cache of pages. Inserts stay fast when the page you're writing to is already cached. A miss means reading 16 KB from disk — tens of microseconds on NVMe, several milliseconds on cloud block storage. On a busy system, miss rates above 10% mean every write is waiting on disk.
This is why teams see the "it was fine at 8 million rows, broken at 12 million" behavior. Nothing changed in the code. The index just crossed the line where it no longer fit in memory.
Why secondary indexes suffer too
A secondary index stores, for each row, the indexed columns plus a copy of the primary key. The primary key is the pointer back to the full row. So the PK's size and randomness hit every secondary index too.
Ten million rows, five secondary indexes, UUID PK stored as BINARY(16) instead of BIGINT: 400 MB of extra primary-key copies, compared with a BIGINT key, before any fragmentation. With CHAR(36) it's 1.4 GB. This is disk you're paying for that holds no actual data.
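The arithmetic behind those figures is just row count times index count times key width, measured against the 8-byte BIGINT baseline:

```python
rows, indexes = 10_000_000, 5

for name, key_bytes in (("BIGINT", 8), ("BINARY(16)", 16), ("CHAR(36)", 36)):
    copies = rows * indexes * key_bytes        # PK bytes copied into secondary indexes
    extra = rows * indexes * (key_bytes - 8)   # overhead relative to a BIGINT key
    print(f"{name:10} {copies / 1e6:7.0f} MB of key copies  (+{extra / 1e6:5.0f} MB vs BIGINT)")
```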
Now — the three IDs
Same B-tree, same engine, three different stories. For each: what it is, why people pick it, what breaks.
UUID v4 — the random one
What it is. A 128-bit identifier where almost every bit is random. Two UUIDs made a microsecond apart look completely unrelated.
Why people pick it. The library is everywhere. crypto.randomUUID() in Node, uuid.uuid4() in Python. No coordination needed — any process on any machine can generate one and it won't collide. It hides how many records you have from outsiders. It looks professional.
What breaks. Every B-tree property you just read about:
- Every insert lands on a random leaf page. Most of the time that page isn't in memory → disk read.
- Random pages are often full → page split.
- Splits leave the index 50–60% full. Storage nearly doubles.
- Eventually the index doesn't fit in the buffer pool. Writes start waiting on disk.
- The 16-byte PK multiplies into every secondary index on the table.
- Stored as CHAR(36), all of the above doubles.
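The scattering is visible without a database: generate v4 IDs back to back and compare creation order with primary-key (sorted) order. A quick stdlib check:

```python
import uuid

# 200 v4 IDs in creation order vs the order the index would store them in.
ids = [uuid.uuid4() for _ in range(200)]
in_place = sum(1 for a, b in zip(ids, sorted(ids)) if a == b)
print(in_place, "of 200 IDs stayed at their creation position")
```

Almost none stay put — which is exactly what "every insert lands on a random leaf page" means.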
When it's fine. Small tables (under a few million rows). Tables you rarely insert into. As a non-PK unique column alongside a BIGINT PK — the BIGINT stays as the clustered key, the UUID is just a secondary unique lookup.
Snowflake ID — the compact one
What it is. A 64-bit integer built from three parts: the current millisecond, which machine made it, and a counter that resets every millisecond.
Why people pick it. It's small (8 bytes — same as a BIGINT). The timestamp is in the high bits, so IDs made later are always numerically larger than IDs made earlier — every insert goes to the last page. Each machine generates its own IDs without talking to any central service. Twitter invented it for exactly this reason.
What breaks. Three operational things:
- Clock going backward. If your server's clock jumps back (NTP correction, VM pause, hypervisor glitch), the next ID has a smaller timestamp than the last. Order breaks, duplicates possible. The generator must notice this and block until the clock catches up. Never silently continue.
- Machine ID management. 1024 machines total. If two pods think they're both machine 42, they can mint the same ID in the same millisecond. Use a short-lived lease from etcd or ZooKeeper; pods must stop generating if they lose the lease.
- JavaScript can't hold it. JS Number is a float. Safe integers only go up to 2⁵³. Snowflake IDs are bigger than that. Returned as a JSON number, the last few digits get silently rounded on the client. Fix: always send Snowflake IDs as strings in JSON. Twitter's API returns both id and id_str for exactly this reason.
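A minimal single-process generator sketch shows the bit layout, the clock-backward guard, and an ID that already exceeds 2⁵³. The epoch constant and 41/10/12 bit split follow Twitter's original convention; a production generator would also need the machine-ID lease described above:

```python
import threading
import time

EPOCH_MS = 1288834974657  # Twitter's custom epoch (2010-11-04); any fixed past epoch works

class Snowflake:
    """41-bit ms timestamp | 10-bit machine ID | 12-bit per-ms sequence."""

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024, "only 1024 machine IDs exist"
        self.machine_id = machine_id
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def _now_ms(self) -> int:
        return int(time.time() * 1000) - EPOCH_MS

    def next_id(self) -> int:
        with self.lock:
            now = self._now_ms()
            while now < self.last_ms:
                # Clock went backward: block until it catches up.
                # Never continue silently; that risks duplicate IDs.
                time.sleep(0.001)
                now = self._now_ms()
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF
                if self.seq == 0:
                    # 4096 IDs already minted this millisecond: spin to the next one
                    while now <= self.last_ms:
                        now = self._now_ms()
            else:
                self.seq = 0
            self.last_ms = now
            return (now << 22) | (self.machine_id << 12) | self.seq

gen = Snowflake(machine_id=42)
a, b = gen.next_id(), gen.next_id()
print(b > a, a > 2**53)  # monotonic per generator, and too big for a JS float
```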
When to pick it. Distributed systems where storage and index size matter. High-write tables. Anywhere you'd want AUTO_INCREMENT but have more than one primary.
UUID v7 — the best of both
What it is. A UUID that carries a timestamp in the high bits. Standardized in RFC 9562, May 2024.
Why people pick it. It fixes UUID v4's B-tree problem without leaving the UUID format. Every UUID library, API, database column type, and URL you already have keeps working. Strings sort in time order. Every new ID lands on the rightmost leaf. No machine-ID management, no coordination — the 74 random tail bits are enough to avoid collisions anywhere.
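The construction is simple enough to sketch from the RFC 9562 layout: 48 bits of Unix milliseconds, then version and variant bits forced over random bytes. This is illustrative; prefer your library's or database's native implementation where one exists:

```python
import os
import time
import uuid

def uuidv7() -> uuid.UUID:
    """Sketch per RFC 9562: 48-bit Unix-ms timestamp up front, then
    version and variant bits forced over 74 random bits."""
    ms = int(time.time() * 1000)
    b = bytearray(ms.to_bytes(6, "big") + os.urandom(10))
    b[6] = (b[6] & 0x0F) | 0x70  # version nibble -> 7
    b[8] = (b[8] & 0x3F) | 0x80  # RFC variant bits -> 10xxxxxx
    return uuid.UUID(bytes=bytes(b))

ids = []
for _ in range(3):
    ids.append(uuidv7())
    time.sleep(0.002)  # distinct milliseconds, so sort order is deterministic

print(ids[0], ids[0].version)  # a normal-looking UUID that reports version 7
print(sorted(ids) == ids)      # byte order follows creation order
```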
What breaks. Much less than the alternatives:
- Not strictly in order across generators. Two servers generating IDs in the exact same millisecond produce IDs whose high 48 bits match but whose tails are random. Within that one-millisecond window the order is random. The damage is bounded to one page per ms — nothing like the full-index chaos of v4.
- Still 16 bytes. Twice the size of a Snowflake BIGINT. Every secondary index pays this.
- Support varies. Postgres 18 has built-in uuidv7(). Older Postgres needs an extension. MySQL has no built-in — generate in the app, store as BINARY(16).
When to pick it. Any time you were about to pick UUID v4. Same format, same tooling, without the index damage.
AUTO_INCREMENT — still the fastest when you can use it
What it is. A counter inside the database. Each insert asks for the next number.
Why people pick it. Smallest possible ID. Always in order. Always the right choice on a single-node database.
What breaks. Only when you leave single-node:
- The counter is a single lock-protected value inside one server — every insert goes through it. Past tens of thousands of inserts per second, contention shows up.
- Sharding leaks your scale. Shard 1 holds IDs 1–10M, shard 2 holds 10M–20M — customers can see which shard they're on and roughly how big you are.
- Multi-master blows up. Two primaries issuing the same integer will hit a duplicate-key during replication.
When to pick it. Single-node databases. The moment a second writer joins, switch to Snowflake or UUID v7.
Benchmark — MySQL 8.0, 10 million rows
Single primary, c6i.2xlarge, 16 GB buffer pool. InnoDB, orders table with five secondary indexes, ~250 byte rows. 200 inserts/sec. Measured after the table crossed 10 million rows.
| Metric | AUTO_INC | Snowflake | UUID v7 BIN(16) | UUID v4 BIN(16) | UUID v4 CHAR(36) |
|---|---|---|---|---|---|
| Insert p50 | 0.9 ms | 1.0 ms | 1.3 ms | 38 ms | 62 ms |
| Insert p99 | 3 ms | 4 ms | 6 ms | 140 ms | 210 ms |
| Page splits per second | ~0 | ~0 | ~2 | 480 | 520 |
| Buffer pool hit rate | 99.9% | 99.9% | 99.6% | 86% | 81% |
| Total table size | 5.8 GB | 6.0 GB | 8.4 GB | 13.1 GB | 22 GB |
| SELECT by PK p99 | 0.4 ms | 0.4 ms | 0.5 ms | 1.1 ms | 1.4 ms |
Read the "page splits per second" row first — UUID v4 splits pages 500 times every second; Snowflake and AUTO_INCREMENT almost never split. Every other number follows from that. Read the CHAR(36) column second: storing UUIDs as strings instead of raw bytes doubles the damage. If you have a CHAR(36) or VARCHAR(36) UUID column in production, change it to BINARY(16) or Postgres uuid. One-line fix, big payoff.
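The application side of that fix, using Python's stdlib uuid as an example of what each column type stores:

```python
import uuid

u = uuid.uuid4()
text = str(u)    # what a CHAR(36) column stores
raw = u.bytes    # what a BINARY(16) column stores
print(len(text), "bytes as text,", len(raw), "bytes as binary")
print(uuid.UUID(bytes=raw) == u)  # the round-trip is lossless
```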
Postgres is a bit different
Postgres stores rows in a heap (not a clustered index). The primary key is a separate B-tree pointing into the heap. That B-tree still splits pages under random inserts, still fragments, still bloats the WAL. Every secondary index is also a B-tree. Same story.
Two Postgres-specific things. Secondary indexes in Postgres don't store the PK — they store the row's physical location (TID), which is mostly ordered, so secondary index fragmentation is lighter than InnoDB's. And VACUUM has to clean up dead index entries; on a UUID-v4 PK table, autovacuum does more random I/O per cycle because the dead entries are scattered across the index.
Postgres 18 has native uuidv7(). Before 18, use the pg_uuidv7 extension. Store UUIDs in a uuid column, never text.
Migrating a live table off UUID v4
You cannot just change the PK with ALTER TABLE on a big table. It rebuilds the clustered index and every secondary index — hours of I/O, replica drift, long locks. Use a dual-write migration: add the new key column, write both keys on every insert, backfill old rows in batches, move reads over, and only then swap the primary key.
The parts that actually bite: child tables (every FK column needs a parallel one, dual-written), replica lag during backfill (watch it, pause when it's too high), and anything outside your service that hardcoded a UUID — caches, analytics, partner APIs.
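A sketch of the dual-write and backfill phases, with hypothetical names (orders, next_new_id, insert_order) standing in for the real table and ID generator:

```python
import itertools
import uuid

orders = {}                                  # hypothetical stand-in for the table
next_new_id = itertools.count(1).__next__    # stand-in for Snowflake / UUID v7

def insert_order(row):
    """Dual-write phase: every new row gets BOTH keys from day one."""
    row["id_v4"] = str(uuid.uuid4())   # old key, still the primary key
    row["id_new"] = next_new_id()      # new key; reads migrate to it later
    orders[row["id_v4"]] = row
    return row

def backfill_batch(batch_size=1000):
    """Backfill phase: fill id_new on old rows in small batches,
    pausing between batches to keep replica lag bounded."""
    stale = [r for r in orders.values() if "id_new" not in r][:batch_size]
    for r in stale:
        r["id_new"] = next_new_id()
    return len(stale)

r = insert_order({"item": "book"})
print("id_v4" in r and "id_new" in r)  # both keys written on the first insert
```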
How to pick
The one-sentence summary
UUID v4 hides information from your customers at the cost of breaking your database. Snowflake and UUID v7 hide the same information without breaking anything. Pick one of the second two unless you have a specific reason not to.