Here's the conversation that happens too often. "The page is slow." "Add Redis." Six weeks later, the page is still slow, the Redis bill is real, and the actual bottleneck was a CPU cache miss on a 50-MB object you were memcpy'ing on every request. Caching is one of those words that means a dozen things and gets used as if it means one. This post walks the whole stack, from L1 cache to CDN, and tries to give a senior-engineer mental model of what each layer is actually for.
Why this matters: the latency budget tells the whole story
The numbers below are approximate, vary by hardware/network, and have been cribbed from a thousand "latency numbers every programmer should know" lists. Internalize the order of magnitude, not the exact figure.
| Operation | Approx. latency | What it tells you |
|---|---|---|
| CPU register / L1 cache hit | ~1 ns | Free. You can do a billion of these per second per core. |
| L2 cache hit | ~3–5 ns | Still essentially free. |
| L3 cache hit | ~10–30 ns | Cheap, but already 10–30× slower than L1. |
| Main memory (DRAM) | ~100 ns | 100× slower than L1. This is where cache-friendly code matters. |
| Branch misprediction | ~5–15 ns | Cheap in isolation; adds up fast in branch-heavy hot loops. |
| OS page cache hit | ~100 ns – 1 µs | Reading a "hot" file is just a memcpy from kernel RAM. |
| SSD random read | ~50–150 µs | ~1000× slower than DRAM. Every cache miss to disk hurts. |
| In-process map lookup (Java/Go) | ~50–500 ns | Closest you get to L1 for application data. |
| Same-DC network round-trip | ~0.5–1 ms | This is what Redis actually costs you. |
| Cross-region round-trip | ~30–100 ms | Don't put your cache here unless you mean it. |
| CDN edge → end user | ~5–30 ms | Hard to beat for static / cacheable content. |
Layer 1 — CPU caches (L1, L2, L3)
Every CPU has a tiny pyramid of caches sitting between the cores and main memory. Reads and writes fetch and flush in cache lines — usually 64 bytes — so when you load one byte of an array, you actually get the next 63 for free. This is why iterating an array of structs is dramatically faster than iterating a linked list of nodes scattered across the heap. Same logical operation; cache-friendliness is the difference.
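To make that concrete, here is a minimal Go sketch of the two layouts. The types and sizes are invented for illustration; the point is the memory access pattern, not the specific numbers.

```go
package main

import "fmt"

// Point values in a slice are laid out contiguously, so iterating them
// streams whole 64-byte cache lines of useful data and the hardware
// prefetcher can stay ahead of the loop.
type Point struct{ X, Y int64 }

// node holds the same data as a linked list; every element is a separate
// heap allocation, so traversal chases pointers and tends to take a
// cache miss per hop once the nodes are scattered.
type node struct {
	val  Point
	next *node
}

func sumSlice(ps []Point) int64 {
	var s int64
	for i := range ps {
		s += ps[i].X // sequential, cache-line-friendly access
	}
	return s
}

func sumList(head *node) int64 {
	var s int64
	for n := head; n != nil; n = n.next {
		s += n.val.X // pointer chase: each hop may be a fresh miss
	}
	return s
}

func main() {
	const n = 1 << 20
	ps := make([]Point, n)
	var head *node
	for i := 0; i < n; i++ {
		ps[i] = Point{X: int64(i)}
		head = &node{val: Point{X: int64(i)}, next: head}
	}
	fmt.Println(sumSlice(ps), sumList(head))
}
```

Benchmark the two with Go's testing.B on a list whose nodes were allocated out of order and the gap is usually several-fold. Same algorithm, different layout.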
The kind of bug that bites at this layer:
- False sharing. Two threads write to two different fields that happen to live in the same cache line. The cache-coherence protocol bounces the line between cores on every write. Throughput craters, no lock visible. Padding the struct fixes it (see the sketch after this list).
- Pointer-chasing. Linked structures scattered across RAM force a fresh cache miss for every node. A flat array with the same data is often 10× faster despite "the same algorithm."
- Working set bigger than L3. When your hot data exceeds the last-level cache, every iteration goes to DRAM. Sometimes the right cache fix is "make the data smaller," not "add a service."
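For the false-sharing case specifically, the fix really is just padding. A sketch in Go (the field sizes assume 64-byte cache lines, which is typical but not universal):

```go
package main

import (
	"sync"
	"sync/atomic"
)

// Without padding, adjacent counters in the slice share cache lines, so
// goroutines incrementing "their own" counter still bounce the line
// between cores. Padding each counter out to a full line removes the
// contention.
type paddedCounter struct {
	n atomic.Int64
	_ [56]byte // 8-byte payload + 56 bytes of padding = one 64-byte line
}

func bump(workers, iters int) []paddedCounter {
	counters := make([]paddedCounter, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				counters[w].n.Add(1) // each goroutine owns a whole cache line
			}
		}(w)
	}
	wg.Wait()
	return counters
}

func main() {
	_ = bump(4, 1_000_000)
}
```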
Layer 2 — TLB (Translation Lookaside Buffer)
Often forgotten. The TLB caches virtual-to-physical address translations. Every memory access goes through it. A TLB miss means walking the page table — which itself can miss in the data cache, which means a real DRAM read just to figure out where another DRAM read should go.
You don't tune the TLB directly, but you affect it. Workloads with huge working sets and small (4 KB) pages thrash the TLB. Huge pages (2 MB or 1 GB pages) drastically reduce TLB pressure for memory-heavy databases — which is why Postgres, Redis, and MongoDB all have docs about transparent huge pages. (Spoiler: Redis tells you to disable THP because of latency spikes, while Postgres can benefit. The right answer depends on the workload.)
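If you want to see what a given Linux box is doing, the active THP mode is exposed in sysfs. A small sketch (the path is the standard Linux one; interpreting the bracketed value is between you and your database's docs):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Linux reports the transparent-huge-page mode as a line like
// "always madvise [never]"; the bracketed word is the active setting.
const thpPath = "/sys/kernel/mm/transparent_hugepage/enabled"

func main() {
	b, err := os.ReadFile(thpPath)
	if err != nil {
		fmt.Println("could not read THP mode (non-Linux box?):", err)
		return
	}
	fmt.Println("transparent huge pages:", strings.TrimSpace(string(b)))
}
```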
Layer 3 — OS page cache
This is the layer most people forget exists. The kernel keeps recently-read file pages in RAM. When you read(fd, ...) a file, the kernel checks the page cache first; if the pages are there, it's just a memcpy from kernel memory to your buffer — no disk I/O.
This is why a server that's been up for hours has microsecond reads on the same file that took milliseconds right after boot — the second read is a memcpy, the first was actual disk I/O. It's also why Linux happily shows "all RAM used" on a healthy server: the kernel filled it with cached file pages, and will free them the moment any process needs more.
The page cache is invisible to your app — you don't allocate it, you don't manage it, you just benefit from it. But it shapes performance dramatically:
- Database files, log files, static assets — all benefit automatically.
- Memory-mapped (mmap) files give you a pointer that's backed by the page cache. Read it like memory; the kernel pages from disk on demand. (See the sketch after this list.)
- If you do O_DIRECT I/O, you bypass the page cache entirely. Sometimes that's what you want (databases manage their own buffer pool); usually it's a footgun.
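A minimal mmap sketch in Go, using golang.org/x/sys/unix. The file path is illustrative; the point is that the returned byte slice is just page-cache-backed memory: run it twice and the second pass is dramatically faster because the pages are already resident.

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Any readable, non-empty file works; the path is just an example.
	f, err := os.Open("/var/log/syslog")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	st, err := f.Stat()
	if err != nil {
		panic(err)
	}

	// The returned slice is backed by the OS page cache: cold pages are
	// faulted in from disk on first touch, warm pages are plain memory.
	data, err := unix.Mmap(int(f.Fd()), 0, int(st.Size()), unix.PROT_READ, unix.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer unix.Munmap(data)

	var sum byte
	for i := 0; i < len(data); i += 4096 { // touch one byte per page
		sum += data[i]
	}
	fmt.Println(len(data), "bytes mapped, checksum-ish:", sum)
}
```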
Layer 4 — Database buffer pool
Databases don't usually trust the OS page cache alone. They run their own internal cache of hot data pages — InnoDB calls it innodb_buffer_pool_size, Postgres calls it shared_buffers, SQL Server calls it the buffer pool. The DB knows things the kernel doesn't (which pages are part of the same B-tree, which indexes are hot, which queries are scanning what), so it caches with smarter eviction.
This is why every "Postgres is slow" investigation starts with: does your working set fit in shared_buffers? Same query against the same table, hot vs cold buffer pool, can be 100× different. The cache is doing its job; it just hasn't been warmed.
Layer 5 — In-process application cache
This is the cache your code holds inside the same process. A HashMap<K,V> wrapped in something that evicts. Caffeine in Java, lru-cache in Node, functools.lru_cache in Python, Go's sync.Map with a custom eviction layer.
This is the layer that's underused. People reach for Redis when a 1 MB lookup table that changes once a day could just be a Caffeine cache. Network round-trip vs in-process map lookup is roughly 1000× difference in latency.
The trade-off: every replica has its own cache, so invalidation across replicas requires either short TTLs, a pub/sub channel ("the data changed, drop your cache"), or a versioned key approach. The right choice depends on freshness requirements.
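For scale, the whole thing fits in a page of code. A minimal generic TTL cache in Go (no size bound, no LRU, purely a sketch of the idea); a real library such as Caffeine, ristretto, or lru-cache gives you bounded memory and better eviction:

```go
package cache

import (
	"sync"
	"time"
)

// TTLCache is the simplest useful in-process cache: a map behind a
// mutex with per-entry expiry. Every replica gets its own copy, which
// is exactly the invalidation trade-off described above.
type TTLCache[K comparable, V any] struct {
	mu  sync.RWMutex
	ttl time.Duration
	m   map[K]entry[V]
}

type entry[V any] struct {
	val     V
	expires time.Time
}

func New[K comparable, V any](ttl time.Duration) *TTLCache[K, V] {
	return &TTLCache[K, V]{ttl: ttl, m: make(map[K]entry[V])}
}

// Get returns the cached value if present and not expired.
func (c *TTLCache[K, V]) Get(k K) (V, bool) {
	c.mu.RLock()
	e, ok := c.m[k]
	c.mu.RUnlock()
	if !ok || time.Now().After(e.expires) {
		var zero V
		return zero, false // miss or expired; caller reloads from the source
	}
	return e.val, true
}

// Set stores a value with the cache's TTL. Expired entries are simply
// overwritten; a background sweeper would be needed to reclaim memory.
func (c *TTLCache[K, V]) Set(k K, v V) {
	c.mu.Lock()
	c.m[k] = entry[V]{val: v, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
}
```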
Layer 6 — Distributed cache (Redis, Memcached)
Now we get to the layer everyone calls "the cache." Distributed caches solve a different problem than in-process caches: shared state across many app instances. Session data, rate-limit counters, computed results for work that is expensive to redo when another host has already paid for it once.
The win: every replica sees the same cache. Set a key on host A, read it from host B. Cache hit ratios stay high because the cache is shared, not duplicated.
The cost: every lookup is a network call. ~0.5–1 ms in the same data center. That's 1000× slower than an in-process map. So the rule of thumb is: cache things in Redis when (a) the underlying compute is much more expensive than 1 ms, or (b) you need cross-host visibility. Don't cache a string concatenation in Redis. You'd be paying network latency to save nanoseconds of CPU.
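Here is what that rule of thumb looks like as code: a cache-aside read with go-redis (assuming the v9 client; the key scheme, TTL, and loadFromDB are hypothetical stand-ins):

```go
package pricing

import (
	"context"
	"errors"
	"time"

	"github.com/redis/go-redis/v9"
)

// GetPrice is cache-aside: try Redis first, fall back to the database,
// then populate the cache so every other app instance benefits.
func GetPrice(ctx context.Context, rdb *redis.Client, sku string) (string, error) {
	key := "price:" + sku

	val, err := rdb.Get(ctx, key).Result()
	if err == nil {
		return val, nil // shared cache hit: ~0.5–1 ms, no DB work anywhere
	}
	if !errors.Is(err, redis.Nil) {
		return "", err // Redis unreachable: decide whether to fail open to the DB
	}

	// Miss: pay for the expensive query once, then share the result.
	val, err = loadFromDB(ctx, sku)
	if err != nil {
		return "", err
	}
	// A failed cache write should not fail the request; ignore the error.
	_ = rdb.Set(ctx, key, val, 5*time.Minute).Err()
	return val, nil
}

// loadFromDB stands in for the real (expensive) query.
func loadFromDB(ctx context.Context, sku string) (string, error) {
	return "9.99", nil
}
```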
Layer 7 — HTTP caching
This is the layer most backend engineers underuse. HTTP has had built-in caching semantics for thirty years. Every browser, every proxy, every CDN, every cache-aware library understands them. Cache-Control, ETag, If-None-Match, Last-Modified, If-Modified-Since — these are not legacy. They are why a properly-configured asset endpoint never hits your origin twice for the same client.
What you get for free when you set these headers correctly:
- Browser cache for repeat visits (zero network).
- Reverse proxy / CDN cache for everyone in the same region.
- Conditional revalidation (304 Not Modified) — full RTT but tiny body. (See the handler sketch after this list.)
- Stale-while-revalidate — clients get instant responses while the cache refreshes in the background. This is criminally underused.
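Here is roughly what conditional revalidation looks like from the server side, as a Go net/http sketch. The route and payload are invented; the headers are the standard ones.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"net/http"
)

// serveConfig sends a small JSON blob with an ETag. When the client
// (or a CDN) presents the same tag in If-None-Match, it gets a 304 and
// no body: a full round-trip, but a tiny response.
func serveConfig(w http.ResponseWriter, r *http.Request) {
	body := []byte(`{"feature_x":true}`)
	etag := fmt.Sprintf(`"%x"`, sha256.Sum256(body))

	w.Header().Set("ETag", etag)
	w.Header().Set("Cache-Control", "public, max-age=60") // reuse freely for a minute

	if r.Header.Get("If-None-Match") == etag {
		w.WriteHeader(http.StatusNotModified)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.Write(body)
}

func main() {
	http.HandleFunc("/config", serveConfig)
	http.ListenAndServe(":8080", nil)
}
```

A production handler also deals with comma-separated If-None-Match lists; for static files, http.ServeContent does the conditional logic for you once you set the ETag.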
Layer 8 — CDN / edge cache
A CDN is an HTTP cache geographically distributed close to your users. CloudFront, Cloudflare, Fastly, Akamai. The math is simple: a user in Mumbai talking to your origin in Virginia is paying 200 ms of round-trip per request. Same user hitting a CDN POP in Mumbai pays 5 ms. The cache layer that physically moves the data closer to the user is doing more work than any application-level cache can.
What actually goes on a CDN:
- Static assets — JS, CSS, images. Obvious.
- API responses with cache-friendly headers — surprisingly often, GET endpoints can be safely edge-cached for seconds to minutes. Stale-while-revalidate makes this nearly invisible to users. (A header sketch follows this list.)
- Whole pages — for content sites, edge-render or edge-cache the HTML. If-None-Match from the edge to the origin handles invalidation cheaply.
- Programmable edge logic — Cloudflare Workers, Lambda@Edge — cache + transform at the edge. Powerful and easy to misuse.
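For the "edge-cache a GET endpoint for a minute" case, the whole trick is one header. A hedged sketch (the directive values and route are illustrative; check what your CDN honors, since support for stale-while-revalidate varies):

```go
package api

import "net/http"

// ListTopProducts is safe to cache at the edge: shared caches may keep
// it for 60 s (s-maxage) and may keep serving the stale copy for up to
// 5 more minutes while they refetch in the background.
func ListTopProducts(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Cache-Control", "public, s-maxage=60, stale-while-revalidate=300")
	w.Header().Set("Content-Type", "application/json")
	w.Write([]byte(`[{"id":1,"name":"example"}]`))
}
```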
Layer 9 — Browser cache
The cache closest to the user. Disk cache, memory cache, service-worker cache. Driven entirely by HTTP headers (and, for SPAs, Cache-Storage via service workers). The right Cache-Control on your bundle is the difference between "page loads instantly on second visit" and "user redownloads 2 MB every time."
Pattern that works: serve assets at versioned URLs (app.7f2c91.js) with Cache-Control: public, max-age=31536000, immutable. The URL changes when the asset changes; the browser caches the old one essentially forever. No invalidation needed — different content, different URL.
Layer 10 — DNS resolver caching
Don't laugh. DNS is a cache. Every getaddrinfo answer passes through layers of caching (the stub resolver, the OS-level cache, the ISP's recursive resolver) before anything reaches the authoritative server. TTLs control how long an answer is reusable. This is also where caching becomes a liability — long DNS TTLs mean failover after an outage takes minutes to propagate. Short TTLs mean more recursive lookups but faster recovery.
If your DR plan is "we'll change the DNS record," you'd better know what your TTL is, and what your downstream resolvers are doing with it.
Layer 11 — Reverse proxies (Varnish, nginx, Envoy)
Sitting between your CDN and your application is often a proxy that can also cache. nginx with proxy_cache, Varnish dedicated to it, Envoy with HTTP filters. This layer is useful for caching things that aren't quite static (so they don't go to the CDN) but are too expensive to recompute on every request — search-results pages, expensive aggregations, dashboard tiles.
The trade-off vs a CDN: it sits next to your origin rather than next to your users (a hit still saves the recompute and the origin round-trip), it's under your operational control (you can purge, pre-warm, write VCL), but it doesn't move data geographically closer to anyone.
Layer 12 — Database query / plan cache
The database itself caches a few things you can't see from your app:
- Query plan cache. Parsing and planning a SQL query is non-trivial. Prepared statements let the DB cache the plan and reuse it across executions (see the sketch after this list).
- Result cache. Some DBs (older MySQL, some commercial) cache query results. Mostly out of fashion — invalidation hell, didn't scale to write-heavy workloads. MySQL removed it in 8.0.
- Connection-level state. Session caches, temporary tables, cached metadata. Reusing a connection (via a pool) keeps these warm.
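As an example of leaning on the plan cache: prepare once, reuse everywhere. A Go database/sql sketch (the table and query are invented; note that Go's pool transparently re-prepares the statement per underlying connection):

```go
package repo

import (
	"context"
	"database/sql"
)

// UserRepo prepares its hot query once; the database parses and plans
// it a single time and reuses the plan for every execution.
type UserRepo struct {
	byID *sql.Stmt
}

func NewUserRepo(ctx context.Context, db *sql.DB) (*UserRepo, error) {
	stmt, err := db.PrepareContext(ctx, "SELECT name FROM users WHERE id = $1")
	if err != nil {
		return nil, err
	}
	return &UserRepo{byID: stmt}, nil
}

func (r *UserRepo) Name(ctx context.Context, id int64) (string, error) {
	var name string
	err := r.byID.QueryRowContext(ctx, id).Scan(&name)
	return name, err
}
```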
Other places caches hide
- ORM identity map / first-level cache. Hibernate, EF, ActiveRecord all cache loaded objects within a session/transaction. Same row twice = one DB round-trip.
- ORM second-level cache. Cross-session cache, often backed by Caffeine or Redis. Powerful, dangerous if invalidation isn't tight.
- Connection pools. Caching the expensive thing (TCP + auth + TLS handshake) so each request doesn't pay it.
- Compiled regex cache. Most languages cache compiled regexes. Java doesn't when you go through String.matches: the pattern is recompiled on every call. Bug or feature, opinions vary.
- Network stack caches. ARP cache, route cache, conntrack table — invisible until they fill up.
- Build / compiler caches. ccache, sccache, Bazel remote cache, Turborepo. Not runtime, but the same ideas.
The hard part is invalidation
"There are only two hard things in computer science: cache invalidation and naming things." It's a joke and a warning. Invalidation is the part that makes caching dangerous. Every layer above has its own model:
| Layer | Invalidation model | What goes wrong |
|---|---|---|
| CPU caches | Cache-coherence protocol (MESI/MOESI) | False sharing, write-amp on shared lines |
| OS page cache | LRU + writes invalidate pages | Rare; mostly invisible |
| DB buffer pool | Page-level, integrated with the engine | Cold cache after restart, eviction storms |
| In-process cache | TTL, manual invalidate(), pub/sub | Stale data per replica after writes |
| Distributed cache | TTL, write-through, manual delete on update | Race between cache delete and DB write |
| HTTP / CDN | TTL via Cache-Control, ETag revalidation, purge API | Stale content after deploy if purge missed |
| Browser | TTL + URL versioning | Users on old bundles after release; Ctrl+Shift+R ritual |
| DNS | TTL only | Slow failover, "DNS is cached somewhere we don't control" |
The pattern: every cache is a bet that data won't change before the cache expires. The more layers you stack, the more bets you're making in parallel. When the data changes, you have to invalidate every layer that has a copy — or accept stale reads at every one of them. This is why cache invalidation is hard: it's a distributed problem with no global clock.
The patterns you should know by name
Failure modes worth a name
- Thundering herd / dogpile. Cached value expires, 1000 concurrent requests miss simultaneously, all 1000 hit the origin. Mitigation: single-flight request coalescing (see the sketch after this list), jittered TTLs, stale-while-revalidate.
- Hot key. One key gets disproportionate traffic (a celebrity, a popular product). Distributed cache nodes that own that key get crushed. Mitigation: replicate hot keys across nodes, in-process front-cache, sharding by content rather than key.
- Cache stampede on cold start. Deploy / restart, every cache empty, every request hits the DB. Mitigation: pre-warm caches, gradual deploy, request coalescing.
- Negative-cache stampede. A query returns "not found." If you don't cache that, every miss re-queries forever. Cache the negative answer with a short TTL.
- Stale fan-out on deploy. Deploy new code that writes a different shape; old replicas' caches still have old shape. Mitigation: version your cache keys, invalidate aggressively on deploy.
- Write-skew between cache and DB. Update DB, then delete cache key — but the read got there first and re-populated cache with the old value. Mitigation: delete cache after DB write, with double-delete after a short delay; or use write-through.
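The single-flight mitigation for the dogpile and cold-start cases is small enough to show. A sketch using golang.org/x/sync/singleflight (the key scheme and the expensive loader are hypothetical):

```go
package reports

import (
	"context"

	"golang.org/x/sync/singleflight"
)

// group collapses concurrent misses for the same key into a single
// origin call: the first caller does the work, everyone else waits for
// and shares that result instead of stampeding the database.
var group singleflight.Group

func GetReport(ctx context.Context, id string) (string, error) {
	v, err, _ := group.Do("report:"+id, func() (interface{}, error) {
		return buildExpensiveReport(ctx, id)
	})
	if err != nil {
		return "", err
	}
	return v.(string), nil
}

// buildExpensiveReport stands in for the slow query or render you are
// trying to protect.
func buildExpensiveReport(ctx context.Context, id string) (string, error) {
	return "report for " + id, nil
}
```

Combine it with jittered TTLs and stale-while-revalidate and you've covered most dogpile scenarios.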
A decision framework: which layer to cache at
| If the bottleneck is… | The right cache layer is… | Why |
|---|---|---|
| Hot loop touching memory | CPU cache (data layout) | Don't add a service. Make the data fit a cache line. |
| Reading the same file repeatedly | OS page cache (already happening) | Just give the server enough RAM. |
| Same DB query, same row, repeatedly | DB buffer pool, then in-process | Buffer pool is free. In-process beats Redis if the row is hot per replica. |
| Same query result across replicas | Distributed cache (Redis / Memcached) | Shared state is the whole point of going over the network. |
| Static asset to many users | HTTP cache + CDN | Move the bytes geographically. Free, standard, scales infinitely. |
| Expensive computation per request | In-process or distributed, depending on size + freshness | Memoize the expensive function. TTL it. |
| Repeat reads from the same client | Browser cache + ETag | The client already has it. Just say "still fresh." |
| Slow DNS resolution / TLS handshake | Connection pool / DNS pinning | Cache the connection, not the answer. |
What "good caching" looks like in practice
A typical well-cached web stack might look like this for a single user request to a product page:
- Browser cache hit on JS/CSS/images — zero network for assets. (HTTP cache.)
- HTML served from CDN edge — 5–20 ms TTFB from a nearby POP. (CDN cache.)
- API call from page → reverse proxy with short-lived cache for popular endpoints. (Proxy cache.)
- Application server: in-process Caffeine cache for feature flags, config, lookup tables. (In-process.)
- Application server: Redis for shared session, rate limits, computed query results. (Distributed.)
- If Redis miss: query Postgres, where the index pages are in shared_buffers and the file pages are in the OS page cache. (DB + OS.)
- Postgres returns rows; CPU L1/L2 keep the hot tuples warm during processing. (CPU.)
The user sees a fast page. Behind it, a dozen caches did their job. None of them were "the cache." They were a cache system.
Closing take
The next time someone says "we need a cache," ask: at which layer. The answer "Redis" is sometimes right and often a habit. The cheaper answer is usually one or two layers up from where the discussion started.
Caching is plumbing. The real skill is knowing which pipe to add and which to leave alone, knowing how each one fails, and remembering that every cache is a small lie about freshness that you've decided to live with. The lies that work are the ones whose invalidation you've thought through. The ones that bite you are always the layer you forgot was even caching.