Here's the conversation that happens too often. "The page is slow." "Add Redis." Six weeks later, the page is still slow, the Redis bill is real, and the actual bottleneck was a CPU cache miss on a 50-MB object you were memcpy'ing on every request. Caching is one of those words that means a dozen things and gets used as if it means one. This post walks the whole stack, from L1 cache to CDN, and tries to give a senior-engineer mental model of what each layer is actually for.
Why this matters: the latency budget tells the whole story
The numbers below are approximate, vary by hardware/network, and have been cribbed from a thousand "latency numbers every programmer should know" lists. Internalize the order of magnitude, not the exact figure.
| Operation | Approx. latency | What it tells you |
|---|---|---|
| CPU register / L1 cache hit | ~1 ns | Free. You can do a billion of these per second per core. |
| L2 cache hit | ~3–5 ns | Still essentially free. |
| L3 cache hit | ~10–30 ns | Cheap, but already 10–30× slower than L1. |
| Main memory (DRAM) | ~100 ns | 100× slower than L1. This is where cache-friendly code matters. |
| Branch misprediction | ~5–15 ns | Cheap in isolation; adds up fast in branch-heavy hot loops. |
| OS page cache hit | ~100 ns – 1 µs | Reading a "hot" file is just a memcpy from kernel RAM. |
| SSD random read | ~50–150 µs | ~1000× slower than DRAM. Every cache miss to disk hurts. |
| In-process map lookup (Java/Go) | ~50–500 ns | Closest you get to L1 for application data. |
| Same-DC network round-trip | ~0.5–1 ms | This is what Redis actually costs you. |
| Cross-region round-trip | ~30–100 ms | Don't put your cache here unless you mean it. |
| CDN edge → end user | ~5–30 ms | Hard to beat for static / cacheable content. |
Layer 1 — CPU caches (L1, L2, L3)
Every CPU has a tiny pyramid of caches sitting between the cores and main memory. Reads and writes fetch and flush in cache lines — usually 64 bytes — so when you load one byte of an array, you actually get the next 63 for free. This is why iterating an array of structs is dramatically faster than iterating a linked list of nodes scattered across the heap. Same logical operation; cache-friendliness is the difference.
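To make that concrete, here is a minimal Go sketch of the two layouts. The types and sizes are invented for illustration; the point is the memory access pattern, not the specific numbers.

```go
package main

import "fmt"

// Point values in a slice are laid out contiguously, so iterating them
// streams whole 64-byte cache lines of useful data and the hardware
// prefetcher can stay ahead of the loop.
type Point struct{ X, Y int64 }

// node holds the same data as a linked list; every element is a separate
// heap allocation, so traversal chases pointers and tends to take a
// cache miss per hop once the nodes are scattered.
type node struct {
	val  Point
	next *node
}

func sumSlice(ps []Point) int64 {
	var s int64
	for i := range ps {
		s += ps[i].X // sequential, cache-line-friendly access
	}
	return s
}

func sumList(head *node) int64 {
	var s int64
	for n := head; n != nil; n = n.next {
		s += n.val.X // pointer chase: each hop may be a fresh miss
	}
	return s
}

func main() {
	const n = 1 << 20
	ps := make([]Point, n)
	var head *node
	for i := 0; i < n; i++ {
		ps[i] = Point{X: int64(i)}
		head = &node{val: Point{X: int64(i)}, next: head}
	}
	fmt.Println(sumSlice(ps), sumList(head))
}
```

Benchmark the two with Go's testing.B on a list whose nodes were allocated out of order and the gap is usually several-fold. Same algorithm, different layout.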
The kind of bug that bites at this layer:
- False sharing. Two threads write to two different fields that happen to live in the same cache line. The cache-coherence protocol bounces the line between cores on every write. Throughput craters, no lock visible. Padding the struct fixes it (see the sketch after this list).
- Pointer-chasing. Linked structures scattered across RAM force a fresh cache miss for every node. A flat array with the same data is often 10× faster despite "the same algorithm."
- Working set bigger than L3. When your hot data exceeds the last-level cache, every iteration goes to DRAM. Sometimes the right cache fix is "make the data smaller," not "add a service."
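For the false-sharing case specifically, the fix really is just padding. A sketch in Go (the field sizes assume 64-byte cache lines, which is typical but not universal):

```go
package main

import (
	"sync"
	"sync/atomic"
)

// Without padding, adjacent counters in the slice share cache lines, so
// goroutines incrementing "their own" counter still bounce the line
// between cores. Padding each counter out to a full line removes the
// contention.
type paddedCounter struct {
	n atomic.Int64
	_ [56]byte // 8-byte payload + 56 bytes of padding = one 64-byte line
}

func bump(workers, iters int) []paddedCounter {
	counters := make([]paddedCounter, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				counters[w].n.Add(1) // each goroutine owns a whole cache line
			}
		}(w)
	}
	wg.Wait()
	return counters
}

func main() {
	_ = bump(4, 1_000_000)
}
```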
Layer 2 — TLB (Translation Lookaside Buffer)
Often forgotten. The TLB caches virtual-to-physical address translations. Every memory access goes through it. A TLB miss means walking the page table — which itself can miss in the data cache, which means a real DRAM read just to figure out where another DRAM read should go.
You don't tune the TLB directly, but you affect it. Workloads with huge working sets and small (4 KB) pages thrash the TLB. Huge pages (2 MB or 1 GB pages) drastically reduce TLB pressure for memory-heavy databases — which is why Postgres, Redis, and MongoDB all have docs about transparent huge pages. (Spoiler: Redis tells you to disable THP because of latency spikes, while Postgres can benefit. The right answer depends on the workload.)
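If you want to see what a given Linux box is doing, the active THP mode is exposed in sysfs. A small sketch (the path is the standard Linux one; interpreting the bracketed value is between you and your database's docs):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Linux reports the transparent-huge-page mode as a line like
// "always madvise [never]"; the bracketed word is the active setting.
const thpPath = "/sys/kernel/mm/transparent_hugepage/enabled"

func main() {
	b, err := os.ReadFile(thpPath)
	if err != nil {
		fmt.Println("could not read THP mode (non-Linux box?):", err)
		return
	}
	fmt.Println("transparent huge pages:", strings.TrimSpace(string(b)))
}
```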
Layer 3 — OS page cache
This is the layer most people forget exists. The kernel keeps recently-read file pages in RAM. When you read(fd, ...) a file, the kernel checks the page cache first; if the pages are there, it's just a memcpy from kernel memory to your buffer — no disk I/O.
This is why a server that's been up for hours has microsecond reads on the same file that took milliseconds right after boot — the second read is a memcpy, the first was actual disk I/O. It's also why Linux happily shows "all RAM used" on a healthy server: the kernel filled it with cached file pages, and will free them the moment any process needs more.
The page cache is invisible to your app — you don't allocate it, you don't manage it, you just benefit from it. But it shapes performance dramatically:
- Database files, log files, static assets — all benefit automatically.
- Memory-mapped (mmap) files give you a pointer that's backed by the page cache. Read it like memory; the kernel pages from disk on demand. (See the sketch after this list.)
- If you do O_DIRECT I/O, you bypass the page cache entirely. Sometimes that's what you want (databases manage their own buffer pool); usually it's a footgun.
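A minimal mmap sketch in Go, using golang.org/x/sys/unix. The file path is illustrative; the point is that the returned byte slice is just page-cache-backed memory: run it twice and the second pass is dramatically faster because the pages are already resident.

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Any readable, non-empty file works; the path is just an example.
	f, err := os.Open("/var/log/syslog")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	st, err := f.Stat()
	if err != nil {
		panic(err)
	}

	// The returned slice is backed by the OS page cache: cold pages are
	// faulted in from disk on first touch, warm pages are plain memory.
	data, err := unix.Mmap(int(f.Fd()), 0, int(st.Size()), unix.PROT_READ, unix.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer unix.Munmap(data)

	var sum byte
	for i := 0; i < len(data); i += 4096 { // touch one byte per page
		sum += data[i]
	}
	fmt.Println(len(data), "bytes mapped, checksum-ish:", sum)
}
```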
Layer 4 — Database buffer pool
Databases don't usually trust the OS page cache alone. They run their own internal cache of hot data pages — InnoDB calls it innodb_buffer_pool_size, Postgres calls it shared_buffers, SQL Server calls it the buffer pool. The DB knows things the kernel doesn't (which pages are part of the same B-tree, which indexes are hot, which queries are scanning what), so it caches with smarter eviction.
This is why every "Postgres is slow" investigation starts with: does your working set fit in shared_buffers? Same query against the same table, hot vs cold buffer pool, can be 100× different. The cache is doing its job; it just hasn't been warmed.
Layer 5 — In-process application cache
This is the cache your code holds inside the same process. A HashMap<K,V> wrapped in something that evicts. Caffeine in Java, lru-cache in Node, functools.lru_cache in Python, Go's sync.Map with a custom eviction layer.
This is the layer that's underused. People reach for Redis when a 1 MB lookup table that changes once a day could just be a Caffeine cache. Network round-trip vs in-process map lookup is roughly 1000× difference in latency.
The trade-off: every replica has its own cache, so invalidation across replicas requires either short TTLs, a pub/sub channel ("the data changed, drop your cache"), or a versioned key approach. The right choice depends on freshness requirements.
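For scale, the whole thing fits in a page of code. A minimal generic TTL cache in Go (no size bound, no LRU, purely a sketch of the idea); a real library such as Caffeine, ristretto, or lru-cache gives you bounded memory and better eviction:

```go
package cache

import (
	"sync"
	"time"
)

// TTLCache is the simplest useful in-process cache: a map behind a
// mutex with per-entry expiry. Every replica gets its own copy, which
// is exactly the invalidation trade-off described above.
type TTLCache[K comparable, V any] struct {
	mu  sync.RWMutex
	ttl time.Duration
	m   map[K]entry[V]
}

type entry[V any] struct {
	val     V
	expires time.Time
}

func New[K comparable, V any](ttl time.Duration) *TTLCache[K, V] {
	return &TTLCache[K, V]{ttl: ttl, m: make(map[K]entry[V])}
}

// Get returns the cached value if present and not expired.
func (c *TTLCache[K, V]) Get(k K) (V, bool) {
	c.mu.RLock()
	e, ok := c.m[k]
	c.mu.RUnlock()
	if !ok || time.Now().After(e.expires) {
		var zero V
		return zero, false // miss or expired; caller reloads from the source
	}
	return e.val, true
}

// Set stores a value with the cache's TTL. Expired entries are simply
// overwritten; a background sweeper would be needed to reclaim memory.
func (c *TTLCache[K, V]) Set(k K, v V) {
	c.mu.Lock()
	c.m[k] = entry[V]{val: v, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
}
```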
Layer 6 — Distributed cache (Redis, Memcached)
Now we get to the layer everyone calls "the cache." Distributed caches solve a different problem than in-process caches: shared state across many app instances. Session data, rate-limit counters, computed results for work that is expensive to redo when another host has already paid for it once.
The win: every replica sees the same cache. Set a key on host A, read it from host B. Cache hit ratios stay high because the cache is shared, not duplicated.
The cost: every lookup is a network call. ~0.5–1 ms in the same data center. That's 1000× slower than an in-process map. So the rule of thumb is: cache things in Redis when (a) the underlying compute is much more expensive than 1 ms, or (b) you need cross-host visibility. Don't cache a string concatenation in Redis. You'd be paying network latency to save nanoseconds of CPU.
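Here is what that rule of thumb looks like as code: a cache-aside read with go-redis (assuming the v9 client; the key scheme, TTL, and loadFromDB are hypothetical stand-ins):

```go
package pricing

import (
	"context"
	"errors"
	"time"

	"github.com/redis/go-redis/v9"
)

// GetPrice is cache-aside: try Redis first, fall back to the database,
// then populate the cache so every other app instance benefits.
func GetPrice(ctx context.Context, rdb *redis.Client, sku string) (string, error) {
	key := "price:" + sku

	val, err := rdb.Get(ctx, key).Result()
	if err == nil {
		return val, nil // shared cache hit: ~0.5–1 ms, no DB work anywhere
	}
	if !errors.Is(err, redis.Nil) {
		return "", err // Redis unreachable: decide whether to fail open to the DB
	}

	// Miss: pay for the expensive query once, then share the result.
	val, err = loadFromDB(ctx, sku)
	if err != nil {
		return "", err
	}
	// A failed cache write should not fail the request; ignore the error.
	_ = rdb.Set(ctx, key, val, 5*time.Minute).Err()
	return val, nil
}

// loadFromDB stands in for the real (expensive) query.
func loadFromDB(ctx context.Context, sku string) (string, error) {
	return "9.99", nil
}
```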
Layer 7 — HTTP caching
This is the layer most backend engineers underuse. HTTP has had built-in caching semantics for thirty years. Every browser, every proxy, every CDN, every cache-aware library understands them. Cache-Control, ETag, If-None-Match, Last-Modified, If-Modified-Since — these are not legacy. They are why a properly-configured asset endpoint never hits your origin twice for the same client.
What you get for free when you set these headers correctly:
- Browser cache for repeat visits (zero network).
- Reverse proxy / CDN cache for everyone in the same region.
- Conditional revalidation (304 Not Modified) — full RTT but tiny body. (See the handler sketch after this list.)
- Stale-while-revalidate — clients get instant responses while the cache refreshes in the background. This is criminally underused.
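Here is roughly what conditional revalidation looks like from the server side, as a Go net/http sketch. The route and payload are invented; the headers are the standard ones.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"net/http"
)

// serveConfig sends a small JSON blob with an ETag. When the client
// (or a CDN) presents the same tag in If-None-Match, it gets a 304 and
// no body: a full round-trip, but a tiny response.
func serveConfig(w http.ResponseWriter, r *http.Request) {
	body := []byte(`{"feature_x":true}`)
	etag := fmt.Sprintf(`"%x"`, sha256.Sum256(body))

	w.Header().Set("ETag", etag)
	w.Header().Set("Cache-Control", "public, max-age=60") // reuse freely for a minute

	if r.Header.Get("If-None-Match") == etag {
		w.WriteHeader(http.StatusNotModified)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.Write(body)
}

func main() {
	http.HandleFunc("/config", serveConfig)
	http.ListenAndServe(":8080", nil)
}
```

A production handler also deals with comma-separated If-None-Match lists; for static files, http.ServeContent does the conditional logic for you once you set the ETag.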
Layer 8 — CDN / edge cache
A CDN is an HTTP cache geographically distributed close to your users. CloudFront, Cloudflare, Fastly, Akamai. The math is simple: a user in Mumbai talking to your origin in Virginia is paying 200 ms of round-trip per request. Same user hitting a CDN POP in Mumbai pays 5 ms. The cache layer that physically moves the data closer to the user is doing more work than any application-level cache can.
What actually goes on a CDN:
- Static assets — JS, CSS, images. Obvious.
- API responses with cache-friendly headers — surprisingly often, GET endpoints can be safely edge-cached for seconds to minutes. Stale-while-revalidate makes this nearly invisible to users. (A header sketch follows this list.)
- Whole pages — for content sites, edge-render or edge-cache the HTML. If-None-Match from the edge to the origin handles invalidation cheaply.
- Programmable edge logic — Cloudflare Workers, Lambda@Edge — cache + transform at the edge. Powerful and easy to misuse.
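For the "edge-cache a GET endpoint for a minute" case, the whole trick is one header. A hedged sketch (the directive values and route are illustrative; check what your CDN honors, since support for stale-while-revalidate varies):

```go
package api

import "net/http"

// ListTopProducts is safe to cache at the edge: shared caches may keep
// it for 60 s (s-maxage) and may keep serving the stale copy for up to
// 5 more minutes while they refetch in the background.
func ListTopProducts(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Cache-Control", "public, s-maxage=60, stale-while-revalidate=300")
	w.Header().Set("Content-Type", "application/json")
	w.Write([]byte(`[{"id":1,"name":"example"}]`))
}
```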
Layer 9 — Browser cache
The cache closest to the user. Disk cache, memory cache, service-worker cache. Driven entirely by HTTP headers (and, for SPAs, Cache-Storage via service workers). The right Cache-Control on your bundle is the difference between "page loads instantly on second visit" and "user redownloads 2 MB every time."
Pattern that works: serve assets at versioned URLs (app.7f2c91.js) with Cache-Control: public, max-age=31536000, immutable. The URL changes when the asset changes; the browser caches the old one essentially forever. No invalidation needed — different content, different URL.
Layer 10 — DNS resolver caching
Don't laugh. DNS is a cache. Every getaddrinfo answer passes through layers of caching (the stub resolver, the OS-level cache, the ISP's recursive resolver) before anything reaches the authoritative server. TTLs control how long an answer is reusable. This is also where caching becomes a liability — long DNS TTLs mean failover after an outage takes minutes to propagate. Short TTLs mean more recursive lookups but faster recovery.
If your DR plan is "we'll change the DNS record," you'd better know what your TTL is, and what your downstream resolvers are doing with it.
Layer 11 — Reverse proxies (Varnish, nginx, Envoy)
Sitting between your CDN and your application is often a proxy that can also cache. nginx with proxy_cache, Varnish dedicated to it, Envoy with HTTP filters. This layer is useful for caching things that aren't quite static (so they don't go to the CDN) but are too expensive to recompute on every request — search-results pages, expensive aggregations, dashboard tiles.
The trade-off vs a CDN: it sits next to your origin rather than next to your users (a hit still saves the recompute and the origin round-trip), it's under your operational control (you can purge, pre-warm, write VCL), but it doesn't move data geographically closer to anyone.
Layer 12 — Database query / plan cache
The database itself caches a few things you can't see from your app:
- Query plan cache. Parsing and planning a SQL query is non-trivial. Prepared statements let the DB cache the plan and reuse it across executions (see the sketch after this list).
- Result cache. Some DBs (older MySQL, some commercial) cache query results. Mostly out of fashion — invalidation hell, didn't scale to write-heavy workloads. MySQL removed it in 8.0.
- Connection-level state. Session caches, temporary tables, cached metadata. Reusing a connection (via a pool) keeps these warm.
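As an example of leaning on the plan cache: prepare once, reuse everywhere. A Go database/sql sketch (the table and query are invented; note that Go's pool transparently re-prepares the statement per underlying connection):

```go
package repo

import (
	"context"
	"database/sql"
)

// UserRepo prepares its hot query once; the database parses and plans
// it a single time and reuses the plan for every execution.
type UserRepo struct {
	byID *sql.Stmt
}

func NewUserRepo(ctx context.Context, db *sql.DB) (*UserRepo, error) {
	stmt, err := db.PrepareContext(ctx, "SELECT name FROM users WHERE id = $1")
	if err != nil {
		return nil, err
	}
	return &UserRepo{byID: stmt}, nil
}

func (r *UserRepo) Name(ctx context.Context, id int64) (string, error) {
	var name string
	err := r.byID.QueryRowContext(ctx, id).Scan(&name)
	return name, err
}
```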
Other places caches hide
- ORM identity map / first-level cache. Hibernate, EF, ActiveRecord all cache loaded objects within a session/transaction. Same row twice = one DB round-trip.
- ORM second-level cache. Cross-session cache, often backed by Caffeine or Redis. Powerful, dangerous if invalidation isn't tight.
- Connection pools. Caching the expensive thing (TCP + auth + TLS handshake) so each request doesn't pay it.
- Compiled regex cache. Most languages cache compiled regexes. Java doesn't when you go through String.matches: the pattern is recompiled on every call. Bug or feature, opinions vary.
- Network stack caches. ARP cache, route cache, conntrack table — invisible until they fill up.
- Build / compiler caches. ccache, sccache, Bazel remote cache, Turborepo. Not runtime, but the same ideas.
The hard part is invalidation
"There are only two hard things in computer science: cache invalidation and naming things." It's a joke and a warning. Invalidation is the part that makes caching dangerous. Every layer above has its own model:
| Layer | Invalidation model | What goes wrong |
|---|---|---|
| CPU caches | Cache-coherence protocol (MESI/MOESI) | False sharing, write-amp on shared lines |
| OS page cache | LRU + writes invalidate pages | Rare; mostly invisible |
| DB buffer pool | Page-level, integrated with the engine | Cold cache after restart, eviction storms |
| In-process cache | TTL, manual invalidate(), pub/sub | Stale data per replica after writes |
| Distributed cache | TTL, write-through, manual delete on update | Race between cache delete and DB write |
| HTTP / CDN | TTL via Cache-Control, ETag revalidation, purge API | Stale content after deploy if purge missed |
| Browser | TTL + URL versioning | Users on old bundles after release; Ctrl+Shift+R ritual |
| DNS | TTL only | Slow failover, "DNS is cached somewhere we don't control" |
The pattern: every cache is a bet that data won't change before the cache expires. The more layers you stack, the more bets you're making in parallel. When the data changes, you have to invalidate every layer that has a copy — or accept stale reads at every one of them. This is why cache invalidation is hard: it's a distributed problem with no global clock.
The patterns you should know by name
Failure modes worth a name
- Thundering herd / dogpile. Cached value expires, 1000 concurrent requests miss simultaneously, all 1000 hit the origin. Mitigation: single-flight request coalescing (see the sketch after this list), jittered TTLs, stale-while-revalidate.
- Hot key. One key gets disproportionate traffic (a celebrity, a popular product). Distributed cache nodes that own that key get crushed. Mitigation: replicate hot keys across nodes, in-process front-cache, sharding by content rather than key.
- Cache stampede on cold start. Deploy / restart, every cache empty, every request hits the DB. Mitigation: pre-warm caches, gradual deploy, request coalescing.
- Negative-cache stampede. A query returns "not found." If you don't cache that, every miss re-queries forever. Cache the negative answer with a short TTL.
- Stale fan-out on deploy. Deploy new code that writes a different shape; old replicas' caches still have old shape. Mitigation: version your cache keys, invalidate aggressively on deploy.
- Write-skew between cache and DB. Update DB, then delete cache key — but the read got there first and re-populated cache with the old value. Mitigation: delete cache after DB write, with double-delete after a short delay; or use write-through.
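The single-flight mitigation for the dogpile and cold-start cases is small enough to show. A sketch using golang.org/x/sync/singleflight (the key scheme and the expensive loader are hypothetical):

```go
package reports

import (
	"context"

	"golang.org/x/sync/singleflight"
)

// group collapses concurrent misses for the same key into a single
// origin call: the first caller does the work, everyone else waits for
// and shares that result instead of stampeding the database.
var group singleflight.Group

func GetReport(ctx context.Context, id string) (string, error) {
	v, err, _ := group.Do("report:"+id, func() (interface{}, error) {
		return buildExpensiveReport(ctx, id)
	})
	if err != nil {
		return "", err
	}
	return v.(string), nil
}

// buildExpensiveReport stands in for the slow query or render you are
// trying to protect.
func buildExpensiveReport(ctx context.Context, id string) (string, error) {
	return "report for " + id, nil
}
```

Combine it with jittered TTLs and stale-while-revalidate and you've covered most dogpile scenarios.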
A decision framework: which layer to cache at
| If the bottleneck is… | The right cache layer is… | Why |
|---|---|---|
| Hot loop touching memory | CPU cache (data layout) | Don't add a service. Make the data fit a cache line. |
| Reading the same file repeatedly | OS page cache (already happening) | Just give the server enough RAM. |
| Same DB query, same row, repeatedly | DB buffer pool, then in-process | Buffer pool is free. In-process beats Redis if the row is hot per replica. |
| Same query result across replicas | Distributed cache (Redis / Memcached) | Shared state is the whole point of going over the network. |
| Static asset to many users | HTTP cache + CDN | Move the bytes geographically. Free, standard, scales infinitely. |
| Expensive computation per request | In-process or distributed, depending on size + freshness | Memoize the expensive function. TTL it. |
| Repeat reads from the same client | Browser cache + ETag | The client already has it. Just say "still fresh." |
| Slow DNS resolution / TLS handshake | Connection pool / DNS pinning | Cache the connection, not the answer. |
What "good caching" looks like in practice
A typical well-cached web stack might look like this for a single user request to a product page:
- Browser cache hit on JS/CSS/images — zero network for assets. (HTTP cache.)
- HTML served from CDN edge — 5–20 ms TTFB from a nearby POP. (CDN cache.)
- API call from page → reverse proxy with short-lived cache for popular endpoints. (Proxy cache.)
- Application server: in-process Caffeine cache for feature flags, config, lookup tables. (In-process.)
- Application server: Redis for shared session, rate limits, computed query results. (Distributed.)
- If Redis miss: query Postgres, where the index pages are in shared_buffers and the file pages are in the OS page cache. (DB + OS.)
- Postgres returns rows; CPU L1/L2 keep the hot tuples warm during processing. (CPU.)
The user sees a fast page. Behind it, a dozen caches did their job. None of them were "the cache." They were a cache system.
Closing take
The next time someone says "we need a cache," ask: at which layer. The answer "Redis" is sometimes right and often a habit. The cheaper answer is usually one or two layers up from where the discussion started.
Caching is plumbing. The real skill is knowing which pipe to add and which to leave alone, knowing how each one fails, and remembering that every cache is a small lie about freshness that you've decided to live with. The lies that work are the ones whose invalidation you've thought through. The ones that bite you are always the layer you forgot was even caching.