— Writing —

System Design · 6 articles


System Design · Architecture Decisions · 20 min read

How Databases and Servers Shut Down Without Losing Your Data — Signal Handling, Drain, and the 30-Second Clock

Every process you run will be told to stop — the only question is whether it gets a request it can act on or a bullet it never sees coming. A first-principles walk through graceful shutdown via Unix signals: why SIGTERM is catchable and SIGKILL is not, the five-stage drain (catch → stop intake → drain → flush → exit 0) every well-behaved server runs, PostgreSQL's three shutdown modes mapped to three signals, Redis saving its RDB on SIGTERM (and losing everything on SIGKILL / the OOM killer), connection draining in web servers, how Kubernetes wires it together with terminationGracePeriodSeconds and a closing SIGKILL, the minimal correct handler in Go and Node, and the five pitfalls — PID 1 with no signal disposition, work inside the handler, unbounded drains, the endpoint-removal race, and a grace period shorter than your real drain.

Signals · SIGTERM · Graceful Shutdown · PostgreSQL · Jun 6, 2026

System Design·Tradeoffs in Production

TCP Keep-Alive vs Application Heartbeat — Three Different Things Called “Keep-Alive”

Kernel TCP keep-alive probes the network path; app heartbeats prove the peer process is serving; HTTP Connection: keep-alive only reuses sockets. Why NAT, load balancers, and firewalls still drop “idle” long-lived connections, how defaults like Linux’s multi-hour timers compare to prod WebSocket patterns, and when you really need both mechanisms.

TCP · Networking · WebSockets · 16 min read · May 9

System Design·Architecture Decisions

Caching Is Not 'Add Redis' — A Layer-by-Layer Tour from CPU to CDN

A senior-engineer walkthrough of the whole cache stack: CPU L1/L2/L3, TLB, OS page cache, DB buffer pool, in-process LRU (Caffeine), distributed cache (Redis/Memcached), HTTP cache + ETags, CDN edge, browser, DNS, reverse proxies, query plan cache. Latency budgets at every layer, when each layer is the right answer, the patterns (cache-aside, read-through, write-through, write-behind, stale-while-revalidate, single-flight), and the failure modes (thundering herd, hot keys, negative-cache stampedes) — with a decision framework for which layer to actually cache at.

Caching · System Design · Performance · 20 min read · May 2

System Design·Architecture Decisions

Why Your Consistent Hashing Is Still Failing in Prod — and When Range Partitions Are the Right Tool

SVG-backed walkthrough: ring hotspots, successor overload when a node dies, and two ring generations during deploy. Virtual nodes, Redis-style key skew, range split for a fat month, and when range partitions beat a hash ring. Case studies (composite, production-shaped), a deep look at range sharding, and a decision table — with metrics and what actually worked.

Consistent Hashing · Range Partitioning · Distributed Systems · 20 min read · Apr 27

System Design·Architecture Decisions

C10K vs C10M: Two Scaling Problems People Confuse for One

A beginner-friendly walkthrough: what “concurrent” means, what the C10K and C10M problems actually are, how the industry usually fixes them (event loops vs whole-system design), and what still goes wrong in production — blocking I/O, fd limits, reconnect storms, databases, and tail latency.

Concurrency · Networking · epoll · 20 min read · Apr 19

System Design·Architecture Decisions

How Twitter Generates Your Home Feed in 5 Seconds

Fan-out on write vs fan-out on read — and why Twitter uses a hybrid at 500M users. A beginner-friendly breakdown.

System Design · 8 min read · Sep 12
