Engineering Blog · 23 articles

Real Problems. Real Engineering.

Deep dives into database internals, distributed systems, and production incidents. The tradeoffs nobody talks about.

Featured
System Design·Architecture Decisions·20 min read

How Databases and Servers Shut Down Without Losing Your Data — Signal Handling, Drain, and the 30-Second Clock

Every process you run will be told to stop; the only question is whether it gets a request it can act on or a bullet it never sees coming. A first-principles walk through graceful shutdown via Unix signals: why SIGTERM is catchable and SIGKILL is not, the five-stage drain (catch → stop intake → drain → flush → exit 0) every well-behaved server runs, PostgreSQL's three shutdown modes mapped to three signals, Redis saving its RDB on SIGTERM (and losing every write since the last save on SIGKILL or the OOM killer), connection draining in web servers, how Kubernetes wires it together with terminationGracePeriodSeconds and a closing SIGKILL, the minimal correct handler in Go and Node, and the five pitfalls: PID 1 with no default signal handling, work inside the handler, unbounded drains, the endpoint-removal race, and a grace period shorter than your real drain.

Signals · SIGTERM · Graceful Shutdown · PostgreSQL
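
As a taste of the five-stage drain named in the teaser above, here is a minimal Python sketch. It is not the article's Go or Node handler: `ToyServer`, `graceful_shutdown`, the 25-second drain budget, and the choice to return exit code 1 when the drain times out are all assumptions made for illustration.

```python
import signal
import threading
import time

# Set by the signal handler; the main thread does the real shutdown work.
shutdown_requested = threading.Event()


def on_sigterm(signum, frame):
    # Stage 1 (catch): only record the request. No I/O, no locks, no draining here.
    shutdown_requested.set()


class ToyServer:
    """Hypothetical stand-in for a real server; it only counts in-flight work."""

    def __init__(self, in_flight):
        self.accepting = True
        self.in_flight = in_flight
        self.flushed = False

    def stop_intake(self):
        self.accepting = False

    def finish_one(self):
        self.in_flight -= 1

    def flush(self):
        self.flushed = True


def graceful_shutdown(server, drain_timeout_s=25.0):
    """Run stages 2-5 and return the exit code the process should use."""
    server.stop_intake()                              # stage 2: stop intake
    deadline = time.monotonic() + drain_timeout_s
    while server.in_flight > 0 and time.monotonic() < deadline:
        server.finish_one()                           # stage 3: bounded drain
    server.flush()                                    # stage 4: flush
    return 0 if server.in_flight == 0 else 1          # stage 5: exit code

# Typical wiring in a real entry point (kept as comments so importing this
# sketch never blocks):
#   signal.signal(signal.SIGTERM, on_sigterm)
#   shutdown_requested.wait()
#   raise SystemExit(graceful_shutdown(server))
```

Keeping the drain budget below the orchestrator's grace period leaves room for the flush before the closing SIGKILL arrives.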
01
Real Problems·WebSockets & Real-time

Slack Sends Messages Over HTTP, Not Its WebSocket — and the Durability Bug That Forced the Switch

When you press Enter in Slack, the message does not travel over the WebSocket your client is holding open — it goes out as a separate HTTPS POST to chat.postMessage. A first-principles walkthrough of why: what WebSocket.send() actually guarantees (almost nothing), the RFC 6455 gap where there is no application-layer ack, the four ways a WebSocket send dies silently, what HTTP gives you for free (status codes, idempotency keys, L7 load balancing, per-request timeouts), the RTM → Events API migration that made the same rule public for app developers, where Socket Mode fits in, and the “WebSocket as hint, REST as truth” pattern that every serious real-time product converges on.

WebSockets · Real-time · HTTP · 20 min
02
Real Problems·Production Incidents

When AUTO_INCREMENT Stops Incrementing — The MySQL INT Overflow Anatomy

It's 3 AM. Inserts are failing with 'Duplicate entry 2147483647' — but you never inserted a duplicate. Welcome to one of MySQL's most quietly famous outages. A friendly walk through why a signed INT AUTO_INCREMENT dies at 2.1 billion, why the error is misleading, a two-minute laptop repro, the 70% alert that prevents the whole thing — and the two mitigations: ALTER to BIGINT for small tables, and the atomic table-swap senior operators run on billion-row tables in the middle of an outage.

MySQL · Database · Production Incidents · 18 min
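
The arithmetic behind the 2.1 billion ceiling and the 70% alert can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes the next AUTO_INCREMENT value has already been read (for example from the `AUTO_INCREMENT` column of `information_schema.TABLES`).

```python
# Largest value a signed 32-bit INT column can hold.
SIGNED_INT_MAX = 2**31 - 1      # 2147483647
# Largest value a signed 64-bit BIGINT column can hold.
SIGNED_BIGINT_MAX = 2**63 - 1   # 9223372036854775807


def auto_increment_usage(next_id, max_value=SIGNED_INT_MAX):
    """Fraction of the key space already consumed."""
    return next_id / max_value


def should_alert(next_id, threshold=0.70, max_value=SIGNED_INT_MAX):
    """True once usage crosses the alert threshold (70% by default)."""
    return auto_increment_usage(next_id, max_value) >= threshold
```

For example, a table whose next id is 1,600,000,000 sits at roughly 75% of the signed INT range and would trip the alert, while the same id against a BIGINT column is a negligible fraction.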
03
Real Problems·Production Incidents

“We Have Exactly-Once Delivery” — The Lie Your Architecture Is Telling You

Why “our queue guarantees exactly-once” is almost always a lie once your flow crosses a component boundary: Kafka EOS scope, ack-after-process, unique constraints, SQS FIFO, DB transactions with SMTP, retries without idempotency, Redis pub/sub, replays, gRPC — and what works: transactional outbox, idempotency done right, WebSocket patterns, saga steps, and observability metrics that expose the truth.

Distributed Systems · Real Problems · Messaging & Queues · 25 min
04
System Design·Tradeoffs in Production

TCP Keep-Alive vs Application Heartbeat — Three Different Things Called “Keep-Alive”

Kernel TCP keep-alive probes the network path; app heartbeats prove the peer process is serving; HTTP Connection: keep-alive only reuses sockets. Why NAT, load balancers, and firewalls still drop “idle” long-lived connections, how defaults like Linux’s multi-hour timers compare to prod WebSocket patterns, and when you really need both mechanisms.

TCP · Networking · WebSockets · 16 min
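
To make the first two of those mechanisms concrete, here is a small Python sketch: kernel keep-alive is a socket option, while an application heartbeat is just a timestamp check in your own code. The timer values and the helper names are illustrative assumptions, and the `TCP_KEEP*` options are platform-dependent, so each one is guarded.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on kernel TCP keep-alive and, where supported, shorten its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These constants exist on Linux; other platforms may lack them.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


def heartbeat_expired(last_pong_s, now_s, timeout_s=30.0):
    """Application-level check: has the peer process answered recently enough?"""
    return (now_s - last_pong_s) > timeout_s
```

The kernel probes only show that the network path and the remote TCP stack respond; the heartbeat check is what shows the peer process is still serving.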
05
System Design·Architecture Decisions

Caching Is Not 'Add Redis' — A Layer-by-Layer Tour from CPU to CDN

A senior-engineer walkthrough of the whole cache stack: CPU L1/L2/L3, TLB, OS page cache, DB buffer pool, in-process LRU (Caffeine), distributed cache (Redis/Memcached), HTTP cache + ETags, CDN edge, browser, DNS, reverse proxies, query plan cache. Latency budgets at every layer, when each layer is the right answer, the patterns (cache-aside, read-through, write-through, write-behind, stale-while-revalidate, single-flight), and the failure modes (thundering herd, hot keys, negative-cache stampedes) — with a decision framework for which layer to actually cache at.

Caching · System Design · Performance · 20 min
06
Database·Picking the Right DB

How Redis Actually Deletes Expired Keys — The Lazy + Active Hybrid You Didn't Know About

A senior-engineer walkthrough of how Redis really handles TTLs. Why TTL=0 doesn't free memory. The two strategies — lazy (on-access) and active (the random-sample-and-25%-threshold sampler). The hz parameter, active-expire-effort, lazyfree-lazy-expire. DEL vs UNLINK and why big-key expiry used to pause the event loop. How replicas handle expiration via the replication stream. maxmemory + eviction policies as the actual safety net. The metrics from INFO that tell you when the sampler is falling behind.

Database · Redis · Caching · 15 min
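
The lazy plus active hybrid can be modelled with a toy store in Python. This is a simplified simulation, not Redis source code: `ToyExpiringStore` is hypothetical, the sample size of 20 is an assumption, and `max_rounds` stands in for the time budget a real server would enforce per cycle.

```python
import random


class ToyExpiringStore:
    """Toy model of lazy (on-access) and active (sampled) key expiry."""

    def __init__(self, rng=None):
        self.data = {}
        self.expires = {}                 # key -> absolute expiry time
        self.rng = rng or random.Random(0)

    def set(self, key, value, expire_at=None):
        self.data[key] = value
        if expire_at is None:
            self.expires.pop(key, None)
        else:
            self.expires[key] = expire_at

    def _expire_if_due(self, key, now):
        if key in self.expires and self.expires[key] <= now:
            del self.data[key]
            del self.expires[key]
            return True
        return False

    def get(self, key, now):
        # Lazy path: an expired key is only removed when someone touches it.
        self._expire_if_due(key, now)
        return self.data.get(key)

    def active_expire_cycle(self, now, sample_size=20, threshold=0.25, max_rounds=16):
        # Active path: sample keys that carry a TTL, delete the expired ones,
        # and repeat while more than `threshold` of the sample was expired.
        removed = 0
        for _ in range(max_rounds):
            keys = list(self.expires)
            if not keys:
                break
            sample = self.rng.sample(keys, min(sample_size, len(keys)))
            expired = sum(self._expire_if_due(k, now) for k in sample)
            removed += expired
            if expired <= threshold * len(sample):
                break
        return removed
```

A key that nobody reads and the sampler never picks stays in memory past its TTL, which is why the teaser calls maxmemory and the eviction policy the actual safety net.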
07
Database·Picking the Right DB

LSM Trees, the Deep Cut — MemTable, WAL, SSTables, Compaction and Why Your Writes Are So Fast

A senior-engineer walkthrough of Log-Structured Merge Trees: MemTable + WAL durability, SSTable layout with sparse index and bloom filters, the L0→Ln level hierarchy, leveled vs tiered vs FIFO compaction, the three amplifications (write/read/space), bloom filter math, and the RocksDB and Cassandra tuning knobs that actually matter. Why RocksDB, LevelDB, Cassandra, HBase, ScyllaDB, ClickHouse and TiKV all picked it — and when you should not.

Database · Storage Engine · LSM Tree · 22 min
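
For the bloom filter math mentioned above, the standard approximation is p ≈ (1 − e^(−kn/m))^k for n keys, m bits, and k hash functions, with the optimum near k = (m/n)·ln 2. The Python sketch below just evaluates those formulas; the function names and the 10-bits-per-key example are assumptions, not figures from the article.

```python
import math


def bloom_false_positive_rate(bits_per_key, num_hashes):
    """Approximate false-positive rate: (1 - e^(-k / bits_per_key)) ** k."""
    return (1.0 - math.exp(-num_hashes / bits_per_key)) ** num_hashes


def optimal_num_hashes(bits_per_key):
    """Hash count that minimises the rate: (m/n) * ln 2, rounded to an integer."""
    return max(1, round(bits_per_key * math.log(2)))
```

With 10 bits per key the optimum works out to 7 hash functions and a false-positive rate a little under 1%, so most point lookups for absent keys can skip an SSTable without touching disk.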
08
Database·Query Optimization

"Push It All on the Database" — The Lie That Works Until It Doesn't

Every team ships with the same lie: throw everything on the database and it'll handle correctness. Works at 50 writes/sec. At 5,000 writes/sec, every FK is a lock queue, every cascade synchronously deletes thousands, fsync becomes the clock. Why staging never catches it, ten places where the bill shows up, the cost curve of a hot FK, why cascades are dangerous, how to diagnose your bottleneck, and the enforcer+checker pattern senior teams use to move constraints off the database without losing correctness.

Database · Performance · MySQL · 18 min