— Writing —

All topics · 23 articles
System Design·Architecture Decisions·20 min read

How Databases and Servers Shut Down Without Losing Your Data — Signal Handling, Drain, and the 30-Second Clock

Every process you run will be told to stop — the only question is whether it gets a request it can act on or a bullet it never sees coming. A first-principles walk through graceful shutdown via Unix signals: why SIGTERM is catchable and SIGKILL is not, the five-stage drain (catch → stop intake → drain → flush → exit 0) every well-behaved server runs, PostgreSQL's three shutdown modes mapped to three signals, Redis saving its RDB on SIGTERM (and losing everything on SIGKILL / the OOM killer), connection draining in web servers, how Kubernetes wires it together with terminationGracePeriodSeconds and a closing SIGKILL, the minimal correct handler in Go and Node, and the five pitfalls — PID 1 with no signal disposition, work inside the handler, unbounded drains, the endpoint-removal race, and a grace period shorter than your real drain.

Signals · SIGTERM · Graceful Shutdown · PostgreSQL
Real Problems·WebSockets & Real-time

Slack Sends Messages Over HTTP, Not Its WebSocket — and the Durability Bug That Forced the Switch

When you press Enter in Slack, the message does not travel over the WebSocket your client is holding open — it goes out as a separate HTTPS POST to chat.postMessage. A first-principles walkthrough of why: what WebSocket.send() actually guarantees (almost nothing), the RFC 6455 gap where there is no application-layer ack, the four ways a WebSocket send dies silently, what HTTP gives you for free (status codes, idempotency keys, L7 load balancing, per-request timeouts), the RTM → Events API migration that made the same rule public for app developers, where Socket Mode fits in, and the “WebSocket as hint, REST as truth” pattern that every serious real-time product converges on.

WebSockets · Real-time · HTTP · 20 min read
Real Problems·Production Incidents

When AUTO_INCREMENT Stops Incrementing — The MySQL INT Overflow Anatomy

It's 3 AM. Inserts are failing with 'Duplicate entry 2147483647' — but you never inserted a duplicate. Welcome to one of MySQL's most quietly famous outages. A friendly walk through why a signed INT AUTO_INCREMENT dies at 2.1 billion, why the error is misleading, a two-minute laptop repro, the 70% alert that prevents the whole thing — and the two mitigations: ALTER to BIGINT for small tables, and the atomic table-swap senior operators run on billion-row tables in the middle of an outage.

MySQL · Database · Production Incidents · 18 min read
Real Problems·Production Incidents

“We Have Exactly-Once Delivery” — The Lie Your Architecture Is Telling You

Why “our queue guarantees exactly-once” is almost always a lie once your flow crosses a component boundary: Kafka EOS scope, ack-after-process, unique constraints, SQS FIFO, DB transactions with SMTP, retries without idempotency, Redis pub/sub, replays, gRPC — and what works: transactional outbox, idempotency done right, WebSocket patterns, saga steps, and observability metrics that expose the truth.

Distributed Systems · Real Problems · Messaging & Queues · 25 min read
System Design·Tradeoffs in Production

TCP Keep-Alive vs Application Heartbeat — Three Different Things Called “Keep-Alive”

Kernel TCP keep-alive probes the network path; app heartbeats prove the peer process is serving; HTTP Connection: keep-alive only reuses sockets. Why NAT, load balancers, and firewalls still drop “idle” long-lived connections, how defaults like Linux’s multi-hour timers compare to prod WebSocket patterns, and when you really need both mechanisms.

TCP · Networking · WebSockets · 16 min read
System Design·Architecture Decisions

Caching Is Not 'Add Redis' — A Layer-by-Layer Tour from CPU to CDN

A senior-engineer walkthrough of the whole cache stack: CPU L1/L2/L3, TLB, OS page cache, DB buffer pool, in-process LRU (Caffeine), distributed cache (Redis/Memcached), HTTP cache + ETags, CDN edge, browser, DNS, reverse proxies, query plan cache. Latency budgets at every layer, when each layer is the right answer, the patterns (cache-aside, read-through, write-through, write-behind, stale-while-revalidate, single-flight), and the failure modes (thundering herd, hot keys, negative-cache stampedes) — with a decision framework for which layer to actually cache at.

Caching · System Design · Performance · 20 min read
Database·Picking the Right DB

How Redis Actually Deletes Expired Keys — The Lazy + Active Hybrid You Didn't Know About

A senior-engineer walkthrough of how Redis really handles TTLs. Why TTL=0 doesn't free memory. The two strategies — lazy (on-access) and active (the random-sample-and-25%-threshold sampler). The hz parameter, active-expire-effort, lazyfree-lazy-expire. DEL vs UNLINK and why big-key expiry used to pause the event loop. How replicas handle expiration via the replication stream. maxmemory + eviction policies as the actual safety net. The metrics from INFO that tell you when the sampler is falling behind.

Database · Redis · Caching · 15 min read
Database·Picking the Right DB

LSM Trees, the Deep Cut — MemTable, WAL, SSTables, Compaction and Why Your Writes Are So Fast

A senior-engineer walkthrough of Log-Structured Merge Trees: MemTable + WAL durability, SSTable layout with sparse index and bloom filters, the L0→Ln level hierarchy, leveled vs tiered vs FIFO compaction, the three amplifications (write/read/space), bloom filter math, and the RocksDB and Cassandra tuning knobs that actually matter. Why RocksDB, LevelDB, Cassandra, HBase, ScyllaDB, ClickHouse and TiKV all picked it — and when you should not.

Database · Storage Engine · LSM Tree · 22 min read
Database·Query Optimization

"Push It All on the Database" — The Lie That Works Until It Doesn't

Every team ships with the same lie: throw everything on the database and it'll handle correctness. Works at 50 writes/sec. At 5,000 writes/sec, every FK is a lock queue, every cascade synchronously deletes thousands, fsync becomes the clock. Why staging never catches it, ten places where the bill shows up, the cost curve of a hot FK, why cascades are dangerous, how to diagnose your bottleneck, and the enforcer+checker pattern senior teams use to move constraints off the database without losing correctness.

Database · Performance · MySQL · 18 min read
System Design·Architecture Decisions

Why Your Consistent Hashing Is Still Failing in Prod — and When Range Partitions Are the Right Tool

SVG-backed walkthrough: ring hotspots, successor overload when a node dies, and two ring generations during deploy. Virtual nodes, Redis-style key skew, range split for a fat month, and when range partitions beat a hash ring. Case studies (composite, production-shaped), a deep look at range sharding, and a decision table — with metrics and what actually worked.

Consistent Hashing · Range Partitioning · Distributed Systems · 20 min read
Database·Schema Design

UUID v4 vs UUID v7 vs Snowflake — Pick the Right Primary Key

A plain-English walkthrough: how B-tree inserts actually work, what a page split costs, why random UUID v4 keys bloat the index by 2× and blow the buffer pool. For each ID — UUID v4, Snowflake, UUID v7, AUTO_INCREMENT — what it is, why people pick it, what breaks. Real MySQL benchmark at 10M rows, Postgres differences, and a dual-write migration that works.

Database · MySQL · PostgreSQL · 14 min read
AWS·Load Balancing

Load Balancer Explained: L4 vs L7 — What the Door-Person Actually Sees

A simple, ground-up guide: L4 forwards raw TCP/UDP by IP and port, L7 opens the HTTP request and routes by URL, host, headers or cookies. Clear when-to-use-what for both basic and advanced scenarios (NLB+ALB stack, PrivateLink, gRPC, service mesh) plus a full AWS map — ALB, NLB, GWLB, CloudFront, API Gateway, Route 53, Global Accelerator — with gotchas around sticky sessions, client IP preservation, and WebSockets.

Load Balancing · AWS · NLB · 15 min read
System Design·Architecture Decisions

C10K vs C10M: Two Scaling Problems People Confuse for One

A beginner-friendly walkthrough: what “concurrent” means, what the C10K and C10M problems actually are, how the industry usually fixes them (event loops vs whole-system design), and what still goes wrong in production — blocking I/O, fd limits, reconnect storms, databases, and tail latency.

Concurrency · Networking · epoll · 20 min read
AWS·Scaling & Infra

Why Your Autoscaling Isn't Working: The Truth About Spiky Traffic

Every type of scaling explained simply — horizontal, vertical, reactive, predictive, scheduled. Why HPA and ASG break on Dream11 during an IPL final, flash sales, and push notifications. Four real case studies (Dream11 IPL, OTT launch, ticketing fan-out, ETL sawtooth) with timelines, numbers, and the exact architectures that survived. If you've ever watched your p99 explode while HPA takes its sweet time — this is for you.

Autoscaling · Kubernetes · HPA · 20 min read
Real Problems·Production Incidents

Stop-the-World: When GC Runs, Everything Freezes

When the Garbage Collector pauses, every thread freezes — zero requests, zero responses. This post explains how GC actually works: the heap, mark-and-sweep, minor vs major collections, Stop-the-World pauses, and what triggers them.

Garbage Collection · Performance · Memory Management · 16 min read
Database·Performance

Connection Pool Sizing: The Math Behind Why Your App Is Slow

Why your connection pool config is probably wrong — and how to fix it with Little's Law, Kingman's Formula, and the process-to-core ratio. Includes a real-world banking system that had 640 connections but only needed 5.

Database · Performance · Connection Pooling · 14 min read
Database·Schema Design

Stop Using DELETE in Production: Hard Delete vs Soft Delete in High-Concurrency Systems

Why DELETE is the most expensive DML operation in a relational database — and why a simple UPDATE SET deleted_at = NOW() is 4-8x faster under concurrent load. With InnoDB internals, B-tree diagrams, and benchmark numbers.

MySQL · PostgreSQL · Performance · 12 min read
Database·Query Optimization

Why Your Indexes Aren't Working: Non-SARGable Queries Explained

A deep-dive into non-SARGable SQL with B-tree diagrams, real EXPLAIN output, and the exact rewrites — told through a gaming dashboard that went from 4,200ms to 2ms without touching the schema.

MySQL · PostgreSQL · Node.js · 18 min read
Real Problems·Production Incidents

The Brutal Truth About Multi-Tenant IoT Data Filtering: Every Obvious Solution Failed

A user had access to 100,000 (1 lakh) devices across MongoDB, ClickHouse, and Elasticsearch. We tried seven approaches. All broke. The full journey: dead ends, edge cases, and what production systems actually do.

System Design · IoT · Distributed Systems · 15 min read
Database·Query Optimization

Vector Search Explained: From KNN to HNSW — A Complete Guide

How similarity search works under the hood — and why naive brute-force collapses at scale.

Search · ML · 13 min read
Database·Query Optimization

MySQL Query Optimization: Real Lessons from a 36GB Index Disaster

What happens when your index outgrows your data — and how we fixed it without downtime in production.

MySQL · Production · 5 min read
System Design·Architecture Decisions

How Twitter Generates Your Home Feed in 5 Seconds

Fan-out on write vs fan-out on read — and why Twitter uses a hybrid at 500M users. A beginner-friendly breakdown.

System Design · 8 min read
Real Problems·WebSockets & Real-time

Socket Exhaustion in High-Concurrency IoT Systems

A production case study — how a small config issue brought down an entire IoT device fleet and what fixed it.

IoT · Node.js · 4 min read
Writing — realproblem.me