— Writing —
Every process you run will be told to stop — the only question is whether it gets a request it can act on or a bullet it never sees coming. A first-principles walk through graceful shutdown via Unix signals: why SIGTERM is catchable and SIGKILL is not, the five-stage drain (catch → stop intake → drain → flush → exit 0) every well-behaved server runs, PostgreSQL's three shutdown modes mapped to three signals, Redis saving its RDB on SIGTERM (and losing everything on SIGKILL / the OOM killer), connection draining in web servers, how Kubernetes wires it together with terminationGracePeriodSeconds and a closing SIGKILL, the minimal correct handler in Go and Node, and the five pitfalls — PID 1 with no signal disposition, work inside the handler, unbounded drains, the endpoint-removal race, and a grace period shorter than your real drain.
When you press Enter in Slack, the message does not travel over the WebSocket your client is holding open — it goes out as a separate HTTPS POST to chat.postMessage. A first-principles walkthrough of why: what WebSocket.send() actually guarantees (almost nothing), the RFC 6455 gap where there is no application-layer ack, the four ways a WebSocket send dies silently, what HTTP gives you for free (status codes, idempotency keys, L7 load balancing, per-request timeouts), the RTM → Events API migration that made the same rule public for app developers, where Socket Mode fits in, and the “WebSocket as hint, REST as truth” pattern that every serious real-time product converges on.
It's 3 AM. Inserts are failing with 'Duplicate entry 2147483647' — but you never inserted a duplicate. Welcome to one of MySQL's most quietly famous outages. A friendly walk through why a signed INT AUTO_INCREMENT dies at 2.1 billion, why the error is misleading, a two-minute laptop repro, the 70% alert that prevents the whole thing — and the two mitigations: ALTER to BIGINT for small tables, and the atomic table-swap senior operators run on billion-row tables in the middle of an outage.
Why “our queue guarantees exactly-once” is almost always a lie once your flow crosses a component boundary: Kafka EOS scope, ack-after-process, unique constraints, SQS FIFO, DB transactions with SMTP, retries without idempotency, Redis pub/sub, replays, gRPC — and what works: transactional outbox, idempotency done right, WebSocket patterns, saga steps, and observability metrics that expose the truth.
Kernel TCP keep-alive probes the network path; app heartbeats prove the peer process is serving; HTTP Connection: keep-alive only reuses sockets. Why NAT, load balancers, and firewalls still drop “idle” long-lived connections, how defaults like Linux’s multi-hour timers compare to prod WebSocket patterns, and when you really need both mechanisms.
A senior-engineer walkthrough of the whole cache stack: CPU L1/L2/L3, TLB, OS page cache, DB buffer pool, in-process LRU (Caffeine), distributed cache (Redis/Memcached), HTTP cache + ETags, CDN edge, browser, DNS, reverse proxies, query plan cache. Latency budgets at every layer, when each layer is the right answer, the patterns (cache-aside, read-through, write-through, write-behind, stale-while-revalidate, single-flight), and the failure modes (thundering herd, hot keys, negative-cache stampedes) — with a decision framework for which layer to actually cache at.
A senior-engineer walkthrough of how Redis really handles TTLs. Why TTL=0 doesn't free memory. The two strategies — lazy (on-access) and active (the random-sample-and-25%-threshold sampler). The hz parameter, active-expire-effort, lazyfree-lazy-expire. DEL vs UNLINK and why big-key expiry used to pause the event loop. How replicas handle expiration via the replication stream. maxmemory + eviction policies as the actual safety net. The metrics from INFO that tell you when the sampler is falling behind.
A senior-engineer walkthrough of Log-Structured Merge Trees: MemTable + WAL durability, SSTable layout with sparse index and bloom filters, the L0→Ln level hierarchy, leveled vs tiered vs FIFO compaction, the three amplifications (write/read/space), bloom filter math, and the RocksDB and Cassandra tuning knobs that actually matter. Why RocksDB, LevelDB, Cassandra, HBase, ScyllaDB, ClickHouse and TiKV all picked it — and when you should not.
Every team ships with the same lie: throw everything at the database and it'll handle correctness. Works at 50 writes/sec. At 5,000 writes/sec, every FK is a lock queue, every cascade synchronously deletes thousands of rows, fsync becomes the clock. Why staging never catches it, ten places where the bill shows up, the cost curve of a hot FK, why cascades are dangerous, how to diagnose your bottleneck, and the enforcer+checker pattern senior teams use to move constraints off the database without losing correctness.
SVG-backed walkthrough: ring hotspots, successor overload when a node dies, and two ring generations during deploy. Virtual nodes, Redis-style key skew, range split for a fat month, and when range partitions beat a hash ring. Case studies (composite, production-shaped), a deep look at range sharding, and a decision table — with metrics and what actually worked.
A plain-English walkthrough: how B-tree inserts actually work, what a page split costs, why random UUID v4 keys bloat the index by 2× and blow the buffer pool. For each ID — UUID v4, Snowflake, UUID v7, AUTO_INCREMENT — what it is, why people pick it, what breaks. Real MySQL benchmark at 10M rows, Postgres differences, and a dual-write migration that works.
A simple, ground-up guide: L4 forwards raw TCP/UDP by IP and port, L7 opens the HTTP request and routes by URL, host, headers or cookies. Clear when-to-use-what for both basic and advanced scenarios (NLB+ALB stack, PrivateLink, gRPC, service mesh) plus a full AWS map — ALB, NLB, GWLB, CloudFront, API Gateway, Route 53, Global Accelerator — with gotchas around sticky sessions, client IP preservation, and WebSockets.
A beginner-friendly walkthrough: what “concurrent” means, what the C10K and C10M problems actually are, how the industry usually fixes them (event loops vs whole-system design), and what still goes wrong in production — blocking I/O, fd limits, reconnect storms, databases, and tail latency.
Every type of scaling explained simply: horizontal, vertical, reactive, predictive, scheduled. Why HPA and ASG break under sudden spikes such as Dream11 during an IPL final, flash sales, and push notifications. Four real case studies (Dream11 IPL, OTT launch, ticketing fan-out, ETL sawtooth) with timelines, numbers, and the exact architectures that survived. If you've ever watched your p99 explode while HPA takes its sweet time, this is for you.
When a stop-the-world garbage collection runs, every application thread freezes: zero requests, zero responses. This post explains how GC actually works: the heap, mark-and-sweep, minor vs major collections, stop-the-world pauses, and what triggers them.
Why your connection pool config is probably wrong — and how to fix it with Little's Law, Kingman's Formula, and the process-to-core ratio. Includes a real-world banking system that had 640 connections but only needed 5.
Why DELETE is the most expensive DML operation in a relational database, and why a simple UPDATE ... SET deleted_at = NOW() is 4-8x faster under concurrent load. With InnoDB internals, B-tree diagrams, and benchmark numbers.
A deep-dive into non-SARGable SQL with B-tree diagrams, real EXPLAIN output, and the exact rewrites — told through a gaming dashboard that went from 4,200ms to 2ms without touching the schema.
A user had access to 100,000 (1 lakh) devices across MongoDB, ClickHouse, and Elasticsearch. We tried seven approaches. All broke. The full journey: dead ends, edge cases, and what production systems actually do.
How similarity search works under the hood — and why naive brute-force collapses at scale.
What happens when your index outgrows your data — and how we fixed it without downtime in production.
Fan-out on write vs fan-out on read — and why Twitter uses a hybrid at 500M users. A beginner-friendly breakdown.
A production case study — how a small config issue brought down an entire IoT device fleet and what fixed it.