— Writing —

Real Problems · 6 articles
Real Problems · WebSockets & Real-time · 20 min read

Slack Sends Messages Over HTTP, Not Its WebSocket — and the Durability Bug That Forced the Switch

When you press Enter in Slack, the message does not travel over the WebSocket your client is holding open — it goes out as a separate HTTPS POST to chat.postMessage. A first-principles walkthrough of why: what WebSocket.send() actually guarantees (almost nothing), the RFC 6455 gap where there is no application-layer ack, the four ways a WebSocket send dies silently, what HTTP gives you for free (status codes, idempotency keys, L7 load balancing, per-request timeouts), the RTM → Events API migration that made the same rule public for app developers, where Socket Mode fits in, and the “WebSocket as hint, REST as truth” pattern that every serious real-time product converges on.

WebSockets · Real-time · HTTP · Slack
Real Problems·Production Incidents

When AUTO_INCREMENT Stops Incrementing — The MySQL INT Overflow Anatomy

It's 3 AM. Inserts are failing with 'Duplicate entry 2147483647' — but you never inserted a duplicate. Welcome to one of MySQL's most quietly famous outages. A friendly walk through why a signed INT AUTO_INCREMENT dies at 2.1 billion, why the error is misleading, a two-minute laptop repro, the 70% alert that prevents the whole thing — and the two mitigations: ALTER to BIGINT for small tables, and the atomic table-swap senior operators run on billion-row tables in the middle of an outage.

MySQL · Database · Production Incidents · 18 min read
Real Problems·Production Incidents

“We Have Exactly-Once Delivery” — The Lie Your Architecture Is Telling You

Why “our queue guarantees exactly-once” is almost always a lie once your flow crosses a component boundary: Kafka EOS scope, ack-after-process, unique constraints, SQS FIFO, DB transactions with SMTP, retries without idempotency, Redis pub/sub, replays, gRPC — and what works: transactional outbox, idempotency done right, WebSocket patterns, saga steps, and observability metrics that expose the truth.

Distributed Systems · Real Problems · Messaging & Queues · 25 min read
Real Problems·Production Incidents

Stop-the-World: When GC Runs, Everything Freezes

During a stop-the-world garbage collection pause, every application thread freezes: zero requests, zero responses. This post explains how GC actually works: the heap, mark-and-sweep, minor vs major collections, Stop-the-World pauses, and what triggers them.

Garbage Collection · Performance · Memory Management · 16 min read
Real Problems·Production Incidents

The Brutal Truth About Multi-Tenant IoT Data Filtering: Every Obvious Solution Failed

A user had access to 1 lakh (100,000) devices across MongoDB, ClickHouse, and Elasticsearch. We tried seven approaches. All broke. The full journey: dead ends, edge cases, and what production systems actually do.

System Design · IoT · Distributed Systems · 15 min read
Real Problems·WebSockets & Real-time

Socket Exhaustion in High-Concurrency IoT Systems

A production case study — how a small config issue brought down an entire IoT device fleet and what fixed it.

IoT · Node.js · 4 min read
Writing — realproblem.me