The Incident: 8,000 Devices Go Dark
We were running a Node.js backend managing 12,000 MQTT-connected IoT devices — industrial sensors, environmental monitors, smart building controllers. Each device maintained a persistent WebSocket/MQTT connection to our broker.
At 11:04pm, a firmware OTA push to ~4,000 devices caused them to reboot simultaneously. They came back online within 60 seconds and tried to reconnect. What followed was a cascade that took down the entire fleet — not just the rebooted devices.
What Is a Socket and Why Does It Exhaust?
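A socket is the operating system's endpoint for a network connection. Every TCP connection, whether it carries a WebSocket, an MQTT session, or plain HTTP, occupies one open file descriptor in the process that owns it. The OS caps how many descriptors a single process can hold (ulimit -n); once the cap is reached, accepting or opening another connection fails with EMFILE.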
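To make the descriptor count concrete, here is a minimal sketch that lists /proc/self/fd from inside a Node.js process. It assumes Linux, and the helper name countOpenFds is illustrative rather than part of our code base.

```javascript
const fs = require('fs');

// Count file descriptors held by a process by listing /proc/<pid>/fd.
// Linux-only: returns null where /proc is unavailable (e.g. macOS).
// The listing briefly opens the directory itself, so the absolute number
// can be one too high; differences between two calls are still meaningful.
function countOpenFds(pid = 'self') {
  try {
    return fs.readdirSync(`/proc/${pid}/fd`).length;
  } catch (err) {
    return null;
  }
}

const before = countOpenFds();
// Any open file, pipe, or socket consumes exactly one descriptor.
const fd = fs.openSync(process.execPath, 'r');
const during = countOpenFds();
fs.closeSync(fd);
const after = countOpenFds();

console.log({ before, during, after });
```

A socket behaves the same way as the file opened here: the descriptor is only returned to the process once the socket is closed or destroyed.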
The Root Cause: Zombie Sockets
Our MQTT client pool created a new connection object on every device reconnect but never destroyed the old one. The old TCP socket lingered, often in CLOSE_WAIT, and kept its file descriptor open, sometimes for minutes. (A socket in TIME_WAIT has already released its descriptor, but the kernel still tracks it, so it still shows up in the socket counts.)
// THE PROBLEMATIC CODE
// When a device reconnected, we'd create a new MQTT client
// without explicitly ending the old one first
const mqtt = require('mqtt');

const clients = new Map(); // deviceId → MqttClient

function handleDeviceReconnect(deviceId) {
  // BUG: old client is just overwritten, not destroyed
  const client = mqtt.connect(BROKER_URL, {
    clientId: deviceId,
    clean: false, // persistent session
    // Missing: keepalive, reconnectPeriod config
  });
  clients.set(deviceId, client); // old client leaked!

  client.on('connect', () => {
    console.log(`${deviceId} connected`);
  });
}
Diagnosing Socket Exhaustion in Production
# Find the Node.js process
PID=$(pgrep -f "node server.js")
# Count all open file descriptors (files, pipes, sockets)
ls /proc/$PID/fd | wc -l
# Count only the sockets
ls -l /proc/$PID/fd | grep -c 'socket:'
# See socket states (system-wide, IPv4 only; NR>1 skips the header row)
awk 'NR>1 {print $4}' /proc/net/tcp | sort | uniq -c | sort -rn
# 01 = ESTABLISHED, 06 = TIME_WAIT, 08 = CLOSE_WAIT
# Check the FD limit in effect for the process
grep "open files" /proc/$PID/limits
# Real-time FD count (watch it climb)
watch -n 1 "ls /proc/$PID/fd | wc -l"
# Socket summary by state
ss -s
# Output:
# Total: 18432
# TCP: 16408 (estab 8192, closed 4216, orphaned 0, timewait 4000)
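The same state breakdown can be produced inside the service, which is handy when shell access to production is restricted. The sketch below parses text in the /proc/net/tcp format and maps the hex codes in the st column to state names. The function name countTcpStates and the sample rows are made up for illustration; as with the awk pipeline, the real file is system-wide and covers IPv4 only (IPv6 sockets are listed in /proc/net/tcp6).

```javascript
// Kernel TCP state codes as they appear in the "st" column of /proc/net/tcp.
const TCP_STATES = {
  '01': 'ESTABLISHED', '02': 'SYN_SENT', '03': 'SYN_RECV', '04': 'FIN_WAIT1',
  '05': 'FIN_WAIT2', '06': 'TIME_WAIT', '07': 'CLOSE', '08': 'CLOSE_WAIT',
  '09': 'LAST_ACK', '0A': 'LISTEN', '0B': 'CLOSING',
};

// Count sockets per TCP state from /proc/net/tcp-formatted text.
function countTcpStates(procNetTcpText) {
  const counts = {};
  const rows = procNetTcpText.trim().split('\n').slice(1); // drop the header row
  for (const row of rows) {
    const code = row.trim().split(/\s+/)[3]; // 4th column, same as awk's $4
    if (!code) continue;
    const name = TCP_STATES[code.toUpperCase()] || `UNKNOWN(${code})`;
    counts[name] = (counts[name] || 0) + 1;
  }
  return counts;
}

// Illustrative sample; in production read fs.readFileSync('/proc/net/tcp', 'utf8').
const sampleProcNetTcp = [
  '  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode',
  '   0: 0100007F:075B 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 20001 1 0000000000000000 100 0 0 10 0',
  '   1: 0100007F:C1A2 0100007F:075B 01 00000000:00000000 00:00000000 00000000  1000        0 20002 1 0000000000000000 20 4 30 10 -1',
  '   2: 0100007F:C1A4 0100007F:075B 08 00000000:00000000 00:00000000 00000000  1000        0 20003 1 0000000000000000 20 4 30 10 -1',
  '   3: 0100007F:C1A6 0100007F:075B 08 00000000:00000000 00:00000000 00000000  1000        0 20004 1 0000000000000000 20 4 30 10 -1',
].join('\n');

console.log(countTcpStates(sampleProcNetTcp));
// { LISTEN: 1, ESTABLISHED: 1, CLOSE_WAIT: 2 }
```

A steadily growing CLOSE_WAIT count is the signal to look for: it means the peer has closed the connection and the application has not yet closed its side.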
The Fix: Proper Connection Lifecycle Management
// FIXED VERSION
const mqtt = require('mqtt');

const clients = new Map(); // deviceId → MqttClient

function handleDeviceReconnect(deviceId) {
  // Step 1: Properly destroy the old connection first
  const existing = clients.get(deviceId);
  if (existing) {
    existing.end(true); // force=true: destroy socket immediately
    clients.delete(deviceId);
  }

  // Step 2: Create new connection with proper keepalive config
  const client = mqtt.connect(BROKER_URL, {
    clientId: deviceId,
    clean: false,
    keepalive: 60, // send PINGREQ every 60s so stale links are detected
    connectTimeout: 10000, // fail fast if broker unreachable
    reconnectPeriod: 0, // disable auto-reconnect (we manage it)
  });
  clients.set(deviceId, client);

  client.on('connect', () => {
    console.log(`${deviceId} connected`);
    // TCP-level keepalive too: OS probes after 30s idle.
    // Set on the underlying stream (a net.Socket for mqtt:// URLs) because,
    // as far as we can tell, mqtt.js does not read a socketOptions setting.
    if (client.stream && typeof client.stream.setKeepAlive === 'function') {
      client.stream.setKeepAlive(true, 30000);
    }
  });

  client.on('error', (err) => {
    console.error(`${deviceId} error: ${err.message}`);
    client.end(true);
    // Only evict the entry if it still points at this client
    if (clients.get(deviceId) === client) {
      clients.delete(deviceId);
    }
  });

  // Critical: clean up on any close
  client.on('close', () => {
    if (clients.get(deviceId) === client) {
      clients.delete(deviceId);
    }
  });
}
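The invariant behind this fix is that each device ID maps to at most one live client, and that a late close event from a replaced client never evicts its successor. A minimal sketch of how that can be checked without a broker follows, with the client factory injected. The names replaceClient and makeStubClient are hypothetical, and the stub only mimics the end(force) method and the events used here, not the real MqttClient.

```javascript
const { EventEmitter } = require('events');

// Swap in a new client for deviceId, force-closing any previous one first.
// createClient is injected so a test can pass a stub instead of mqtt.connect.
function replaceClient(clients, deviceId, createClient) {
  const existing = clients.get(deviceId);
  if (existing) {
    existing.end(true);
    clients.delete(deviceId);
  }
  const client = createClient(deviceId);
  clients.set(deviceId, client);
  client.on('close', () => {
    // Only evict the entry if it still points at this client.
    if (clients.get(deviceId) === client) clients.delete(deviceId);
  });
  return client;
}

// Stub with just enough surface for the sketch.
function makeStubClient() {
  const stub = new EventEmitter();
  stub.ended = false;
  stub.end = () => { stub.ended = true; };
  return stub;
}

const clients = new Map();
const first = replaceClient(clients, 'sensor-1', makeStubClient);
const second = replaceClient(clients, 'sensor-1', makeStubClient);

// Simulate the replaced client's 'close' arriving after the new one is registered.
first.emit('close');

console.log(first.ended, clients.size, clients.get('sensor-1') === second);
// true 1 true
```

Without the identity check in the close handler, the late event from the first client would delete the second client's entry and the leak would return in a different form.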
Fix 2: Raise the File Descriptor Limit
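Even with proper cleanup, a 12,000-device system needs a higher FD limit as headroom. The 8,192 default we were running with is below the device count itself, and many Linux distributions ship an even lower soft limit of 1,024.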
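# /etc/security/limits.conf: permanent change for login sessions
# Add these lines (the * wildcard does not cover root, hence the explicit entries):
* soft nofile 65536
* hard nofile 65536
root soft nofile 65536
root hard nofile 65536
# /etc/sysctl.conf: system-wide ceiling on open files
fs.file-max = 2097152
# Apply the sysctl change immediately (without reboot):
sysctl -p
# limits.conf takes effect at the next login and is not applied to systemd services.
# For systemd services (Node.js as a service), set the limit in the unit file:
# /etc/systemd/system/iot-backend.service
[Service]
LimitNOFILE=65536
# Then reload systemd and restart the service:
systemctl daemon-reload && systemctl restart iot-backend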
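Because a limit set in the wrong place is silently ignored, it helps to verify at startup what the process actually received. The sketch below parses /proc/self/limits and warns when the soft limit is under three times the expected peak, the rule of thumb used in this post. It assumes Linux; parseMaxOpenFiles, hasFdHeadroom, and EXPECTED_PEAK are illustrative names, not existing code.

```javascript
const fs = require('fs');

// Extract the soft and hard "Max open files" values from /proc/<pid>/limits text.
function parseMaxOpenFiles(limitsText) {
  const line = limitsText.split('\n').find((l) => l.startsWith('Max open files'));
  if (!line) return null;
  const [soft, hard] = line.slice('Max open files'.length).trim().split(/\s+/);
  const toNumber = (value) => (value === 'unlimited' ? Infinity : Number(value));
  return { soft: toNumber(soft), hard: toNumber(hard) };
}

// True when the soft limit covers expectedPeak multiplied by the safety factor.
function hasFdHeadroom(softLimit, expectedPeak, factor = 3) {
  return softLimit >= expectedPeak * factor;
}

// Startup check. EXPECTED_PEAK is an assumed figure for a 12,000-device fleet.
const EXPECTED_PEAK = 12000;
try {
  const limits = parseMaxOpenFiles(fs.readFileSync('/proc/self/limits', 'utf8'));
  if (limits && !hasFdHeadroom(limits.soft, EXPECTED_PEAK)) {
    console.warn(`nofile soft limit ${limits.soft} is below 3x the expected peak of ${EXPECTED_PEAK}`);
  }
} catch (err) {
  console.warn(`could not read /proc/self/limits: ${err.message}`);
}
```

With the 65536 setting above the check passes (65536 ≥ 36000); with the old 8,192 limit it would have warned on every start.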
Fix 3: Connection Pooling with Backpressure
Reconnect storms (like our OTA reboot) are dangerous without rate limiting. When 4,000 devices reconnect simultaneously, don't accept all of them at once.
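// Reconnect queue with controlled concurrency
const Bottleneck = require('bottleneck');

const reconnectLimiter = new Bottleneck({
  maxConcurrent: 100, // max 100 simultaneous reconnect jobs
  minTime: 10, // min 10ms between reconnects
  reservoir: 200, // max 200 reconnects per refresh window
  reservoirRefreshAmount: 200,
  reservoirRefreshInterval: 10 * 1000, // refill every 10s
});

// Wrap the reconnect handler with the limiter
// (the job callback is async so schedule() always receives a promise)
async function queueDeviceReconnect(deviceId) {
  return reconnectLimiter.schedule(async () =>
    handleDeviceReconnect(deviceId)
  );
}

// Now reconnect storms are absorbed gracefully.
// The reservoir admits 200 reconnects per 10s, so 4,000 reconnects
// are spread over roughly 200 seconds instead of hammering all at once.
// (minTime alone would allow 100 per second, i.e. about 40 seconds.)
// Note: handleDeviceReconnect returns before the MQTT handshake completes,
// so pacing comes from minTime and the reservoir rather than maxConcurrent.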
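When tuning these numbers it is worth checking which setting is actually the binding constraint. The sketch below is a rough steady-state estimate only: it ignores maxConcurrent and the initially full reservoir, and estimateDrainSeconds is a helper written for this illustration, not part of Bottleneck.

```javascript
// Rough estimate of how long a rate limiter takes to admit `count` jobs.
// The effective rate is the slower of the minTime spacing and the reservoir refill.
function estimateDrainSeconds({ count, minTimeMs, reservoir, refreshIntervalMs }) {
  const perSecondByMinTime = minTimeMs > 0 ? 1000 / minTimeMs : Infinity;
  const perSecondByReservoir =
    reservoir != null ? reservoir / (refreshIntervalMs / 1000) : Infinity;
  const rate = Math.min(perSecondByMinTime, perSecondByReservoir);
  return count / rate;
}

// Settings from the limiter above: the reservoir (20 per second) is the bottleneck.
console.log(estimateDrainSeconds({
  count: 4000, minTimeMs: 10, reservoir: 200, refreshIntervalMs: 10000,
})); // 200

// Same storm with no reservoir: minTime alone allows 100 per second.
console.log(estimateDrainSeconds({
  count: 4000, minTimeMs: 10, reservoir: null, refreshIntervalMs: 10000,
})); // 40
```

If a 200-second recovery window is too long for the fleet, raising the reservoir is the lever to pull; lowering minTime has no effect until the reservoir stops being the limit.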
Fix 4: Monitoring — Catch It Before It Cascades
// Add to your Node.js server: export socket metrics
const { register, Gauge } = require('prom-client');

const socketGauge = new Gauge({
  name: 'iot_open_sockets_total',
  help: 'Number of open MQTT socket connections',
});
const zombieGauge = new Gauge({
  name: 'iot_zombie_sockets_total',
  help: 'Connections in Map but not CONNECTED state',
});

// Update every 30 seconds
setInterval(() => {
  socketGauge.set(clients.size);
  let zombies = 0;
  for (const [, client] of clients) {
    if (!client.connected) zombies++;
  }
  zombieGauge.set(zombies);
}, 30_000);

// Alert rule (Prometheus/Grafana):
// ALERT SocketExhaustionRisk
// IF iot_open_sockets_total > 50000
// OR iot_zombie_sockets_total > 500
Quick Reference: Socket Exhaustion Checklist
| Check | Command | Red flag |
|---|---|---|
| FD limit | ulimit -n | ≤ 8,192 with 1k+ connections |
| FD usage | ls /proc/$PID/fd \| wc -l | > 80% of limit |
| CLOSE_WAIT count | ss -H -tan state close-wait \| wc -l | > 100 |
| Zombie connections | App metrics | Connections in map but not CONNECTED |
| Reconnect storm | Connection rate logs | > 500 new connections in 5 seconds |

| Fix | Impact | When to apply |
|---|---|---|
| Call client.end(true) before replacing | Eliminates CLOSE_WAIT buildup | Always: this is the root fix |
| Set keepalive on MQTT/WS clients | Broker detects stale connections | Any persistent connection |
| Raise ulimit -n to 65536+ | Headroom for spikes | Any system with 1k+ connections |
| Rate-limit reconnects (Bottleneck, etc.) | Absorbs reconnect storms | When OTA/mass reboots are possible |
| Export socket metrics to Prometheus | Early warning before cascade | Production systems always |
The Lesson
Socket exhaustion is silent until it isn't. EMFILE errors don't appear in business metrics; they appear in system logs that no one watches. The fix is three things working together: proper socket lifecycle management (always destroy before replace), a realistic FD limit (3× your peak), and metrics that show you the trend before the cascade. The five-line client.end(true) fix took two minutes to add. The incident it would have prevented took four engineers four hours to recover from.