Socket Exhaustion: How 6,000 Zombie Sockets Killed 8,000 Devices

[Figure: file descriptor usage of the Node.js server. About 6,000 zombie sockets (never closed) plus new reconnects after the OTA reboot fill all 8,192 descriptors, and further connections fail with EMFILE. Key numbers: 12,000 devices, 8,192 fd limit, 90 seconds to cascade.]

The Incident: 8,000 Devices Go Dark

We were running a Node.js backend managing 12,000 MQTT-connected IoT devices — industrial sensors, environmental monitors, smart building controllers. Each device maintained a persistent WebSocket/MQTT connection to our broker.

At 11:04pm, a firmware OTA push to ~4,000 devices caused them to reboot simultaneously. They came back online within 60 seconds and tried to reconnect. What followed was a cascade that took down the entire fleet — not just the rebooted devices.

The Timeline

11:04pm     OTA push triggers reboot on 4,000 devices
11:05pm     Devices come back online, start reconnecting
11:05:30    8,192 file descriptors open, hitting the ulimit
11:05:45    EMFILE: too many open files
11:06pm     MQTT broker stops accepting connections
11:06:30    Existing devices time out, reconnect cascade begins
11:07pm     8,000+ devices offline. Cascade complete.

Root cause: ~6,000 zombie sockets from previous connections were never properly closed. 4,000 new reconnects pushed the process over its ulimit.

What Is a Socket and Why Does It Exhaust?

Every TCP connection, whether it's a WebSocket, an MQTT session, or plain HTTP, requires an open file descriptor on the operating system. The OS caps how many descriptors a single process can hold at once.

File Descriptors: The OS Limit

On Linux, each process has two limits:

Soft limit (ulimit -n): 1,024 or 8,192
Hard limit: 65,536 or higher

Hit the soft limit and you get an EMFILE error. No new sockets can be created.

What consumes file descriptors:

Item                   FDs consumed
TCP socket (open)      1
TCP socket (zombie)    1 (still counts!)
File handle            1
stdin/stdout/stderr    3 (per process)
Node.js internals      ~20

12,000 device sockets + zombies + internals easily hits the 8,192 limit.
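To see the one-descriptor-per-handle rule in practice, a process can count its own entries under /proc/self/fd. The sketch below is Linux-only, and the helper names (countOpenFds, measureOneHandle) are ours, not part of Node.js. It opens /dev/null rather than a socket to stay self-contained; a TCP socket is accounted for the same way.

```javascript
const fs = require('fs');

// Hypothetical helper: number of file descriptors this process holds.
// Linux-only; returns null if /proc/self/fd cannot be read.
function countOpenFds() {
  try {
    return fs.readdirSync('/proc/self/fd').length;
  } catch {
    return null;
  }
}

// Open one handle, count, close it, count again.
function measureOneHandle() {
  const before = countOpenFds();
  if (before === null) return null;         // not on Linux, nothing to measure
  const fd = fs.openSync('/dev/null', 'r'); // any open handle costs one descriptor
  const during = countOpenFds();
  fs.closeSync(fd);                         // closing gives it back
  const after = countOpenFds();
  return { before, during, after };
}

console.log(measureOneHandle());
```

On a Linux host, `during` should be exactly one higher than `before`, and `after` should match `before` again. A handle that is never closed keeps that extra descriptor for the life of the process.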

The Root Cause: Zombie Sockets

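Our MQTT client pool was creating a new connection object for each device reconnect but not properly destroying the old one. When the device side went away, the old TCP socket dropped into CLOSE_WAIT and kept its file descriptor open. Unlike TIME_WAIT, which the kernel cleans up on its own after the application has already closed the socket, CLOSE_WAIT lasts until the application closes it. Because we never did, those descriptors stayed allocated.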

// THE PROBLEMATIC CODE
// When a device reconnected, we'd create a new MQTT client
// without explicitly ending the old one first

const mqtt = require('mqtt');

const clients = new Map(); // deviceId → MqttClient

function handleDeviceReconnect(deviceId) {
  // BUG: old client is just overwritten, not destroyed
  const client = mqtt.connect(BROKER_URL, {
    clientId: deviceId,
    clean: false,  // persistent session
    // Missing: keepAlive, reconnectPeriod config
  });

  clients.set(deviceId, client); // old client leaked!

  client.on('connect', () => {
    console.log(`${deviceId} connected`);
  });
}

The Socket Leak: What Actually Happened

Before OTA: 12,000 healthy connections
  Device A -> Socket A (ESTABLISHED)
  Device B -> Socket B (ESTABLISHED)
  ...12,000 sockets...

OTA reboot: devices disconnect
  Socket A -> CLOSE_WAIT (waiting for app to close)
  Socket B -> CLOSE_WAIT
  ...4,000 in CLOSE_WAIT...

Devices reconnect: NEW sockets created
  Socket A (old)  -> CLOSE_WAIT, FD still held!
  Socket A' (new) -> ESTABLISHED
  Socket B (old)  -> CLOSE_WAIT, FD still held!
  Socket B' (new) -> ESTABLISHED
  ...doubled sockets...

Result: 8,000 (4K old + 4K new) + 8,000 existing = 16,000 FDs needed. Limit: 8,192. EMFILE cascade begins.
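The leak can be reproduced without a broker. The sketch below uses a stand-in class (FakeClient, our invention) that holds one pretend descriptor until end() is called. It compares overwriting a Map entry with ending the old client first. It is a model of the bookkeeping only, not of real TCP behavior.

```javascript
// Stand-in for an MQTT client: holds one "descriptor" until end() is called.
class FakeClient {
  static openCount = 0;
  constructor() {
    this.open = true;
    FakeClient.openCount++;
  }
  end() {
    if (this.open) {
      this.open = false;
      FakeClient.openCount--;
    }
  }
}

// Mirrors the buggy handler: the old client is overwritten, never ended.
function reconnectLeaky(clients, id) {
  clients.set(id, new FakeClient());
}

// Mirrors the fix: release the old handle before creating a new one.
function reconnectSafe(clients, id) {
  const existing = clients.get(id);
  if (existing) existing.end();
  clients.set(id, new FakeClient());
}

function simulate(reconnect, devices, rebooted) {
  FakeClient.openCount = 0;
  const clients = new Map();
  for (let i = 0; i < devices; i++) reconnect(clients, i);  // initial connects
  for (let i = 0; i < rebooted; i++) reconnect(clients, i); // OTA reconnects
  return { tracked: clients.size, open: FakeClient.openCount };
}

console.log(simulate(reconnectLeaky, 12000, 4000)); // { tracked: 12000, open: 16000 }
console.log(simulate(reconnectSafe, 12000, 4000));  // { tracked: 12000, open: 12000 }
```

In the leaky run the Map still reports 12,000 entries while 16,000 handles are open, which is why the leak is invisible to anything that only inspects the Map.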

Diagnosing Socket Exhaustion in Production

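# Find the Node.js process (-o picks the oldest match if there are several)
PID=$(pgrep -of "node server.js")

# Count all open file descriptors (sockets, files, pipes)
ls /proc/$PID/fd | wc -l

# Count only the sockets
ls -l /proc/$PID/fd | grep -c 'socket:'

# See socket states (system-wide; FNR > 1 skips each file's header row)
awk 'FNR > 1 {print $4}' /proc/net/tcp /proc/net/tcp6 | sort | uniq -c | sort -rn
# 01 = ESTABLISHED, 06 = TIME_WAIT, 08 = CLOSE_WAIT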

# Check current ulimit for the process
cat /proc/$PID/limits | grep "open files"

# Real-time socket count (watch it climb)
watch -n 1 "ls /proc/$PID/fd | wc -l"

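# Count sockets stuck in CLOSE_WAIT (-H drops the header row)
ss -Htan state close-wait | wc -l

# Summary of socket counts (ss is the replacement for netstat)
ss -s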
# Output:
# Total: 18432
# TCP:   16408 (estab 8192, closed 4216, orphaned 0, timewait 4000)

What to Look For

Healthy System (active connections, normal close)
  ESTABLISHED: ~12,000
  TIME_WAIT:   < 500
  CLOSE_WAIT:  < 50

Sick System (socket leak)
  ESTABLISHED: ~8,000
  TIME_WAIT:   ~2,000
  CLOSE_WAIT:  ~6,000   RED FLAG: app didn't call close()

CLOSE_WAIT = remote side sent FIN, but your app hasn't called socket.close() yet. These pile up when connections aren't properly cleaned up.
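The same state breakdown can be produced from inside Node.js by parsing /proc/net/tcp, which is handy when you want the numbers in logs rather than in a terminal. This is a minimal sketch: countTcpStates is a hypothetical helper name, the file is Linux-only, it covers IPv4 only (use /proc/net/tcp6 for IPv6), and the counts are system-wide rather than per process.

```javascript
const fs = require('fs');

// Hex state codes used in the "st" column of /proc/net/tcp.
const TCP_STATES = {
  '01': 'ESTABLISHED', '02': 'SYN_SENT', '03': 'SYN_RECV', '04': 'FIN_WAIT1',
  '05': 'FIN_WAIT2', '06': 'TIME_WAIT', '07': 'CLOSE', '08': 'CLOSE_WAIT',
  '09': 'LAST_ACK', '0A': 'LISTEN', '0B': 'CLOSING',
};

// Count sockets per TCP state from the text of /proc/net/tcp.
function countTcpStates(procNetTcpText) {
  const counts = {};
  const rows = procNetTcpText.trim().split('\n').slice(1); // drop header row
  for (const row of rows) {
    const code = row.trim().split(/\s+/)[3]; // 4th column is the state
    if (!code) continue;
    const name = TCP_STATES[code.toUpperCase()] || `UNKNOWN_${code}`;
    counts[name] = (counts[name] || 0) + 1;
  }
  return counts;
}

// Only runs where the file exists (Linux).
if (fs.existsSync('/proc/net/tcp')) {
  console.log(countTcpStates(fs.readFileSync('/proc/net/tcp', 'utf8')));
}
```

A CLOSE_WAIT count that keeps climbing between two runs is the signal to look for; a single snapshot says less than the trend.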

The Fix: Proper Connection Lifecycle Management

// FIXED VERSION
const mqtt = require('mqtt');

const clients = new Map(); // deviceId → MqttClient

function handleDeviceReconnect(deviceId) {
  // Step 1: Properly destroy the old connection first
  const existing = clients.get(deviceId);
  if (existing) {
    existing.end(true); // force=true: destroy socket immediately
    clients.delete(deviceId);
  }

  // Step 2: Create new connection with proper keepAlive config
  const client = mqtt.connect(BROKER_URL, {
    clientId: deviceId,
    clean: false,
    keepalive: 60,          // send PING every 60s — broker detects stale
    connectTimeout: 10000,  // fail fast if broker unreachable
    reconnectPeriod: 0,     // disable auto-reconnect (we manage it)
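  });

  clients.set(deviceId, client);

  client.on('connect', () => {
    console.log(`${deviceId} connected`);
    // TCP-level keepalive: start probing after 30s idle. MQTT.js has no
    // documented socket-options setting that we know of, so set it on the
    // underlying stream, which is a net.Socket for plain mqtt:// connections.
    if (typeof client.stream?.setKeepAlive === 'function') {
      client.stream.setKeepAlive(true, 30000);
    }
  });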

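  client.on('error', (err) => {
    console.error(`${deviceId} error: ${err.message}`);
    client.end(true);
    // Only remove the entry if it still points at this client;
    // a newer client for the same device may have replaced it
    if (clients.get(deviceId) === client) {
      clients.delete(deviceId);
    }
  });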

  // Critical: clean up on any close
  client.on('close', () => {
    if (clients.get(deviceId) === client) {
      clients.delete(deviceId);
    }
  });
}

Fix 2: Raise the File Descriptor Limit

Even with proper cleanup, a 12,000-device system needs a higher FD limit as headroom. The 8,192 limit we were running with is too low.

# /etc/security/limits.conf — permanent change
# Add these lines:
*    soft    nofile    65536
*    hard    nofile    65536
root soft    nofile    65536
root hard    nofile    65536

# /etc/sysctl.conf — kernel-level limit
fs.file-max = 2097152

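# Apply the sysctl change immediately (without reboot):
sysctl -p
# limits.conf changes only take effect for new login sessions

# systemd services do not read limits.conf.
# For Node.js running as a service, set the limit in the unit file:
# /etc/systemd/system/iot-backend.service
[Service]
LimitNOFILE=65536

# Then reload systemd and restart the service:
systemctl daemon-reload
systemctl restart iot-backend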

FD Limit Sizing Guide

Required FDs = (active connections x 1.5) + 500

12,000 devices
  12,000 x 1.5 = 18,000
  18,000 + 500 = 18,500 min
  Set ulimit: 65,536

100,000 devices
  100,000 x 1.5 = 150,000
  150,000 + 500 = 150,500 min
  Set ulimit: 262,144

Rule of thumb: Always 3x your steady-state connection count.
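The sizing formula is simple enough to keep next to your deployment config as a sanity check. In the sketch below, the function names (requiredFds, hasHeadroom) are illustrative, and the 1.5 multiplier and 500 baseline come straight from the guide above.

```javascript
// Minimum descriptors needed: (active connections x 1.5) + 500.
function requiredFds(activeConnections) {
  return Math.ceil(activeConnections * 1.5) + 500;
}

// True if the configured limit covers the computed minimum.
function hasHeadroom(activeConnections, fdLimit) {
  return fdLimit >= requiredFds(activeConnections);
}

console.log(requiredFds(12000));        // 18500
console.log(requiredFds(100000));       // 150500
console.log(hasHeadroom(12000, 8192));  // false: the limit we were running with
console.log(hasHeadroom(12000, 65536)); // true
```

Running a check like this at startup against the actual limit turns a misconfigured host into a startup warning instead of a production incident.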

Fix 3: Connection Pooling with Backpressure

Reconnect storms (like our OTA reboot) are dangerous without rate limiting. When 4,000 devices reconnect simultaneously, don't accept all of them at once.

// Reconnect queue with controlled concurrency
const Bottleneck = require('bottleneck');

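const reconnectLimiter = new Bottleneck({
  maxConcurrent: 100,     // max 100 reconnect jobs in flight
  minTime: 10,            // min 10ms between reconnects (100 per second)
  // The reservoir must not undercut minTime: 200 per 10s would cap the
  // rate at 20 per second and stretch 4,000 reconnects to over 3 minutes
  reservoir: 1000,        // max 1,000 reconnects per refresh window
  reservoirRefreshAmount: 1000,
  reservoirRefreshInterval: 10 * 1000, // refill every 10s
});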

// Wrap the reconnect handler with the limiter
async function queueDeviceReconnect(deviceId) {
  return reconnectLimiter.schedule(() =>
    handleDeviceReconnect(deviceId)
  );
}

// Now reconnect storms are absorbed gracefully
// 4,000 reconnects → spread over 40 seconds
// instead of hammering all at once
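
How long a storm takes to drain depends on whichever throttle is slower: the spacing between launches or the reservoir refill. The function below is a rough model for checking a limiter configuration before deploying it. It is our own simplification, not Bottleneck's algorithm. It assumes every job finishes instantly and ignores maxConcurrent, and estimateDrainMs is a hypothetical name.

```javascript
// Estimate when the last of `jobs` launches, in ms after the first one.
// minTime: minimum gap between launches (ms)
// reservoir: launches allowed per window
// refreshInterval: window length (ms); the reservoir refills at each boundary
function estimateDrainMs(jobs, { minTime, reservoir, refreshInterval }) {
  let t = 0;           // launch time of the current job
  let windowStart = 0; // start of the current reservoir window
  let used = 0;        // launches consumed in this window
  for (let k = 0; k < jobs; k++) {
    if (k > 0) t += minTime;
    // Crossing a window boundary refills the reservoir.
    while (t >= windowStart + refreshInterval) {
      windowStart += refreshInterval;
      used = 0;
    }
    // Reservoir empty: wait for the next refill.
    if (used >= reservoir) {
      windowStart += refreshInterval;
      t = windowStart;
      used = 0;
    }
    used++;
  }
  return t;
}

const storm = 4000;
console.log(estimateDrainMs(storm, { minTime: 10, reservoir: 1000, refreshInterval: 10000 })); // 39990
console.log(estimateDrainMs(storm, { minTime: 10, reservoir: 200, refreshInterval: 10000 }));  // 191990
```

With a 10ms spacing and a reservoir of 1,000 per 10 seconds, 4,000 reconnects drain in about 40 seconds. A reservoir of 200 per 10 seconds becomes the bottleneck and stretches the same storm to roughly 192 seconds.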

Fix 4: Monitoring — Catch It Before It Cascades

// Add to your Node.js server: export socket metrics
const { Gauge } = require('prom-client');

const socketGauge = new Gauge({
  name: 'iot_open_sockets_total',
  help: 'Number of open MQTT socket connections',
});

const zombieGauge = new Gauge({
  name: 'iot_zombie_sockets_total',
  help: 'Connections in Map but not CONNECTED state',
});

// Update every 30 seconds
setInterval(() => {
  socketGauge.set(clients.size);

  let zombies = 0;
  for (const [, client] of clients) {
    if (!client.connected) zombies++;
  }
  zombieGauge.set(zombies);
}, 30_000);

// Alert rule (Prometheus/Grafana):
// ALERT SocketExhaustionRisk
// IF iot_open_sockets_total > 50000
// OR iot_zombie_sockets_total > 500
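
One caveat: gauges computed from the Map only see clients that are still in the Map. A client that was overwritten, as in the original bug, is invisible to them. A process-level reading of descriptor usage against the soft limit complements them. The sketch below is Linux-only and uses hypothetical helper names (parseOpenFilesSoftLimit, fdPressure, readOwnFdPressure); the 80% threshold is the red-flag level from the checklist in this article.

```javascript
const fs = require('fs');

// Extract the soft "Max open files" value from /proc/<pid>/limits text.
function parseOpenFilesSoftLimit(limitsText) {
  const label = 'Max open files';
  const line = limitsText.split('\n').find((l) => l.startsWith(label));
  if (!line) return null;
  const soft = line.slice(label.length).trim().split(/\s+/)[0];
  return soft === 'unlimited' ? Infinity : Number(soft);
}

// Ratio of descriptors in use, plus whether it crosses the alert threshold.
function fdPressure(openFds, softLimit, threshold = 0.8) {
  const ratio = openFds / softLimit;
  return { ratio, alert: ratio > threshold };
}

// Read both values for the current process; null where /proc is unavailable.
function readOwnFdPressure() {
  try {
    const open = fs.readdirSync('/proc/self/fd').length;
    const limit = parseOpenFilesSoftLimit(fs.readFileSync('/proc/self/limits', 'utf8'));
    return limit ? fdPressure(open, limit) : null;
  } catch {
    return null;
  }
}

console.log(readOwnFdPressure());
```

The ratio could be set on a third gauge inside the same 30-second interval, so the alert fires on real descriptor usage rather than on what the application believes it is holding.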

Full Architecture: Before vs After Fix

BEFORE (broken)
  OTA reboot
  4,000 reconnects all at once
  Old sockets not destroyed
  CLOSE_WAIT sockets pile up
  EMFILE at 8,192 FDs
  All 12,000 devices offline

AFTER (fixed)
  OTA reboot
  4,000 reconnects queued, max 100
  Old client ended with end(true) first
  FD freed before new socket is created
  keepAlive config enabled
  12,000 stable, spike over ~40s

Key Numbers After Fix

FD limit:      65,536 (was 8,192)
CLOSE_WAIT:    < 10 (was ~6,000)
OTA downtime:  0 devices (was 8,000)
Reconnect:     ~45s (was: cascade)

Quick Reference: Socket Exhaustion Checklist

Check               Command                              Red flag
FD limit            ulimit -n                            ≤ 8,192 with 1k+ connections
FD usage            ls /proc/$PID/fd | wc -l             > 80% of limit
CLOSE_WAIT count    ss -Htan state close-wait | wc -l    > 100 for CLOSE_WAIT
Zombie connections  App metrics                          Connections in map but not CONNECTED
Reconnect storm     Connection rate logs                 > 500 new connections in 5 seconds

Fix                                        Impact                             When to apply
Call client.end(true) before replacing     Eliminates CLOSE_WAIT buildup      Always (this is the root fix)
Set keepalive on MQTT/WS clients           Broker detects stale connections   Any persistent connection
Raise ulimit -n to 65536+                  Headroom for spikes                Any system with 1k+ connections
Rate-limit reconnects (Bottleneck, etc.)   Absorbs reconnect storms           When OTA/mass reboots are possible
Export socket metrics to Prometheus        Early warning before cascade       Production systems always

The lesson

Socket exhaustion is silent until it isn't. EMFILE errors don't appear in business metrics; they appear in system logs that no one watches. The fix is three things working together: proper socket lifecycle management (always destroy before replace), a realistic FD limit (3× your steady-state connection count), and metrics that show you the trend before the cascade. The five-line client.end(true) fix took two minutes to add. The incident it would have prevented took four engineers four hours to recover from.