Socket Exhaustion in High-Concurrency IoT Systems

The Incident: 8,000 Devices Go Dark

We were running a Node.js backend managing 12,000 MQTT-connected IoT devices — industrial sensors, environmental monitors, smart building controllers. Each device maintained a persistent WebSocket/MQTT connection to our broker.

At 11:04pm, a firmware OTA push to ~4,000 devices caused them to reboot simultaneously. They came back online within 60 seconds and tried to reconnect. What followed was a cascade that took down the entire fleet — not just the rebooted devices.

What Is a Socket and Why Does It Exhaust?

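Every TCP connection, whether it's a WebSocket, an MQTT session, or plain HTTP, requires an open file descriptor (FD) in the process that owns it. The OS caps how many FDs a single process can hold (the nofile limit). Once a process hits that cap, every new connection attempt fails with EMFILE ("too many open files"), no matter which device it belongs to.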

The Root Cause: Zombie Sockets

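Our MQTT client pool was creating a new connection object for each device reconnect but not properly destroying the old one. The old TCP socket stayed open, either still ESTABLISHED or stuck in CLOSE_WAIT after the peer had closed its side, and held its file descriptor, sometimes for minutes. Sockets in TIME_WAIT are a different case: the application has already closed them, so they tie up kernel state but not file descriptors.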

// THE PROBLEMATIC CODE
// When a device reconnected, we'd create a new MQTT client
// without explicitly ending the old one first
const mqtt = require('mqtt');

const clients = new Map(); // deviceId → MqttClient

function handleDeviceReconnect(deviceId) {
  // BUG: old client is just overwritten, not destroyed
  const client = mqtt.connect(BROKER_URL, {
    clientId: deviceId,
    clean: false,  // persistent session
    // Missing: keepalive, reconnectPeriod config
  });

  clients.set(deviceId, client); // old client leaked!

  client.on('connect', () => {
    console.log(`${deviceId} connected`);
  });
}

Diagnosing Socket Exhaustion in Production

# Check open file descriptors for a Node.js process
PID=$(pgrep -f "node server.js")

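# Count open file descriptors (all types), then sockets only
ls /proc/$PID/fd | wc -l
ls -l /proc/$PID/fd | grep -c 'socket:'

# See socket states (system-wide, IPv4 only; /proc/net/tcp6 covers IPv6)
awk 'NR > 1 {print $4}' /proc/net/tcp | sort | uniq -c | sort -rn
# 01 = ESTABLISHED, 06 = TIME_WAIT, 08 = CLOSE_WAIT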

# Check current ulimit for the process
cat /proc/$PID/limits | grep "open files"

# Real-time socket count (watch it climb)
watch -n 1 "ls /proc/$PID/fd | wc -l"

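# Count sockets in CLOSE_WAIT (the leak signature)
ss -Htn state close-wait | wc -l

# Socket summary (ss -s does not break out CLOSE_WAIT)
ss -s
# Output: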
# Total: 18432
# TCP:   16408 (estab 8192, closed 4216, orphaned 0, timewait 4000)
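
The same state breakdown can be produced from inside Node, which is handy when the backend runs in a container without ss installed. The following is a minimal sketch, assuming a Linux host; countTcpStates and TCP_STATES are names made up for this example, not part of any library.

```javascript
// Sketch: tally TCP socket states from /proc/net/tcp-formatted text.
// countTcpStates and TCP_STATES are hypothetical names for illustration.
const fs = require('fs');

// Kernel state codes as they appear in the "st" column (hex).
const TCP_STATES = {
  '01': 'ESTABLISHED', '02': 'SYN_SENT', '03': 'SYN_RECV',
  '04': 'FIN_WAIT1', '05': 'FIN_WAIT2', '06': 'TIME_WAIT',
  '07': 'CLOSE', '08': 'CLOSE_WAIT', '09': 'LAST_ACK',
  '0A': 'LISTEN', '0B': 'CLOSING',
};

function countTcpStates(procNetTcpText) {
  const counts = {};
  const rows = procNetTcpText.split('\n').slice(1); // skip the header row
  for (const row of rows) {
    const fields = row.trim().split(/\s+/);
    if (fields.length < 4) continue; // blank or malformed line
    const code = fields[3].toUpperCase();
    const name = TCP_STATES[code] || `UNKNOWN(${code})`;
    counts[name] = (counts[name] || 0) + 1;
  }
  return counts;
}

// Live usage on Linux. Like the awk one-liner, this is system-wide and IPv4 only.
if (fs.existsSync('/proc/net/tcp')) {
  console.log(countTcpStates(fs.readFileSync('/proc/net/tcp', 'utf8')));
}
```

A CLOSE_WAIT count that keeps growing is the number to watch: it means the peer has hung up and our process has not closed its end.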

The Fix: Proper Connection Lifecycle Management

// FIXED VERSION
const clients = new Map(); // deviceId → MqttClient

function handleDeviceReconnect(deviceId) {
  // Step 1: Properly destroy the old connection first
  const existing = clients.get(deviceId);
  if (existing) {
    existing.end(true); // force=true: destroy socket immediately
    clients.delete(deviceId);
  }

  // Step 2: Create new connection with proper keepAlive config
  const client = mqtt.connect(BROKER_URL, {
    clientId: deviceId,
    clean: false,
    keepalive: 60,          // send PING every 60s — broker detects stale
    connectTimeout: 10000,  // fail fast if broker unreachable
    reconnectPeriod: 0,     // disable auto-reconnect (we manage it)
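    // Add socket-level keepAlive for the TCP layer too.
    // Assumption: check that your mqtt.js version honors socketOptions;
    // if it does not, call setKeepAlive(true, 30000) on the underlying
    // net.Socket once connected.
    socketOptions: {
      keepAlive: true,
      initialDelay: 30000   // OS-level TCP keepalive after 30s idle
    }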
  });

  clients.set(deviceId, client);

  client.on('connect', () => {
    console.log(`${deviceId} connected`);
  });

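  client.on('error', (err) => {
    console.error(`${deviceId} error: ${err.message}`);
    client.end(true);
    // Only evict this client; a newer one may already hold the slot
    if (clients.get(deviceId) === client) {
      clients.delete(deviceId);
    }
  });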

  // Critical: clean up on any close
  client.on('close', () => {
    if (clients.get(deviceId) === client) {
      clients.delete(deviceId);
    }
  });
}
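
The two ideas that matter in the fixed handler, destroy before replace and the identity check in the close handler, can be pulled out of mqtt.js and exercised with a stand-in client. This is a sketch under the assumption that a client exposes end(force) and emits 'close'; ClientRegistry and FakeClient are hypothetical names, not part of mqtt.js.

```javascript
const { EventEmitter } = require('events');

// Hypothetical registry: works with any client that has end(force)
// and emits a 'close' event.
class ClientRegistry {
  constructor() {
    this.clients = new Map(); // deviceId -> client
  }

  // Tear down whatever is registered for deviceId, then track the new client.
  replace(deviceId, client) {
    const existing = this.clients.get(deviceId);
    if (existing && existing !== client) {
      existing.end(true); // force-close the old socket before it is orphaned
    }
    this.clients.set(deviceId, client);

    // Identity guard: a late 'close' from an old client must not evict
    // the newer client stored under the same deviceId.
    client.once('close', () => {
      if (this.clients.get(deviceId) === client) {
        this.clients.delete(deviceId);
      }
    });
    return client;
  }
}

// Stand-in for an MQTT client, for illustration only.
class FakeClient extends EventEmitter {
  constructor() {
    super();
    this.ended = false;
  }
  end(force) {
    this.ended = true;
    this.forced = Boolean(force);
  }
}

const registry = new ClientRegistry();
const first = registry.replace('sensor-1', new FakeClient());
const second = registry.replace('sensor-1', new FakeClient());
first.emit('close'); // late close from the replaced client
console.log(first.ended, registry.clients.get('sensor-1') === second); // true true
```

Without the identity guard, that late close event would delete the entry for the new client, and the next reconnect would leak it in turn.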

Fix 2: Raise the File Descriptor Limit

Even with proper cleanup, a 12,000-device system needs a higher FD limit as headroom. The 8,192 default we were running with is too low for a fleet this size.

# /etc/security/limits.conf — permanent change
# Add these lines:
*    soft    nofile    65536
*    hard    nofile    65536
root soft    nofile    65536
root hard    nofile    65536

# /etc/sysctl.conf — kernel-level limit
fs.file-max = 2097152

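# Apply the sysctl change immediately (without reboot):
sysctl -p
# limits.conf changes only take effect for new login sessions

# For systemd services (Node.js as a service), limits.conf is not consulted.
# Set the limit in the unit file instead:
# /etc/systemd/system/iot-backend.service
[Service]
LimitNOFILE=65536
# Then: systemctl daemon-reload && systemctl restart iot-backend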
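
A raised limit is easy to get wrong, because a service can quietly start with the old value. One option is to have the process check its own limit at startup. The sketch below assumes Linux and reads /proc/self/limits; parseOpenFilesSoftLimit, hasHeadroom, and the PEAK_CONNECTIONS figure are illustrative assumptions, with the 3× factor taken from the rule of thumb in this post.

```javascript
const fs = require('fs');

// Extract the soft "Max open files" limit from /proc/<pid>/limits text.
// Returns Infinity for "unlimited" and null if the row is missing.
function parseOpenFilesSoftLimit(limitsText) {
  const row = limitsText
    .split('\n')
    .find((line) => line.startsWith('Max open files'));
  if (!row) return null;
  const soft = row.slice('Max open files'.length).trim().split(/\s+/)[0];
  return soft === 'unlimited' ? Infinity : Number(soft);
}

// True when the limit leaves `factor` times the expected peak as headroom.
function hasHeadroom(softLimit, peakConnections, factor = 3) {
  return softLimit !== null && softLimit >= peakConnections * factor;
}

// Startup check (Linux only). PEAK_CONNECTIONS is an assumed figure.
const PEAK_CONNECTIONS = 12000;
if (fs.existsSync('/proc/self/limits')) {
  const soft = parseOpenFilesSoftLimit(
    fs.readFileSync('/proc/self/limits', 'utf8')
  );
  if (!hasHeadroom(soft, PEAK_CONNECTIONS)) {
    console.warn(
      `nofile soft limit ${soft} is below 3x peak (${PEAK_CONNECTIONS * 3})`
    );
  }
}
```

Logging a warning rather than exiting is a deliberate choice here: a backend that refuses to start is a worse outage than one that starts with a loud complaint.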

Fix 3: Connection Pooling with Backpressure

Reconnect storms (like our OTA reboot) are dangerous without rate limiting. When 4,000 devices reconnect simultaneously, don't accept all of them at once.

// Reconnect queue with controlled concurrency
const Bottleneck = require('bottleneck');

const reconnectLimiter = new Bottleneck({
  maxConcurrent: 100,     // max 100 simultaneous reconnects
  minTime: 10,            // min 10ms between reconnects
  reservoir: 200,         // max 200 reconnects in burst
  reservoirRefreshAmount: 200,
  reservoirRefreshInterval: 10 * 1000, // refill every 10s
});

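// Wrap the reconnect handler with the limiter.
// The async arrow guarantees schedule() always receives a promise.
// Note: handleDeviceReconnect returns before the MQTT handshake completes,
// so maxConcurrent bounds handler calls, not in-flight connects; the
// reservoir and minTime do the real throttling here.
async function queueDeviceReconnect(deviceId) {
  return reconnectLimiter.schedule(async () =>
    handleDeviceReconnect(deviceId)
  );
}

// Now reconnect storms are absorbed gracefully.
// The reservoir admits 200 reconnects per 10s window, so
// 4,000 reconnects are spread over roughly 190 seconds
// instead of hammering the broker all at once.
// (minTime alone would allow 4,000 × 10ms = 40 seconds.)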
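
Before shipping a limiter configuration, it helps to work out how long a storm will actually take to drain, since the reservoir and minTime interact. The sketch below is a simplified model of those two settings, not Bottleneck's real scheduler: it assumes jobs finish instantly (so maxConcurrent never binds) and that refreshInterval is greater than zero. estimateDrainMs is a hypothetical helper.

```javascript
// Simplified model of a reservoir + minTime rate limiter.
// Returns the start time (ms) of the last job in the storm.
function estimateDrainMs({ jobs, minTime, reservoir, refreshInterval }) {
  if (jobs <= 0) return 0;
  let now = 0;                        // simulated clock in ms
  let tokens = reservoir;             // starts allowed in the current window
  let nextRefresh = refreshInterval;  // when the reservoir is next refilled
  let lastStart = -Infinity;

  for (let i = 0; i < jobs; i++) {
    // Earliest start permitted by minTime spacing
    now = Math.max(now, lastStart + minTime);
    // Apply any refills that have already happened by `now`
    while (now >= nextRefresh) {
      tokens = reservoir;
      nextRefresh += refreshInterval;
    }
    // Reservoir empty: wait for the next refill
    if (tokens === 0) {
      now = nextRefresh;
      tokens = reservoir;
      nextRefresh += refreshInterval;
    }
    tokens -= 1;
    lastStart = now;
  }
  return lastStart;
}

const storm = { jobs: 4000, minTime: 10, reservoir: 200, refreshInterval: 10000 };
console.log(estimateDrainMs(storm)); // 191990
console.log(
  estimateDrainMs({ ...storm, reservoir: Infinity, refreshInterval: Infinity })
); // 39990
```

With the settings shown above, the reservoir is the binding constraint: the last of 4,000 reconnects starts at about 192 seconds. The 40-second figure only holds if minTime is the sole limit.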

Fix 4: Monitoring — Catch It Before It Cascades

// Add to your Node.js server: export socket metrics
const { Gauge } = require('prom-client');

const socketGauge = new Gauge({
  name: 'iot_open_sockets_total',
  help: 'Number of open MQTT socket connections',
});

const zombieGauge = new Gauge({
  name: 'iot_zombie_sockets_total',
  help: 'Connections in Map but not CONNECTED state',
});

// Update every 30 seconds
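setInterval(() => {
  // Tracked clients only: a leaked socket that fell out of the Map
  // is invisible here, so also watch the process FD count
  socketGauge.set(clients.size);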

  let zombies = 0;
  for (const [, client] of clients) {
    if (!client.connected) zombies++;
  }
  zombieGauge.set(zombies);
}, 30_000);

// Alert rule (Prometheus/Grafana):
// ALERT SocketExhaustionRisk
// IF iot_open_sockets_total > 50000
// OR iot_zombie_sockets_total > 500
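
Because clients.size only counts clients the Map still knows about, a socket that has leaked out of the Map never shows up in it. A process-level descriptor count closes that gap. The following is a sketch assuming Linux, where /proc/self/fd lists every open descriptor as a symlink; summarizeFdTargets and readOwnFdTargets are hypothetical helpers, and wiring the result into a gauge is left as a comment so the block has no dependencies.

```javascript
const fs = require('fs');

// Classify the symlink targets of /proc/<pid>/fd entries.
// Sockets appear as "socket:[<inode>]".
function summarizeFdTargets(targets) {
  const sockets = targets.filter((t) => t.startsWith('socket:')).length;
  return { total: targets.length, sockets };
}

// Read this process's descriptor targets (Linux only; null elsewhere).
function readOwnFdTargets(fdDir = '/proc/self/fd') {
  if (!fs.existsSync(fdDir)) return null;
  const targets = [];
  for (const name of fs.readdirSync(fdDir)) {
    try {
      targets.push(fs.readlinkSync(`${fdDir}/${name}`));
    } catch {
      // The descriptor used to list the directory is already gone; skip it.
    }
  }
  return targets;
}

const targets = readOwnFdTargets();
if (targets) {
  const { total, sockets } = summarizeFdTargets(targets);
  console.log(`open fds: ${total}, sockets: ${sockets}`);
  // In the server, set these on gauges inside the existing setInterval,
  // and alert on total / limit rather than on an absolute number.
}
```

Alerting on the ratio of open descriptors to the limit matches the "> 80% of limit" red flag in the checklist and keeps working after the limit is raised.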

Quick Reference: Socket Exhaustion Checklist

Check	Command	Red flag
FD limit	`grep "open files" /proc/$PID/limits`	≤ 8,192 with 1k+ connections
FD usage	`ls /proc/$PID/fd | wc -l`	> 80% of limit
CLOSE_WAIT count	`ss -Htn state close-wait | wc -l`	> 100
Zombie connections	App metrics	Connections in map but not CONNECTED
Reconnect storm	Connection rate logs	> 500 new connections in 5 seconds

Fix	Impact	When to apply
Call `client.end(true)` before replacing	Eliminates CLOSE_WAIT buildup	Always (this is the root fix)
Set `keepalive` on MQTT/WS clients	Broker detects stale connections	Any persistent connection
Raise `ulimit -n` to 65536+	Headroom for spikes	Any system with 1k+ connections
Rate-limit reconnects (Bottleneck, etc.)	Absorbs reconnect storms	When OTA/mass reboots are possible
Export socket metrics to Prometheus	Early warning before cascade	Production systems always

The lesson

Socket exhaustion is silent until it isn't. EMFILE errors don't appear in business metrics; they appear in system logs that no one watches. The fix is three things working together: proper socket lifecycle management (always destroy before replace), a realistic FD limit (3× your peak), and metrics that show you the trend before the cascade. The five-line client.end(true) fix took two minutes to add. The incident it would have prevented took four engineers four hours to recover from.