What does "concurrent" actually mean?
Imagine a coffee shop with one barista.
- Sequential: serve customer A fully, then B, then C. If A is slow, B and C wait forever.
- Concurrent: start A's order, while the milk steams, take B's order, hand C their receipt. Many things are in progress at once, even with one person working.
A server is the same. Each user who is logged in and has an open connection (chat, game, live feed) is a customer "still in the shop." The server's job is to keep all of them happy without wasting energy.
The C10K problem
In the late 1990s, servers started hitting a weird wall. They were not out of CPU. They were not out of memory. But once around 10,000 users connected at once, everything slowed to a crawl.
Why it broke: the Apache story
The most popular web server of that era was Apache. Apache's design was simple and honest: fork a new process (or thread) for every incoming connection. A user connects → Apache spawns a worker → that worker handles only that user. Clean, easy to debug.
It worked great for 100 users. It worked okay for 1,000 users. At 10,000 users, it collapsed. Here is exactly why:
- Memory explosion. Each process takes 2–8 MB of RAM just to exist (stack + page tables + loaded code). 10,000 processes × 5 MB = 50 GB of RAM — and you have not handled a single request yet.
- Context switch storm. The CPU can only run one thing at a time per core. To give every process a turn, the OS has to context switch — save the current process's registers, CPU state, and memory mappings, then load the next one's. Each switch costs ~1–10 microseconds. With 10,000 processes competing, the CPU spends more time switching than doing work. This is the real killer.
- Cache thrashing. Every context switch wipes the CPU's L1/L2 cache. The next process starts "cold" and has to refetch everything from RAM (100x slower). More switches = colder caches = slower everything.
- Most workers sleep. At any moment, 9,900 of those 10,000 processes are just sitting there waiting on a slow user's network. You are paying the full memory and scheduling cost for processes that do nothing.
The fix: event loop + non-blocking I/O
Instead of one thread per user, use one (or a few) threads that handle everyone.
The trick is to never let a thread get "stuck" waiting. Normal code does this:
data = read(socket) // thread sleeps here until data arrives
With non-blocking I/O, the thread asks the operating system:
"Tell me which users have new data ready RIGHT NOW."
The OS returns a list. The thread handles them quickly, then asks again. Nobody sits idle. One thread can serve thousands of users because it only does work when work exists.
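Here is a minimal sketch of that loop in Python, using the standard-library selectors module (which picks epoll or kqueue for you). The echo handler and the port choice are illustrative, not part of any real server:

```python
import selectors
import socket

# One thread, many connections: a tiny echo-server sketch.
# selectors.DefaultSelector uses the best OS mechanism available
# (epoll on Linux, kqueue on BSD/macOS).
sel = selectors.DefaultSelector()

def accept(listener):
    conn, _addr = listener.accept()
    conn.setblocking(False)          # never let a read put this thread to sleep
    sel.register(conn, selectors.EVENT_READ, handle)

def handle(conn):
    data = conn.recv(4096)           # will not block: the OS said data is ready
    if data:
        conn.sendall(data)           # echo it back (small writes won't block here)
    else:
        sel.unregister(conn)         # empty read = client closed the connection
        conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
listener.listen()
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ, accept)

# The loop: ask "who is ready?", do only that work, ask again.
# while True:
#     for key, _events in sel.select():
#         key.data(key.fileobj)      # call accept() or handle() for ready fds
```

The loop at the bottom is commented out so the sketch runs to completion; a real server would spin on it forever, serving every connection from this single thread.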
epoll (Linux), kqueue (BSD/Mac), IOCP (Windows). These are OS features that let your program ask "who is ready?" in a way that scales to millions. You do not have to memorize the names — runtimes like Node.js, Nginx, and Go use them under the hood.
Before "hello": the handshakes that set up the connection
Before a user can send a single byte of chat data, two handshakes must happen. Both cost time and server resources — and both are multiplied by however many users are connecting at once.
1) The TCP 3-way handshake
TCP is a "reliable" protocol — it guarantees bytes arrive in order with no gaps. To do that, both sides must first agree they are connected. This is the famous 3-way handshake:
- SYN → client says "I want to start a connection, my starting number is X".
- ← SYN-ACK server says "okay, I got X, my starting number is Y".
- ACK → client says "got Y, we're connected".
Only after this can data flow. On your server, this is where accept() comes in. Your app calls accept() on the listening socket, and the kernel hands back a new file descriptor — that fd represents this one user's connection. Every user gets their own fd.
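The listen/accept split can be sketched in a few lines of Python; port 0 and the backlog of 128 are illustrative choices, not requirements:

```python
import socket

# The kernel completes the SYN / SYN-ACK / ACK exchange for connections
# queued on the listening socket; accept() then hands your app the result.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
server.listen(128)              # backlog: completed handshakes the kernel may queue

# A client connecting triggers the 3-way handshake:
client = socket.create_connection(server.getsockname())

# accept() returns a NEW socket whose fd represents this one user's
# connection; the listening socket keeps its own fd and waits for others.
conn, addr = server.accept()
```

Note that the handshake itself happened in the kernel before accept() returned — accept() just collects the finished connection.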
2) The TLS handshake (for HTTPS / WSS)
If the user is on HTTPS, there's another handshake on top of the TCP one. The client and server exchange certificates, verify identity, and negotiate encryption keys. This takes another 1–2 round trips and burns CPU for the crypto math (RSA or ECDSA verification).
The full journey: from user's keyboard to your app
Now that the connection is established, let's trace what happens when the user actually types "hello" in their chat app and your server reads it. Every step is real work. Here is the path, step by step:
- User types "hello" in their chat app on a phone or browser.
- The client's OS wraps it in TCP packets. Each packet gets the destination IP, port, and a sequence number so the receiver can reassemble them in order.
- The packets travel the internet — WiFi → home router → ISP → backbone → your datacenter. This may take 20–200 ms depending on distance.
- Your server's network card (NIC) receives the packets. The NIC copies them into RAM and tells the CPU "new data arrived" (this is called an interrupt).
- The kernel's TCP stack reassembles the packets into the original byte stream and checks for missing pieces. Good bytes go into a socket receive buffer — one buffer per open connection.
- The kernel marks the socket's file descriptor (fd) as "ready to read". An fd is just a number (like 42) that represents one connection. Your app does not know yet — the data is sitting in the kernel, waiting.
- Your app finds out via epoll_wait(). Earlier, your app told the kernel "wake me when any of these sockets have data." Now that call returns with a list: "fd 42 is ready."
- Your app calls read(fd, buffer). The kernel copies the bytes from the socket buffer into your app's memory. Now "hello" is a string in your code.
- Your handler runs. You parse the message, update the chat, maybe write to a database. Done.
The epoll_wait() call can watch 10,000 sockets at once and only wake your app for the few that actually have data. That is why one thread can serve thousands of users. In the old model, each thread called read(fd) directly and got stuck sleeping until data arrived on its one socket. 10,000 users → 10,000 sleeping threads → wasted RAM and CPU switching.
Sending the reply back: the write path
Your app does not just read — it also writes back. "Got your hello" needs to travel the same route in reverse:
- Your app builds a response in its own memory (e.g., a JSON string).
- Your app calls write(fd, data). The kernel copies those bytes from user memory into the socket's send buffer.
- The TCP stack chops the bytes into packets, adds headers, and hands them to the NIC driver.
- The NIC sends the packets out onto the wire toward the client.
- TCP tracks acknowledgements. If a packet is lost, the kernel retransmits it — your app never notices.
- The client's OS reassembles and delivers the bytes to the client's app.
The real technical breakthrough: select() → poll() → epoll()
If the event loop idea is so great, why did it take 20 years to get good? Because the tool the kernel gave apps to ask "who is ready?" was terrible at first. The journey from broken to fast is the actual story of C10K.
select() and poll() make the kernel scan every fd on every call; epoll tracks readiness as data arrives and returns only the active fds. Here's the concrete difference. Imagine you have 10,000 open sockets but only 5 have data right now:
- select() / poll(): your app sends all 10,000 fds to the kernel. The kernel loops through all 10,000 checking each one. It returns. Your app loops through all 10,000 again to find which 5 are ready. Work done: 20,000 checks. Useful work: 5.
- epoll(): you registered your 10,000 fds once (long ago). When the kernel gets data for a socket, it silently adds that fd to an internal "ready list". When you call epoll_wait(), the kernel hands you exactly those 5 ready fds. Work done: 5 items returned. Useful work: 5.
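You can see the "register once, get back only the ready fds" contract from Python's stdlib selectors module (DefaultSelector is epoll on Linux, kqueue on BSD/macOS); the 100 socket pairs and the choice of pair 42 are just for demonstration:

```python
import selectors
import socket

sel = selectors.DefaultSelector()

# One-time registration of 100 sockets — done once, long before any data.
pairs = [socket.socketpair() for _ in range(100)]
for reader, _writer in pairs:
    reader.setblocking(False)
    sel.register(reader, selectors.EVENT_READ)

# Data arrives on exactly one of the 100 sockets...
pairs[42][1].send(b"hello")

# ...and the wait hands back exactly that one. No scanning 100 fds in
# your app, no passing 100 fds to the kernel on every call.
ready = sel.select(timeout=1)
```

With select()-style polling, the same question would cost a pass over all 100 registrations on every call, in both the kernel and your app.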
What still goes wrong at C10K
- Async cannot fix slow code. If every request runs 50 database queries, no event loop will save you.
- File descriptor limits. Every open connection uses one file descriptor; the OS has a cap (often 1024 by default). Raise it or you will hit a wall.
- Memory per connection. Each connection still needs buffers. 10,000 connections × 64 KB = 640 MB just for buffers.
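On Unix you can inspect and raise the fd limit from inside the process itself; a sketch using the standard-library resource module (not available on Windows), with 65536 as an illustrative target:

```python
import resource

# Each connection costs one file descriptor; check this process's limits.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# A default soft limit of 1024 caps you at roughly 1,000 concurrent
# connections. Raise it, but never above the hard cap the admin set.
target = hard if hard != resource.RLIM_INFINITY else 65536
new_soft = min(65536, target)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
```

An unprivileged process may raise its soft limit only up to the hard limit, which is why the sketch clamps against it.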
The C10M problem
Now imagine you are WhatsApp, Discord, or a big game. You do not have 10,000 users. You have 10 million. Can the same trick work?
No. And this is where people get confused. C10M is not just "C10K with more zeros." It is a different problem entirely.
Why one server cannot do it
Even if your server is a god-tier event loop, one machine has hard limits:
- RAM: 10 million connections × even 10 KB each = 100 GB just for connection state.
- Network card: one NIC can only push so many packets per second.
- CPU: TLS encryption, parsing, and business logic all eat cycles.
- One crash = everyone down. No redundancy.
The fix: split the work across many machines
Real systems handle millions of users by spreading them out across layers: a CDN at the edge, then a load balancer, then many app servers, backed by caches and a sharded database. Each layer solves a specific problem:
- CDN / edge: serves static files close to the user, and handles TLS (HTTPS) so the app servers do not have to decrypt every byte.
- Load balancer: spreads users evenly across app servers. If one app server dies, others take over.
- Many app servers: each one only handles a slice of users. Need more capacity? Add more servers.
- Cache (Redis, Memcached): stores hot data in memory so you do not hit the database for every request.
- Sharded database: split data across many DB servers so no single DB is a bottleneck.
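The usual sharding trick is to hash a stable key to pick a database; a minimal sketch, with a shard count of 8 chosen purely for illustration:

```python
import hashlib

NUM_SHARDS = 8  # illustrative; real systems pick this for their data size

def shard_for(user_id: str) -> int:
    # Stable hash: the same user always lands on the same shard, so
    # every server agrees on where that user's data lives.
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

The downside is baked into the modulo: changing NUM_SHARDS remaps almost every key, which is why real systems reach for consistent hashing when shards must be added live.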
What still goes wrong at C10M
- Thundering herd. You deploy a new version, everyone disconnects, everyone reconnects at once — your own traffic becomes a DDoS on yourself. Fix: reconnect with random delays (backoff + jitter).
- Hot shards. One celebrity on your app has 10 million followers. That shard melts. Fix: special handling for hot users, or duplicate their data.
- Cache stampede. A popular cache entry expires, and 1,000 servers all try to rebuild it at the same time. Fix: locks, or "stale-while-revalidate" caching.
- Cost. Millions of idle connections still cost real money in bandwidth and RAM.
- Tail latency. Your average response time is fine, but 1% of users are waiting 30 seconds because one DB shard is slow. Those users leave.
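The reconnect fix from the first bullet — exponential backoff with "full jitter", so clients spread out instead of all retrying in the same second — can be sketched in a few lines; the base delay and cap are illustrative:

```python
import random

def reconnect_delay(attempt: int, base: float = 0.5, max_delay: float = 30.0) -> float:
    """Delay before reconnect attempt N: exponential backoff + full jitter."""
    cap = min(max_delay, base * (2 ** attempt))   # 0.5s, 1s, 2s, 4s... capped at 30s
    return random.uniform(0, cap)                 # random point in [0, cap]

# Usage sketch (try_reconnect is hypothetical):
# for attempt in range(10):
#     time.sleep(reconnect_delay(attempt))
#     if try_reconnect():
#         break
```

The jitter is the important part: without it, every client computes the same delay and the herd simply thunders again, 0.5 seconds later.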
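The cache-stampede fix can be sketched as a rebuild lock plus a re-check, so only one caller regenerates an expired entry. Real systems use one lock per key (or per cache shard); a single global lock keeps the sketch short:

```python
import threading
import time

cache = {}                       # key -> (value, expires_at)
rebuild_lock = threading.Lock()  # real systems: one lock per key

def get(key, ttl, rebuild):
    entry = cache.get(key)
    if entry and entry[1] > time.monotonic():
        return entry[0]                # fresh: serve straight from cache
    with rebuild_lock:
        # Re-check: another caller may have rebuilt while we waited.
        entry = cache.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]
        value = rebuild()              # only one caller reaches the database
        cache[key] = (value, time.monotonic() + ttl)
        return value
```

Without the re-check inside the lock, every waiting caller would still run rebuild() one after another — serialized, but just as wasteful.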
Side-by-side comparison
| Question | C10K | C10M |
|---|---|---|
| What are we scaling? | One server | The whole system |
| Main problem | Threads waste memory | One machine cannot do it all |
| Main fix | Event loop + non-blocking I/O | Load balancer + many servers + sharded DB |
| You debug | fd limits, event loop blocking | DB hotspots, reconnect storms, p99 latency |
| Beginner mistake | Thinking more threads = faster | Copying Netflix's architecture on day one |
| Year it got popular | ~2000s | ~2010s |
The five things to remember
- C10K is a single-server problem. Fix it with non-blocking I/O and an event loop. Stop making one thread per user.
- C10M is a system-design problem. Fix it with load balancers, many servers, caches, and sharded databases. No coding trick will do this alone.
- Async code does not fix slow databases. If your DB is the bottleneck, epoll will not help.
- Split the work early. Separate TLS, routing, compute, and data into different layers so each can scale on its own.
- Measure before you optimize. Look at RAM per connection, DB query time, and p99 latency. The label (C10K, C10M) matters less than your actual graphs.
Further reading
- Dan Kegel's original "The C10K problem" — the classic article that named the problem.
- Robert Graham's C10M talk (Shmoocon 2013), summarized by Todd Hoff at High Scalability: The Secret to 10 Million Concurrent Connections.
- Try it yourself: write a small chat server in Node.js or Go, then use a tool like wrk or ab to throw thousands of connections at it. Watch where it breaks first — that is your real lesson.