Stop-the-World: When GC Freezes Everything

What garbage collection actually does, and why your app randomly pauses.

[Diagram: during a stop-the-world (STW) pause, application threads are frozen while the GC worker cleans; zero requests are served. Worst-case pause duration: 100ms to 5 sec. At 5,000 rps, a 200ms pause stalls 1,000 requests.]

The Scene: Latency Spikes With No Errors

A backend service was running fine — p50 latency at 12ms, p99 at 80ms. Then every 30-90 seconds:

  • p99 latency spiked to 800ms+
  • Health checks failed intermittently
  • The load balancer marked instances as unhealthy
  • No errors. No exceptions. No CPU spike.

The app just froze for a few hundred milliseconds and then resumed like nothing happened. No log entry. No stack trace. The culprit? Garbage Collection pauses.

Let's understand what's actually going on.

What Is Garbage Collection?

When your code creates an object — a string, a list, a request handler — it takes up memory. In languages like C, you have to free that memory yourself. Miss one? Memory leak. Free it twice? Crash.

Garbage Collection automates this. The GC periodically scans memory, finds objects your code can no longer reach, and frees them. You write code, it cleans up after you.
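As a concrete sketch, CPython's gc module makes this visible. Plain reference counting frees most garbage immediately, but a reference cycle needs the collector to notice that it has become unreachable (the Node class here is purely illustrative):

```python
import gc

class Node:
    """A toy object that can hold a reference to another object."""
    def __init__(self):
        self.other = None

# Build a cycle: a references b, and b references a. Reference
# counting alone can never free this pair, because each object
# always has at least one incoming reference.
a, b = Node(), Node()
a.other, b.other = b, a

# Drop our own references. The cycle is now unreachable from any root.
del a, b

# A collection pass traces reachability, finds the orphaned cycle,
# and frees it. collect() returns the number of unreachable objects found.
freed = gc.collect()
print(freed)
```

Every managed runtime performs some version of this reachability test; they differ mainly in when they run it and what they pause to do so.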

The lifecycle has four steps:

  1. Allocate: code creates objects in memory.
  2. Use: objects are alive because code still references them.
  3. Unreachable: no references are left; the object is garbage.
  4. GC reclaim: the GC frees the memory (and may pause the app to do it!).

Most objects die young: created during a request and garbage milliseconds later. The problem isn't GC itself; it's when the GC has to pause your entire app to do its work.

The problem comes in step 4. To find and reclaim dead objects, many GC implementations need to pause all your application threads. This is called a Stop-the-World (STW) pause. During this pause, your application is completely frozen — no requests processed, no responses sent, no timers fired. Nothing.

How Memory Is Organized: The Heap

All dynamically allocated objects live in a region of memory called the heap. Most modern GC implementations divide the heap into areas based on object age — this is called generational garbage collection.

Why generations? Because of one powerful observation: most objects die young.

Think about it. A request handler creates a DTO, serializes a response, builds a few strings. All of those are garbage within milliseconds. Only a few things — caches, connection pools, singletons — live for the lifetime of the application.

The generational heap layout:

  • Young generation (collected often, fast: "minor GC"). The nursery holds brand-new objects; the survivor space holds objects that survived 1+ GC cycles. Most objects die here and are never promoted: roughly 90-95% of all allocations become garbage in the young gen.
  • Old generation (collected rarely, expensive: "major GC"). Long-lived objects (caches, pools, configs) and objects promoted after surviving many minor GCs live here. A full GC of this space is a Stop-the-World pause, and a bigger heap means a longer pause.

Objects that survive enough minor GCs are promoted from the young generation to the old one. Young-gen GC is fast (1-10ms); old-gen GC is where the painful pauses live.
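CPython's cycle collector follows this generational design with three generations; a minimal sketch of inspecting them (the default threshold values vary by build, so treat the numbers as illustrative):

```python
import gc

# CPython's cycle collector is generational: three generations, with
# generation 0 (the youngest) collected far more often than the rest.
thresholds = gc.get_threshold()  # commonly (700, 10, 10) by default
counts = gc.get_count()          # allocations currently tracked per generation

print(thresholds)
print(counts)
```

The first threshold says how many allocations (minus deallocations) trigger a generation-0 pass; the other two say how many younger-generation passes trigger a pass over the next-older generation.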

How GC Actually Works: Mark and Sweep

The most common GC algorithm is Mark-and-Sweep. It works in three phases:

  1. Mark — starting from known "root" references (global variables, the stack, CPU registers), walk every reference chain and mark every reachable object as "alive"
  2. Sweep — scan the entire heap and free any object that wasn't marked (it's garbage)
  3. Compact (optional) — slide surviving objects together to eliminate fragmentation, updating all pointers
[Diagram: before GC, the roots reach A, B, and D; C, E, and F are unreachable. The mark phase traces from the roots and marks A, B, and D as alive. The sweep phase frees the unmarked C, E, and F, reclaiming their memory.]

Why Does This Need to Stop the World?

Imagine counting people in a building while they're walking between rooms. You'd miss people or count them twice. You need everyone to freeze for an accurate count. GC has the same problem: if threads keep moving references while it scans, it could miss a live object (crash!) or fail to collect garbage (memory leak). So the runtime pauses everything, scans the heap, and lets threads resume.
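The mark and sweep phases can be sketched as a toy collector over an explicit object graph; the heap dict, the node names, and the mark_and_sweep function are all illustrative, not a real runtime API:

```python
# A toy mark-and-sweep over a tiny "heap" modeled as a dict mapping
# each object name to the list of object names it references.

def mark_and_sweep(heap, roots):
    # MARK: walk every reference chain starting from the roots.
    marked = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in marked:
            continue
        marked.add(obj)
        stack.extend(heap[obj])
    # SWEEP: keep only marked objects; everything else is freed.
    return {obj: refs for obj, refs in heap.items() if obj in marked}

# Roots reach A -> B -> D; the C -> E -> F chain is unreachable garbage.
heap = {
    "A": ["B"], "B": ["D"], "D": [],
    "C": ["E"], "E": ["F"], "F": [],
}
live = mark_and_sweep(heap, roots=["A"])
print(sorted(live))  # ['A', 'B', 'D']: C, E, and F were swept
```

The mark phase is why pausing matters: if another thread rewired "B" to point at "F" mid-scan, the collector could free an object that is actually still in use.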

Minor GC vs Major GC

Not all GC pauses are equal. The pain depends on which generation is being collected.

Minor GC — Fast and Frequent

When the young generation (nursery) fills up, the runtime triggers a minor GC. It copies the few surviving objects to the survivor space and wipes the nursery clean. Since most objects are already dead, this is very fast.

[Diagram: before a minor GC, the nursery is full (1,000 objects, 950 of them dead). After the minor GC, the nursery is empty: the 50 survivors were copied out and the 950 dead objects were freed instantly.]

Minor GC in numbers: a 1-10ms pause, running every few seconds, usually not noticeable. It only touches the young generation: a small area, a fast scan, a short pause.

Major GC (Full GC) — The Pause That Hurts

When the old generation fills up, the runtime triggers a full GC. This is the expensive one — it walks the entire old generation, marks every reachable object, sweeps the dead ones, and may compact memory. The bigger your heap, the longer this takes.

During a full GC, all application threads are suspended: no requests, no responses, no health checks, no heartbeats. Nothing moves. The collector walks the entire heap and marks every live object, sweeps the dead ones to free unreachable memory, and optionally compacts to defragment the heap. Pause time: 50ms to 5s+. The bigger the heap, the longer the pause: a 4GB heap might pause for 200ms; a 16GB heap can pause for multiple seconds.
A 200ms GC pause on a service handling 5,000 requests/sec = 1,000 requests frozen at once. Those requests either wait (adding 200ms to latency), time out, or cascade (callers retry, creating even more load).
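To see pause durations rather than guess at them, most runtimes expose hooks. In CPython, gc.callbacks (Python 3.3+) fires at the start and stop of each collection pass; a minimal sketch of timing collections with it:

```python
import gc
import time

# gc.callbacks fires a "start" and a "stop" event around every
# collection pass, which makes pause durations directly measurable.
pauses_ms = []
_started = [0.0]

def _on_gc(phase, _info):
    if phase == "start":
        _started[0] = time.perf_counter()
    elif phase == "stop":
        pauses_ms.append((time.perf_counter() - _started[0]) * 1000.0)

gc.callbacks.append(_on_gc)

# Create some cyclic garbage, then force a pass to record a pause.
cycle = []
cycle.append(cycle)
del cycle
gc.collect()

gc.callbacks.remove(_on_gc)
print(f"{len(pauses_ms)} collection(s), worst: {max(pauses_ms):.3f} ms")
```

Tracing runtimes expose the same data through logs instead of callbacks: GC logging on the JVM, or GODEBUG=gctrace=1 in Go.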

What Triggers a Full GC?

Full GC doesn't just happen randomly. These are the common triggers, regardless of language:

Common full GC triggers:

  • Old gen full: too many objects promoted from the young generation
  • Promotion failure: a young GC tries to promote survivors, but the old gen has no space left
  • Explicit GC call: System.gc(), runtime.GC(), global.gc() (avoid these!)
  • Memory leak: objects accumulate and are never released, so the heap always grows
  • Heap too small: GC runs constantly because there's never enough space
  • Large allocations: big arrays and buffers go directly to the old gen and fill it faster

The single most common cause? Memory leaks: objects that are reachable but no longer needed.

The Cascade: How One GC Pause Kills a Cluster

GC pauses don't just affect one request — they can cascade across your entire infrastructure.

How a GC pause cascades into an outage:

  1. Instance A hits a full GC and pauses for 300ms. Its health check fails, and the load balancer removes it.
  2. Instance B absorbs A's traffic. Its heap spikes, GC triggers sooner, and B pauses too.
  3. Instance C now handles all the traffic alone. Its heap explodes, and a full GC follows.
  4. Service outage.

One GC pause leads to traffic redistribution, which creates more GC pressure, which causes cascading failure.

How to Detect GC Problems

Before you can fix GC pauses, you need to see them. The symptoms have a distinctive fingerprint:

The GC pause fingerprint:

  • p50 is normal, but p99 spikes periodically. The classic GC fingerprint: not every request is slow.
  • Health checks fail intermittently. The LB removes instances, traffic shifts, and the cascade begins.
  • No errors in logs during the spikes. The app freezes silently: no exception, no error, nothing.
  • Memory usage shows a sawtooth pattern. It grows steadily, drops suddenly, then grows again.

CPU isn't maxed, the DB is fine, and there are no network issues, yet the app periodically freezes for 100-500ms.
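One low-tech, language-agnostic detector worth knowing: a heartbeat loop that sleeps for a fixed interval and records how far each wakeup overshoots. GC pauses (or any stall) show up as overshoot spikes. A sketch, with arbitrary interval and iteration counts:

```python
import time

def measure_stalls(interval_s=0.01, iterations=50):
    """Sleep repeatedly and record how late each wakeup was, in ms."""
    overshoots_ms = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        time.sleep(interval_s)
        elapsed = time.perf_counter() - t0
        overshoots_ms.append(max(0.0, (elapsed - interval_s) * 1000.0))
    return overshoots_ms

stalls = measure_stalls()
print(f"worst wakeup overshoot: {max(stalls):.2f} ms")
```

Run as a background thread in production, this catches freezes that produce no log lines: if the heartbeat overshoots by 200ms, something (GC, the OS scheduler, a frozen VM) stopped your process for 200ms.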

Universal Fixes (Any Language)

Regardless of what language you're using, these principles reduce GC pressure:

  1. Reduce allocations in hot paths — every object you create is future garbage. Reuse buffers, avoid unnecessary intermediate objects, pre-allocate collections with known sizes.

  2. Fix memory leaks — if your heap keeps growing after each GC cycle, objects are reachable but no longer needed. Common culprits: unbounded caches, forgotten event listeners, closures capturing large scopes.

  3. Right-size the heap — too small means constant GC. Too large means catastrophic pauses when full GC eventually runs. Rule of thumb: 3-4x your live data set.

  4. Monitor GC in production — every language has GC logging/metrics. Enable them. You can't fix what you can't see.

  5. Avoid explicit GC calls — System.gc(), runtime.GC(), global.gc() force a full collection at the worst possible time.
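Principle 1 in miniature, using CPython's tracemalloc to compare the same computation with and without a large intermediate allocation (the sizes are illustrative):

```python
import tracemalloc

n = 100_000

# Same result, two allocation profiles. The list comprehension
# materializes all n squares at once; the generator keeps only one
# value alive at a time, creating far less intermediate garbage.
tracemalloc.start()
total_list = sum([x * x for x in range(n)])
_, peak_list = tracemalloc.get_traced_memory()
tracemalloc.stop()

tracemalloc.start()
total_gen = sum(x * x for x in range(n))
_, peak_gen = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"list peak: {peak_list} bytes, generator peak: {peak_gen} bytes")
```

Every byte of that intermediate list is future garbage the collector must eventually trace and free; in a hot path, the difference compounds on every request.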

How Different Languages Handle GC

Every managed language has a garbage collector, but they make very different tradeoffs. Here's how the major ones compare:

GC Across Languages — How Each One Works

| Language | GC algorithm | Max STW pause | Key tradeoff | Tuning knobs |
|---|---|---|---|---|
| JavaScript (V8 / Node.js) | Generational: Scavenge + Mark-Sweep-Compact | 50-300ms+ | Single-threaded event loop blocks completely | --max-old-space-size, --max-semi-space-size |
| Go (1.19+) | Concurrent tri-color mark-and-sweep | <1ms | Uses more CPU for concurrent GC work | GOGC, SetMemoryLimit, GODEBUG=gctrace=1 |
| Java (JVM) | G1 (default), ZGC, Shenandoah, Parallel, Serial | G1: 50-200ms; ZGC: <1ms | Most GC options: pick the collector for your needs | -Xmx, -XX:+UseZGC, MaxGCPauseMillis |
| C# / .NET (CLR) | Generational (Gen0/1/2), background + concurrent | Gen0/1: <1ms; Gen2: 10-100ms+ | Background GC is mostly concurrent in .NET 6+ | GCSettings.LatencyMode, Server vs. Workstation GC |
| Python (CPython) | Reference counting + cycle collector | Usually minimal | GIL limits concurrency anyway, so GC is less visible | gc.disable(), gc.set_threshold() |

Key insight: Go and Java (with ZGC) have achieved sub-millisecond pauses. If GC pauses are your bottleneck, the language and collector choice matters enormously.
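For CPython specifically, the tuning knobs listed above look like this in practice; the threshold values chosen here are illustrative, not recommendations:

```python
import gc

# set_threshold() controls how often each generation of the cycle
# collector runs; disable() turns the cycle collector off entirely
# (reference counting still frees all non-cyclic garbage).
default = gc.get_threshold()

gc.set_threshold(50_000, 20, 20)   # collect gen0 far less often
tuned = gc.get_threshold()

gc.disable()                       # no cycle collection at all from here
cycle_collector_off = not gc.isenabled()

gc.enable()                        # restore normal behavior for this sketch
gc.set_threshold(*default)
print(tuned, cycle_collector_off)
```

Disabling the cycle collector trades pause-free operation for the risk of leaking any reference cycles your code creates, so it only makes sense for short-lived processes or code audited to be cycle-free.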

Key Takeaways

  1. GC pauses are real outages. During a Stop-the-World pause, your app serves zero requests. Health checks fail, timeouts fire, cascading failures begin.

  2. Most objects die young. That's why generational GC exists — young gen collection is fast, old gen collection is painful. Keep objects short-lived whenever possible.

  3. Full GC should be rare. If you're seeing frequent full GCs, you have a heap sizing problem, a memory leak, or both.

  4. Size the heap at 3-4x your live data set. Too small = constant GC. Too large = catastrophic pauses when full GC finally runs.

  5. Fix the code before tuning the runtime. Unbounded caches, forgotten event listeners, closures capturing large scopes — these create garbage faster than any collector can clean up.

  6. Always monitor GC in production. Enable GC logging. Watch for the sawtooth memory pattern. Track p99 spikes that correlate with GC events.

  7. The best garbage is garbage never created. Reuse buffers, pre-allocate collections, avoid intermediate allocations in hot paths. The fastest GC cycle is the one that never runs.