First, What Does "Reactive" Even Mean?
Before we go further, let's lock down the word reactive, because the whole post is built on it.
Reactive scaling works like a thermostat. It doesn't predict anything. It doesn't know a cricket match is starting. It just watches a number — CPU, memory, requests per second, queue depth — and reacts after that number crosses a line.
That's why reactive is always late. By definition, it cannot act before the event — it has to see the event first.
Now compare that to the others:
- Scheduled — "every Friday at 7pm, scale to 400 pods." The clock is the trigger. It doesn't need to see traffic.
- Predictive — "based on last 30 days of traffic, the system forecasts what's coming and scales ahead." A forecast is the trigger.
- Manual — "it's IPL final, please scale us up." A human is the trigger.
Only reactive waits for traffic to actually arrive before acting. Which is exactly why it fails on spikes.
The One Thing People Get Wrong About Scaling
Engineers treat scaling as a single concept. It isn't. "Scaling" is actually two orthogonal decisions stacked on top of each other:
- Direction — do you add more machines (horizontal) or bigger machines (vertical)?
- Trigger — what causes the scale action? A metric you observe (reactive), a forecast (predictive), a clock (scheduled), or a human (manual)?
You can mix these. Horizontal + reactive is what Kubernetes HPA does out of the box. Horizontal + scheduled is what you do before a known sale. Vertical + manual is what a DBA does at 2am when the DB is melting. The failure mode you meet in production is usually the wrong trigger, not the wrong direction.
Let's lay out the whole map before diagnosing what breaks.
Horizontal vs Vertical — The Part Everyone Knows
Quick refresher, because the rest of the post leans on it.
In 2026, most production web workloads default to horizontal. Stateless HTTP services, workers, frontends, gRPC — all horizontal. Vertical is still how you scale most databases (until you shard), some caches, and anything with per-process state that's expensive to move.
The interesting question isn't direction. It's this: how do you decide to add or remove capacity?
Reactive Scaling: The Default Everyone Uses
Reactive scaling is simple and elegant. You pick a metric (CPU, memory, requests/sec, queue depth), set a target (70% CPU), and let the controller add or remove replicas to hit it.
Kubernetes HPA is the poster child. So is AWS target-tracking ASG. The control loop is the same in both: observe a metric, compare it to the target, adjust the replica count, wait, observe again.
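In code terms, here's a minimal Python sketch of that loop. The real controller lives in kube-controller-manager and is written in Go; the formula `desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)` and the default scale-up cap are from the Kubernetes docs, everything else here is illustrative:

```python
import math
import time

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """The core HPA formula: scale replicas in proportion to how far the metric is from target."""
    return math.ceil(current_replicas * (current_metric / target_metric))

def cap_scale_up(current_replicas: int, desired: int) -> int:
    """Default HPA v2 scale-up policy: per 15s window, grow by at most
    max(100% of current replicas, 4 pods) -- whichever allows more."""
    limit = current_replicas + max(current_replicas, 4)
    return min(desired, limit)

def reconcile_loop(get_metric, get_replicas, set_replicas, target=0.70, sync_period=15):
    """The reactive loop: observe, compare, act, sleep, repeat. Nothing here predicts anything."""
    while True:
        current = get_replicas()
        desired = desired_replicas(current, get_metric(), target)
        if desired > current:
            set_replicas(cap_scale_up(current, desired))
        elif desired < current:
            # Real HPA also applies a 5-minute scale-down stabilization window here.
            set_replicas(desired)
        time.sleep(sync_period)
```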
This loop is fine when traffic grows smoothly. HPA's default sync period is 15 seconds, with a 5-minute stabilization window on scale-down (so pods aren't removed too quickly) and none on scale-up — but scale-up is still capped by the default policy at 100% of current replicas or 4 pods per 15-second window, whichever allows more. ASG target tracking evaluates CloudWatch alarms (1-minute metric granularity by default), and on top of that you pay 60–180s for the EC2 instance to boot and register with the cluster.
Add those up:
- Metric propagation: 15 – 60s (metrics-server scrape interval, Prometheus scrape, CloudWatch 1-min granularity)
- Decision + cooldown: 30 – 300s (HPA 15s sync + stabilization windows; ASG alarm evaluation periods)
- Provisioning: 30 – 180s (image pull, node provisioning if cluster itself is full, pod scheduling)
- Warm-up: 15 – 180s (app start, JIT, cache fill, readiness probe passing, connection pool ramp-up)
Best case: ~90 seconds. Worst case (node scale + cold image pull + slow JVM start + cache warm): 5 – 8 minutes. If your traffic peaks in 30 seconds, you've already served your users 500s, retries, and timeouts by the time the pods are ready.
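To see how quickly those stages bury a 30-second spike, here's a back-of-the-envelope sketch using one plausible combination of the delays above — illustrative numbers, not measurements from any specific stack:

```python
# One plausible "bad day" path through the pipeline above (all values in seconds).
steps = {
    "metric propagation (1-min CloudWatch granularity)": 60,
    "decision + cooldown": 60,
    "node provisioning + image pull": 180,
    "app warm-up + readiness": 120,
}
time_to_capacity = sum(steps.values())   # 420s, i.e. ~7 minutes
spike_ramp = 30                          # the spike peaks in 30 seconds

print(f"capacity usable after {time_to_capacity}s; "
      f"the spike peaked {time_to_capacity - spike_ramp}s before that capacity arrived")
```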
What "Spiky" Actually Means
Not all fast-growing traffic is a spike. A hockey-stick pattern over a day is not a spike. A 2x diurnal peak is not a spike. For this discussion, a spike is traffic whose rate of change outruns your scaling loop's time-to-capacity.
Empirically, most production "spikes" fall into one of four shapes — the step, the thundering herd, the wave, and the sawtooth — and each breaks reactive scaling in a slightly different way:
- Step — reactive eventually catches up, but you've taken minutes of 5xx and timeouts before it does. CPU is not even the right signal here; request queue depth is.
- Thundering herd — by the time new capacity is warm, the spike is already over. You scaled out for nothing, then scale back in, then the next push hits. You're always fighting the last war.
- Wave — the sweet spot for reactive. Gradual enough for the loop to keep up.
- Sawtooth — HPA adds pods on the peak, removes them during the trough, then adds them again. Flapping. You're paying for capacity twice and paging yourself every cycle because readiness is bouncing.
Case Study 1: Dream11 on IPL Final Day
It's 7:25pm. IPL final day. CSK vs MI. Toss is in 5 minutes. Across India, millions of users open Dream11 at the same time to lock in their team before the deadline. If you've ever used the app, you know the feeling — everyone waits till the last moment.
So what's the scaling problem here? Simple: every user shows up in the same 60-second window. HPA can't save you. Dream11 knows this, and they've written about how they solve it.
Why IPL Final Traffic Is Different
A normal day on Dream11 is a gentle wave — morning ramp, evening peak. HPA handles it fine. IPL final breaks three assumptions at once:
- Hard deadline. You cannot submit after toss. Every user piles into the last 5 minutes. The spike is not a bug — it's baked into how the product works.
- Writes, not reads. Saving a team hits the database. CDNs can't cache a write. You can queue it (a sketch of that pattern follows this list), but someone still has to store every team.
- Real money. "Try again" is not acceptable when there's money on the line.
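To make "queue the write" concrete, here's a minimal sketch of the pattern — not Dream11's actual code; the names and drain rate are made up. The API acknowledges a team save once it's enqueued, and a consumer drains into the database at a rate the database can sustain:

```python
import queue
import threading
import time

team_saves = queue.Queue()   # stand-in for a durable log like Kafka; in-memory only for illustration

def handle_save_team(user_id: str, team: dict) -> str:
    """API side: accept the write by enqueueing it instead of hitting the database directly.
    In production the ack comes only after the submission is durably written to the log."""
    team_saves.put({"user_id": user_id, "team": team, "ts": time.time()})
    return "accepted"

def save_to_database(item: dict) -> None:
    pass                     # placeholder for the real insert

def drain_to_db(writes_per_sec: int = 2000) -> None:
    """Consumer side: write to the database at a rate it can sustain, no matter how spiky
    the incoming submissions are. The queue absorbs the difference between the two rates."""
    interval = 1.0 / writes_per_sec
    while True:
        item = team_saves.get()
        save_to_database(item)
        time.sleep(interval)

threading.Thread(target=drain_to_db, daemon=True).start()
```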
The Key Mindset Shift
If you're an engineer scaling something for a Dream11-style workload, the question you need to ask is not "how fast can my HPA react?" It's: "how much capacity do I need before the deadline, and how do I make sure it's warm before the first user shows up?"
This pattern repeats everywhere — Zomato during Indian Cricket wins, Zerodha at 9:15am market open, Swiggy at 8pm Sunday. Known spike on a calendar = scheduled scaling, not reactive.
Where to read more:
- Dream11's engineering team publishes at engineering.dream11.com — read their posts on handling IPL-day traffic, Kafka-based write absorption, and sharded Postgres architecture.
- Search for "Dream11 scaling IPL" on Medium — multiple deep-dives from their SREs walking through pre-scale playbooks, contest-list cache warming, and circuit breakers.
- Talks from Dream11 at events like Rootconf and AWS re:Invent cover the same playbook — watch for "Dream11 fantasy sports scale" on YouTube.
Case Study 2: Hotstar and JioCinema — Scaling IPL Live Streaming
If Dream11 is the write-heavy spike, live-streaming an IPL match is the read-heavy spike. India has produced the two biggest examples the internet has ever seen — Hotstar in 2019 and JioCinema in 2023. Both teams have spoken openly about how they did it, which is why we can learn from real numbers instead of guessing.
Hotstar's India vs New Zealand World Cup 2019 semi-final peaked at 25.3 million concurrent viewers — a world record at the time. Their engineers have publicly described "wow moments" during big matches (a wicket, a six, Dhoni walking in) where concurrency can double or more inside a single over — adding several million viewers in just a few minutes.
JioCinema broke that record during the IPL 2023 final at a reported 32 million concurrent viewers. The techniques are nearly identical to Hotstar's — because reactive scaling cannot solve this shape of spike, full stop.
What They Actually Did (From Public Engineering Talks)
The Hotstar engineering team has spoken about this publicly at events like Scale by the Bay and in their blog. The architecture boils down to five moves — every one of them about not relying on reactive scaling:
- Pre-scale to a forecast, not a metric — capacity is provisioned hours before the match, sized by an ML-generated estimate of peak concurrency.
- Push reads to the CDN edge — the stream and everything cacheable is served from pre-positioned CDN capacity, so origins see a fraction of the viewers.
- Add client-side jitter — apps smear their requests over a window so a "wow moment" doesn't hit the backend as one synchronized wall.
- Tier every service (P0/P1/P2) — playback, login, and payments are protected; recommendations and social features are expendable.
- Shed load deliberately ("Panic Mode") — when concurrency outruns the plan, non-critical features are switched off within seconds instead of letting everything degrade together.
Where Reactive Scaling Fits in This Picture
It doesn't, really. HPA on the Hotstar origin layer exists — but it's there to handle the drift between the pre-scaled baseline and actual traffic, not to absorb the spike. If HPA has to do real work during a match, something has already gone wrong with the plan.
Where to read more:
- Hotstar's widely-shared engineering post "How Hotstar Scales During IPL" — available on their tech blog and Medium, covers the 25.3M concurrent milestone, the "Panic Mode" shedding system, and their CDN strategy.
- Hotstar engineers have spoken at QCon, InfoQ, and Scale by the Bay — search for "Hotstar scale 25 million concurrent" on YouTube for the original talks with architecture diagrams.
- JioCinema's engineering team has published on Medium and spoken at industry events about hitting ~32M concurrent during IPL 2023 — search "JioCinema scaling IPL 2023" for the deep-dives on their CDN pre-positioning and client-side jitter.
- CDN and cloud-provider post-event reports (Akamai, AWS re:Invent sessions) often reference the JioCinema IPL peak traffic numbers — useful for independent verification.
Case Study 3: The OTT Launch That Cratered in 90 Seconds
A streaming platform was launching a tentpole show. Their capacity plan was "HPA target 60% CPU, min 40 replicas, max 400." They load-tested at 3x baseline. Launch went out at 8:00pm IST. The home feed service died at 8:00:47pm.
What happened?
The post-mortem showed three independent reasons HPA couldn't save this:
- Metric lag. CloudWatch custom metrics were on a 60s resolution. By the time HPA saw CPU at 90%, traffic had already been there for 45 seconds.
- Cold starts. The service was JVM-based. P50 warm-up was 55s, P99 was 2m10s (classloader + JIT + connection pool). Pods got scheduled fast; they weren't serving fast.
- Cluster autoscaler. They hit node capacity at T+40s. Cluster autoscaler kicked in, but EC2 instance boot + kubelet registration + image pull took another 3 minutes.
HPA was doing its job correctly. Its job just took 8 minutes. The launch took 17 seconds.
What Actually Fixed It
They did not tune HPA harder. They changed the architecture: capacity for the 8:00pm launch was scheduled up in advance instead of waiting on a metric, and the home feed went behind an aggressive cache layer so the launch step never reached the origin at full force.
Key idea: if the event is on a calendar, scaling shouldn't wait for a metric. Reactive is the wrong trigger for known events. Scheduled pre-scale, combined with a cache layer that flattens the step into a trickle, turns this from a capacity problem into a cache problem — and caches are much easier to scale.
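What "a cache layer that flattens the step" means in practice is cache-aside plus request coalescing, so that a cold or expired key produces one origin call instead of a stampede. A minimal in-process sketch — at real scale a CDN or Redis plays this role, and `fetch_home_feed_from_origin` is a hypothetical stand-in for the origin call:

```python
import threading
import time

_cache: dict[str, tuple[float, object]] = {}       # key -> (expiry_timestamp, value)
_key_locks: dict[str, threading.Lock] = {}
_registry_lock = threading.Lock()

def get_home_feed(key: str, ttl: float = 5.0):
    """Cache-aside with single-flight: a fresh entry is served from memory; on a miss,
    only one caller per key goes to the origin while the rest wait on the same lock
    and then re-read the cache. This is what turns a launch step into a trickle."""
    entry = _cache.get(key)
    if entry and entry[0] > time.time():
        return entry[1]                              # hit: the origin never sees this request

    with _registry_lock:                             # get (or create) the per-key lock
        key_lock = _key_locks.setdefault(key, threading.Lock())

    with key_lock:
        entry = _cache.get(key)                      # re-check: another caller may have filled it
        if entry and entry[0] > time.time():
            return entry[1]
        value = fetch_home_feed_from_origin(key)     # hypothetical origin call
        _cache[key] = (time.time() + ttl, value)
        return value

def fetch_home_feed_from_origin(key: str):
    return {"feed": [], "key": key}                  # placeholder for the real origin request
```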
Case Study 4: Notification Fan-Out — The Spike You Create Yourself
A ticketing platform sent a push notification to 18 million users at 10:00:00 AM sharp announcing ticket drops. Tapping the notification opened a screen that hit three APIs: /events, /availability, /recommendations. HPA kicked in and actually scaled up. That wasn't the problem. The problem was that by 10:00:12, every downstream service was dead — including the database.
Three things went wrong even though scaling "worked":
- Scaling the stateless tier doesn't help if the stateful tier can't keep up. HPA added API pods. Each new pod opened 30 DB connections from its HikariCP pool. The DB had a hard cap of 2,000 connections. Adding pods made the DB worse — the arithmetic is sketched right after this list.
- Client retry policies are load amplifiers. The mobile SDK retried failed calls up to 3 times. A 5xx became 4 requests. The spike was multiplied by the clients trying to escape it.
- The spike was over before HPA finished. By T+90s, the app opens were done. But the cluster was now running at 4x its needed size with a wrecked downstream DB.
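The first two failure modes are just arithmetic, and it's worth writing that arithmetic down because it's exactly what the load test missed — a sketch using the numbers from the post-mortem:

```python
# Connection-pool arithmetic: every pod HPA adds brings its own pool.
pool_size_per_pod = 30        # HikariCP connections per API pod
db_connection_cap = 2000      # hard limit on the database

max_safe_pods = db_connection_cap // pool_size_per_pod
print(f"DB connection cap supports at most {max_safe_pods} pods")   # 66 -- HPA happily went past this

# Retry amplification: the client SDK retries failed calls up to 3 times.
retries = 3
apis_per_app_open = 3         # /events, /availability, /recommendations
app_opens = 18_000_000

requests_if_healthy = app_opens * apis_per_app_open
requests_if_failing = requests_if_healthy * (1 + retries)   # every 5xx becomes 4 attempts
print(f"{requests_if_healthy:,} requests if healthy, "
      f"up to {requests_if_failing:,} once things start failing")
```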
The Fix: Don't Scale, Smooth
The right architecture for thundering herds is the opposite of "scale faster." It's dampening: send the push in cohorts spread over minutes instead of to 18 million users at once, have the client add a small random delay (jitter) before it calls the APIs, and put a queue between the stateless tier and the database so writes drain at a rate the database can survive.
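A sketch of the first two dampeners — cohorting on the sending side, jitter on the client. The cohort count and jitter window are illustrative, not recommendations:

```python
import random
import time

def send_push_in_cohorts(user_ids, cohorts: int = 60, gap_seconds: float = 10.0):
    """Server side: instead of notifying every user at 10:00:00 sharp, split them into
    cohorts and spread the sends over ~10 minutes. Same announcement, flatter app-open curve."""
    shuffled = list(user_ids)
    random.shuffle(shuffled)
    cohort_size = max(1, len(shuffled) // cohorts)
    for i in range(0, len(shuffled), cohort_size):
        deliver_push(shuffled[i:i + cohort_size])   # hypothetical push-provider call
        time.sleep(gap_seconds)

def client_open_delay(max_jitter_seconds: float = 15.0) -> float:
    """Client side: when the app opens from a push, wait a random 0-15s before hitting
    the backing APIs, so "everyone at once" becomes "everyone within a window"."""
    return random.uniform(0, max_jitter_seconds)

def deliver_push(batch):
    pass                                            # placeholder for the FCM/APNs fan-out
```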
Case Study 5: Sawtooth — The HPA That Never Stopped Flapping
A data platform ran ETL cron jobs every 5 minutes, each triggering a burst of API calls from workers to a metadata service. HPA saw CPU > 70% and scaled from 10 → 40 pods. 90 seconds later the burst was over. HPA's scale-down stabilization window was 5 minutes, so just as it finally started removing pods, the next cron fired and needed them back.
Result: pods constantly being added/removed, readiness probes flapping, tail-latency p99 doubled, and a spend 2.3x higher than necessary. No outage. Just constant pain.
Fixes, in descending order of how often they actually work:
- Schedule the capacity — a cron-based autoscaler (KEDA's cron scaler, or the kubernetes-cronhpa-controller) keeps a floor of 25 replicas for the 90 seconds around each cron tick, effectively pausing HPA's reactive component during that window (a sketch of the scheduled-floor idea follows this list).
- Smooth the source — spread the cron workers' start over 30 seconds with jitter so they don't bang the API simultaneously.
- Right-size the stabilization window — bump scale-down stabilization so HPA doesn't remove pods during the trough. Accept slightly higher steady-state cost in return for no flapping.
- Use a queue — decouple workers from metadata service with a queue. Now the metadata service sees near-constant load regardless of worker bursts.
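A minimal sketch of the scheduled-floor idea using the official Kubernetes Python client (autoscaling/v2 in a recent client version): patch the HPA's `minReplicas` up just before each cron tick and drop it again afterwards. Names like `metadata-service-hpa` are made up; KEDA's cron scaler or kubernetes-cronhpa-controller gives you the same effect declaratively:

```python
from kubernetes import client, config

def set_replica_floor(min_replicas: int, hpa_name: str = "metadata-service-hpa",
                      namespace: str = "data-platform") -> None:
    """Patch the HPA's minReplicas so reactive scaling can't drop below the floor.
    Run this from a small cron/job shortly before each ETL tick, and again after
    the burst to restore the normal floor."""
    config.load_incluster_config()                      # or load_kube_config() off-cluster
    autoscaling = client.AutoscalingV2Api()
    autoscaling.patch_namespaced_horizontal_pod_autoscaler(
        name=hpa_name,
        namespace=namespace,
        body={"spec": {"minReplicas": min_replicas}},
    )

# Just before each cron tick:          set_replica_floor(25)
# A couple of minutes after the burst: set_replica_floor(10)
```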
The Two Weapons Nobody Talks About: ML Capacity Forecasting + P0 Service Tiers
If you read the engineering blogs from Dream11, Hotstar, and JioCinema carefully, you'll notice two patterns that aren't in any Kubernetes tutorial — and these are what really let them survive events like the IPL final.
Weapon #1: Their AI/ML Team Tells the Infra Team How Much to Scale
Here's something most engineers don't realise. Dream11, Hotstar, and JioCinema don't just "pre-scale for IPL." They pre-scale for this specific match, based on numbers a machine-learning team generates days in advance.
The ML team looks at things like:
- Historical match data — how much traffic did IND vs PAK bring last year? How about IND vs Afghanistan? Not every match is equal.
- Social media sentiment — how much is this match being talked about on Twitter/X, Reddit, YouTube in the days leading up?
- User behaviour signals — app opens, contest joins, reminders set, push-notification opt-ins for this match.
- External factors — weather (rain = fewer streams), time of day, weekday vs weekend, holiday calendar.
Out of this, the model outputs one simple number: "expected peak concurrent users for this match = 28 million." The infra team scales to ~1.5× that number an hour before the match starts. Done. No metric-watching, no guessing.
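The hand-off from the ML team to the infra team is then plain arithmetic. A sketch with illustrative numbers — the per-user request rate, per-pod capacity, and headroom factor below are assumptions, not published figures:

```python
import math

forecast_peak_concurrent = 28_000_000   # the one number the ML model hands over
headroom = 1.5                          # scale to ~1.5x the forecast
requests_per_user_per_sec = 0.2         # assumed: one API call per user every 5s at peak
pod_capacity_rps = 500                  # assumed: load-tested requests/sec per pod

peak_rps = forecast_peak_concurrent * requests_per_user_per_sec * headroom
replicas = math.ceil(peak_rps / pod_capacity_rps)

print(f"pre-scale to {replicas:,} replicas well before the match starts")  # 16,800 here
```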
Weapon #2: P0 / P1 / P2 Service Tiers — Turn Off What Doesn't Matter
The second technique is even simpler and more powerful: not every feature in the app has the same importance. When traffic spikes, you keep the critical stuff alive and turn off everything else.
Every service is tagged with a priority:
- P0 — cannot fail. Stream playback. Contest join. Payment. Login. If this breaks, the business breaks.
- P1 — important but non-essential. Recommendations, leaderboards, match stats overlay, comments.
- P2 — nice-to-have. Personalised home feed, "friends are watching," social share counts, trending contests carousel.
When load crosses a threshold (or the "Panic Button" is pressed by an SRE), P1 and P2 services get shed automatically. The recommendation API returns a cached default. The comments section shows "comments temporarily unavailable." The personalised feed becomes a generic feed. The stream keeps playing. The contest keeps accepting submissions.
This is what Hotstar calls their "Panic Mode". It's a feature, not a bug. They designed the system to degrade gracefully under load. When an SRE presses the panic button (or it auto-triggers above a threshold), the whole fleet sheds P1/P2 services within seconds and focuses every CPU cycle on P0.
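Here's roughly what a tier-aware shed looks like at the request layer — a sketch with a hypothetical route-to-tier mapping; in a real system the panic flag lives in a config or feature-flag service, not a module-level variable:

```python
PANIC_MODE = False            # flipped by an SRE, or auto-triggered above a load threshold

ROUTE_TIERS = {               # hypothetical mapping of routes to priority tiers
    "/play/stream":     "P0",
    "/contest/join":    "P0",
    "/payments/charge": "P0",
    "/recommendations": "P1",
    "/leaderboard":     "P1",
    "/social/friends":  "P2",
}

CACHED_DEFAULTS = {
    "/recommendations": {"items": []},                         # generic, pre-computed fallback
    "/leaderboard":     {"status": "temporarily unavailable"},
}

def handle(route: str, request):
    """When panic mode is on, P1/P2 requests never reach their services: they get a cheap
    cached default or an explicit 429, and every CPU cycle that frees up goes to P0."""
    tier = ROUTE_TIERS.get(route, "P2")
    if PANIC_MODE and tier != "P0":
        if route in CACHED_DEFAULTS:
            return 200, CACHED_DEFAULTS[route]
        return 429, {"error": "shed", "retry_after_seconds": 30}
    return call_downstream_service(route, request)   # normal path

def call_downstream_service(route: str, request):
    return 200, {"route": route}                     # placeholder
```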
Putting It Together: How a Real IPL Day Actually Runs
Here's the real playbook, end-to-end, as documented in the engineering blogs: days before the match, the ML team publishes a forecast for that specific fixture; hours before the first ball, infra pre-scales to roughly 1.5× the forecast and warms caches and CDN capacity; during the match, HPA only absorbs drift around that pre-scaled baseline; and if concurrency outruns the plan, the panic button sheds P1/P2 within seconds so P0 keeps serving.
Where Reactive Scaling Actually Works
Reactive isn't broken. It's just oversold. Here's where it genuinely shines: gradual diurnal waves, slow organic growth, and soaking up the drift between a pre-scaled baseline and the traffic that actually shows up — exactly the job it does on Hotstar's origin layer during a match.
The Playbook — What to Do Instead
There is no one scaling strategy. There is a ladder. For any workload, walk it in this order:
1. If the event is on a calendar, schedule capacity for it.
2. If you caused the spike yourself, dampen it at the source (cohorts, jitter).
3. Buffer what's left — caches for reads, queues for writes — so the step becomes a wave.
4. Shed explicitly with 429 + Retry-After when you're still over.
5. Let reactive scaling handle the drift that remains.
A Note on Databases
Everything above is about stateless tiers. Databases are different. You don't "reactively scale" a primary RDS instance at 10:00:05. You pre-provision, you add read replicas ahead of time, you shard before the shard is hot. The one form of reactive scaling databases support well is storage autoscaling, and even that has a cooldown.
For databases, the rule is simpler: the fix is always upstream. Cache reads (Redis, CDN, materialized views). Queue writes. Batch. Deduplicate. If your DB is the thing you're trying to scale reactively, you've already lost.
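For the "batch and deduplicate" part, here's a small sketch of what sits between the queue and the database — the upsert statement, batch size, and key choice are placeholders that depend entirely on your schema:

```python
import time

def drain_batched(queue_get, db_execute_many, batch_size: int = 500, max_wait: float = 0.25):
    """Pull writes off the queue, deduplicate on key (last write wins), and flush them to
    the database as one batched statement instead of thousands of single-row round trips."""
    while True:
        batch: dict[str, dict] = {}
        deadline = time.time() + max_wait
        while len(batch) < batch_size and time.time() < deadline:
            # queue_get is e.g. a wrapper around Queue.get that returns None on timeout,
            # yielding items shaped like {"key": "user:42", "values": {...}}.
            item = queue_get(timeout=max_wait)
            if item is None:
                break
            batch[item["key"]] = item["values"]      # dedup: a later write for a key replaces the earlier one
        if batch:
            db_execute_many(
                "INSERT ... ON CONFLICT (key) DO UPDATE ...",   # upsert shape, schema-specific
                [(key, values) for key, values in batch.items()],
            )
```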
Key Takeaways
Scaling has two axes: direction and trigger. Most outages are wrong-trigger, not wrong-direction. Pick each independently.
Reactive scaling takes 90 – 500 seconds. Spikes peak in 10 – 30. The math doesn't work — not for any amount of HPA tuning.
Know your spike shape. Step, thundering herd, wave, sawtooth. Each needs a different strategy. Wave is the only one reactive handles cleanly.
If it's on a calendar, schedule it. CronHPA, scheduled ASG actions, pre-warming. Reactive is the wrong trigger for known events.
If you caused the spike, dampen it. Client jitter, cohorted push, queues. The cheapest spike is the one you never created.
Buffer aggressively. Edge caches absorb reads. Queues absorb writes. Both convert steps into waves — and HPA is fine on waves.
Shed with 429, not silence. Explicit rejection with Retry-After protects the system. Silent queueing turns into timeouts turns into retry storms.
Databases don't scale reactively. Stop trying. Fix it upstream with caches, queues, and pre-provisioned replicas.
HPA is an optimization, not a strategy. It handles drift on a well-shaped traffic curve. It is not a plan for surviving spikes.
The best scaling architecture is the one where your metric never has to catch up — because the spike was flattened before it reached the thing that scales.