Why Your Autoscaling Isn't Working

The truth about spiky traffic — and how Dream11, OTT platforms, and ticketing sites actually survive it.

[Figure: actual traffic vs reactive capacity, in requests/sec. The spike hits, users get throttled with timeouts and 5xx, and capacity arrives late.]

Traffic moves in seconds. Pods, nodes, and instances move in minutes. That delta is where outages live.

First, What Does "Reactive" Even Mean?

Before we go further, let's lock down the word reactive, because the whole blog is built on it.

Reactive scaling works like a thermostat. It doesn't predict anything. It doesn't know a cricket match is starting. It just watches a number — CPU, memory, requests per second, queue depth — and reacts after that number crosses a line.

Reactive = "only after the event happens, then act." Traffic has to spike first. Then the metric goes up. Then the controller notices. Then it asks for more pods. Then the pods boot. Only then does capacity show up.

That's why reactive is always late. By definition, it cannot act before the event — it has to see the event first.

Now compare that to the others:

  • Scheduled — "every Friday at 7pm, scale to 400 pods." The clock is the trigger. It doesn't need to see traffic.
  • Predictive — "based on last 30 days of traffic, the system forecasts what's coming and scales ahead." A forecast is the trigger.
  • Manual — "it's IPL final, please scale us up." A human is the trigger.

Only reactive waits for traffic to actually arrive before acting. Which is exactly why it fails on spikes.

The One Thing People Get Wrong About Scaling

Engineers treat scaling as a single concept. It isn't. "Scaling" is actually two orthogonal decisions stacked on top of each other:

  1. Direction — do you add more machines (horizontal) or bigger machines (vertical)?
  2. Trigger — what causes the scale action? A metric you observe (reactive), a forecast (predictive), a clock (scheduled), or a human (manual)?

You can mix these. Horizontal + reactive is what Kubernetes HPA does out of the box. Horizontal + scheduled is what you do before a known sale. Vertical + manual is what a DBA does at 2am when the DB is melting. The failure mode you meet in production is usually the wrong trigger, not the wrong direction.

Let's lay out the whole map before diagnosing what breaks.

The Scaling Map — Direction × Trigger (horizontal = add more boxes; vertical = a bigger box)

  • Reactive (metric → scale). Horizontal: HPA, ASG target-tracking — fails on spikes, 60–300s lag. Vertical: RDS storage autoscale — usually downtime to resize. Examples: K8s HPA, AWS ASG, GCP MIG.
  • Predictive (forecast → pre-scale). Horizontal: ML/forecast-driven HPA — needs stationary traffic, rare in practice. Vertical: forecast the next DB size. Examples: AWS Predictive Scaling, KEDA + Prophet.
  • Scheduled (clock → pre-scale). Horizontal: cron-based scale-out — best for known events. Vertical: planned DB upgrade in a maintenance window. Examples: K8s CronHPA, ASG scheduled actions.
  • Manual (human → scale). Horizontal: an SRE triples replicas during an incident. Vertical: a DBA bumps the instance size, or fails over to a larger replica. Examples: kubectl scale, Terraform, runbook execution.
Mental model: direction decides what scales (CPU? memory? connections?). Trigger decides when. Most production outages I've seen are not a wrong-direction problem. They're a wrong-trigger problem — usually reactive where reactive doesn't work.

Horizontal vs Vertical — The Part Everyone Knows

Quick refresher, because the rest of the post leans on it.

Horizontal vs Vertical Scaling

  • Vertical (scale up): one bigger box — 2 vCPU → 16 vCPU, $$$. Pros: simple, no distributed-systems concerns. Cons: hard ceiling, downtime to resize, single point of failure.
  • Horizontal (scale out): many small boxes. Pros: scales (sort of) linearly to infinity, fault tolerant. Cons: state, sessions, coordination, warm-up.

In 2026, most production web workloads default to horizontal. Stateless HTTP services, workers, frontends, gRPC — all horizontal. Vertical is still how you scale most databases (until you shard), some caches, and anything with per-process state that's expensive to move.

The interesting question isn't direction. It's this: how do you decide to add or remove capacity?

Reactive Scaling: The Default Everyone Uses

Reactive scaling is simple and elegant. You pick a metric (CPU, memory, requests/sec, queue depth), set a target (70% CPU), and let the controller add or remove replicas to hit it.
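Under the hood, the controller's core calculation is a simple proportional rule. This is the formula documented for the Kubernetes HPA; the replica counts and CPU numbers in the example are made up for illustration.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Kubernetes HPA's core rule: scale replicas in proportion to how far
    the observed metric is from its target."""
    return math.ceil(current_replicas * (current_metric / target_metric))

# 40 replicas running at 90% CPU against a 70% target:
print(desired_replicas(40, current_metric=90, target_metric=70))   # -> 52 replicas
```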

Kubernetes HPA is the poster child. So is AWS target-tracking ASG. The control loop looks like this:

The Reactive Scaling Loop — Why It's Always Late

  1. Observe: metric scraped every 15–60s (lag: 15–60s)
  2. Decide: cooldown / stabilization window (lag: 30–180s)
  3. Provision: new pod or node — pull image, schedule (lag: 30–120s)
  4. Warm up: app boot, JIT, warm cache, connect to the DB, pass the readiness probe (lag: 15–180s)

Total: 90–500+ seconds from spike to new capacity. Meanwhile, a flash sale or notification fan-out goes from 0 to peak load in 10–30 seconds.

This loop is fine when traffic grows smoothly. HPA's default sync period is 15 seconds, with a 5-minute stabilization window on scale-down (so pods aren't removed too quickly) and none on scale-up — but scale-up is still capped by default at +100% of current replicas or +4 pods, whichever is larger, per 15-second window. ASG target-tracking evaluates CloudWatch alarms (1-minute metric granularity by default), and on top of that you pay 60–180s for the EC2 instance to boot and register with the cluster.
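To see what that scale-up cap means in practice, here's a minimal sketch (my own illustration, not Kubernetes code) of how long the default policy takes just to authorise the replicas, before any provisioning or warm-up. The 40-to-400 jump is a hypothetical example.

```python
# Sketch: how fast can HPA grow replicas under its default scale-up policy?
# Default policy (illustrative model): per 15s window, add at most
# max(100% of current replicas, 4 pods).
def seconds_to_authorise(current: int, target: int, period_s: int = 15) -> int:
    seconds = 0
    while current < target:
        step = max(current, 4)                   # +100% of current, or +4 pods
        current = min(current + step, target)
        seconds += period_s
    return seconds

print(seconds_to_authorise(40, 400))   # -> 60 seconds of scale-up windows alone,
                                       #    before image pulls, nodes, or warm-up
```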

Add those up:

  • Metric propagation: 15 – 60s (metrics-server scrape interval, Prometheus scrape, CloudWatch 1-min granularity)
  • Decision + cooldown: 30 – 300s (HPA 15s sync + stabilization windows; ASG alarm evaluation periods)
  • Provisioning: 30 – 180s (image pull, node provisioning if cluster itself is full, pod scheduling)
  • Warm-up: 15 – 180s (app start, JIT, cache fill, readiness probe passing, connection pool ramp-up)

Best case: ~90 seconds. Worst case (node scale + cold image pull + slow JVM start + cache warm): 5 – 8 minutes. If your traffic peaks in 30 seconds, you've already served your users 500s, retries, and timeouts by the time the pods are ready.

The brutal truth: reactive scaling assumes the thing you're trying to absorb moves slower than the thing that absorbs it. Spiky traffic violates that assumption.

What "Spiky" Actually Means

Not all fast-growing traffic is a spike. A hockey-stick pattern over a day is not a spike. A 2x diurnal peak is not a spike. For this discussion, a spike is traffic whose rate of change outruns your scaling loop's time-to-capacity.
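That definition fits in one line of code. A toy predicate with made-up numbers, just to make the comparison explicit:

```python
# A spike, for this post's purposes: traffic that peaks faster than you can add capacity.
def is_spike(seconds_to_peak: float, seconds_to_capacity: float) -> bool:
    return seconds_to_peak < seconds_to_capacity

print(is_spike(seconds_to_peak=30, seconds_to_capacity=90))    # True  -> reactive loses
print(is_spike(seconds_to_peak=3600, seconds_to_capacity=90))  # False -> reactive is fine
```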

Empirically, most production "spikes" we see fall into one of four shapes:

Four Shapes of Spiky Traffic

  1. The step: a flash sale opens, an OTT match starts. 0 → peak in seconds; stays high for minutes to hours.
  2. Thundering herd: a push-notification fan-out. Seconds-long peak; returns to baseline fast.
  3. The wave: a viral moment, a news event. Minutes to hours to peak; reactive can handle it (barely).
  4. Sawtooth: cron jobs, batch kickoffs. Regular pulses; HPA flaps up and down.

Reactive scaling only works cleanly on shape 3. Shapes 1, 2, and 4 need a different strategy.

Each shape breaks reactive scaling in a slightly different way:

  • Step — reactive eventually catches up, but you've taken minutes of 5xx and timeouts before it does. CPU is not even the right signal here; request queue depth is.
  • Thundering herd — by the time new capacity is warm, the spike is already over. You scaled out for nothing, then scale back in, then the next push hits. You're always fighting the last war.
  • Wave — the sweet spot for reactive. Gradual enough for the loop to keep up.
  • Sawtooth — HPA adds pods on the peak, removes them during the trough, then adds them again. Flapping. You're paying for capacity twice and paging yourself every cycle because readiness is bouncing.

Case Study 1: Dream11 on IPL Final Day

Spike shape: Step (deadline-driven)  |  What broke: Reactive HPA + DB writes  |  What solved it: Scheduled pre-scale + horizontal DB sharding + async queue for writes. Reactive scaling is only the safety net.

It's 7:25pm. IPL final day. CSK vs MI. Toss is in 5 minutes. Across India, millions of users open Dream11 at the same time to lock in their team before the deadline. If you've ever used the app, you know the feeling — everyone waits till the last moment.

So what's the scaling problem here? Simple: every user shows up in the same 60-second window. HPA can't save you. Dream11 knows this, and they've written about how they solve it.

[Figure: Dream11, IPL final, the last five minutes before toss — real traffic vs reactive capacity (HPA/ASG) from T−5m to T+2m. The app degrades and the team-save API times out: 15M users arrive in the last 5 minutes, ~20x the normal concurrent load, and a team save is a database write, so caching doesn't help — the DB is the bottleneck. The solution (pre-scale + shard + queue) is scheduled hours before the match.]

Why IPL Final Traffic Is Different

A normal day on Dream11 is a gentle wave — morning ramp, evening peak. HPA handles it fine. IPL final breaks three assumptions at once:

  • Hard deadline. You cannot submit after toss. Every user piles into the last 5 minutes. The spike is not a bug — it's baked into how the product works.
  • Writes, not reads. Saving a team hits the database. CDNs can't cache a write. You can queue it, but someone still has to store every team.
  • Real money. "Try again" is not acceptable when there's money on the line.

How Real-Money Fantasy Platforms Actually Scale for IPL

  • 2 hours before toss: scheduled scale-out — API tier 40 → 400 pods, worker tier 20 → 200, DB read replicas +8. The trigger is cron. Nobody asked the metrics; the schedule is the trigger.
  • 30 minutes before toss: caches pre-warmed with match metadata, player stats, and contest lists, so all reads are served from Redis/edge. 95%+ of traffic is reads on IPL day — serve them without touching the DB.
  • During the last-5-minute storm: team-save writes go into a Kafka queue, not directly to the DB. The user sees an instant "Team saved ✓"; the DB write happens asynchronously. The queue absorbs the spike, and the DB sees a smooth stream at its sustainable rate.
  • DB sharded by user_id: no single DB box sees all 15M writes. Writes fan out across 32+ shards, each handling its slice — horizontal scaling at the stateful tier, pre-provisioned.
  • If something still saturates, shed with grace: an explicit 429 + "Please retry in 3s" beats a timeout. The app auto-retries once; the user sees a spinner, not an error.

The result: a smooth IPL final, no outage — and HPA is barely involved in any of this.
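To make the queue-plus-shard pattern concrete, here's a minimal sketch: acknowledge the save instantly, drain the queue to the database at a sustainable rate, and route each write to a shard by user_id. This is an illustration of the pattern described above, not Dream11's code; the in-memory queue (standing in for Kafka), the shard count, and the drain rate are all assumptions.

```python
import hashlib
import queue
import threading
import time

NUM_SHARDS = 32                      # assumption: pre-provisioned DB shards
SUSTAINABLE_WRITES_PER_SEC = 5       # assumption: rate the DB absorbs smoothly

write_queue = queue.Queue()          # stand-in for Kafka

def shard_for(user_id: str) -> int:
    """Route a user's writes to one of the pre-provisioned shards."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def save_team(user_id: str, team: list[str]) -> str:
    """API handler: enqueue the write and acknowledge instantly."""
    write_queue.put({"user_id": user_id, "team": team, "shard": shard_for(user_id)})
    return "Team saved ✓"            # the user sees success; the DB write happens later

def drain_worker() -> None:
    """Background consumer: writes reach the DB at a smooth, sustainable rate."""
    while True:
        item = write_queue.get()
        # db_write(item) would go here; we just simulate the paced write
        time.sleep(1 / SUSTAINABLE_WRITES_PER_SEC)
        write_queue.task_done()

threading.Thread(target=drain_worker, daemon=True).start()

# The last-5-minute storm: a burst of saves gets an instant ack,
# while the queue meters them out to the sharded DB behind the scenes.
for i in range(20):
    print(save_team(f"user-{i}", ["kohli", "bumrah", "dhoni"]))
```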

The Key Mindset Shift

If you're an engineer scaling something for a Dream11-style workload, the question you need to ask is not "how fast can my HPA react?" It's:

"What do I know about this spike in advance, and how do I make my system behave like it's a normal day?" For Dream11, the whole game is to turn the IPL-final step function into something that looks like Tuesday afternoon to the actual database. Pre-scale. Pre-warm. Queue. Shard. Shed. HPA is the safety net, not the strategy.

This pattern repeats everywhere — Zomato when India wins a cricket match, Zerodha at the 9:15am market open, Swiggy at 8pm on a Sunday. Known spike on a calendar = scheduled scaling, not reactive.

Sources — Dream11 Engineering Blog
  • Dream11's engineering team publishes at engineering.dream11.com — read their posts on handling IPL-day traffic, Kafka-based write absorption, and sharded Postgres architecture.
  • Search for "Dream11 scaling IPL" on Medium — multiple deep-dives from their SREs walking through pre-scale playbooks, contest-list cache warming, and circuit breakers.
  • Talks from Dream11 at events like Rootconf and AWS re:Invent cover the same playbook — watch for "Dream11 fantasy sports scale" on YouTube.

Case Study 2: Hotstar and JioCinema — Scaling IPL Live Streaming

Spike shape: Step + "wow moments" inside it  |  What broke: Reactive scaling cannot move fast enough for 2M → 15M in one over  |  What solved it: Scheduled pre-scale + CDN pre-positioning + client-side jitter + graceful shedding (Panic Mode). Reactive HPA only handles drift.

If Dream11 is the write-heavy spike, live-streaming an IPL match is the read-heavy spike. India has produced the two biggest examples the internet has ever seen — Hotstar in 2019 and JioCinema in 2023. Both teams have spoken openly about how they did it, which is why we can learn from real numbers instead of guessing.

Hotstar's India vs New Zealand World Cup 2019 semi-final peaked at 25.3 million concurrent viewers — a world record at the time. Their engineers have publicly described "wow moments" during big matches (a wicket, a six, Dhoni walking in) where concurrency can double or more inside a single over — adding several million viewers in just a few minutes.

JioCinema broke that record during the IPL 2023 final at a reported 32 million concurrent viewers. The techniques are nearly identical to Hotstar's — because reactive scaling cannot solve this shape of spike, full stop.

[Figure: Hotstar IPL / World Cup — the "wow moment" spike, concurrent viewers on the y-axis. The pre-match buildup sits at ~2M concurrent; a wicket or a six takes it to 15M+ within a single over (~4 minutes) before the moment settles. Peaks: 25.3M (Hotstar, IND v NZ, World Cup 2019) and 32M (JioCinema, IPL 2023 final). Zero complete outages — because they never trusted HPA alone.]

What They Actually Did (From Public Engineering Talks)

The Hotstar engineering team has spoken about this publicly at events like Scale by the Bay and in their blog. The architecture boils down to five moves — every one of them is about not relying on reactive scaling:

How Hotstar / JioCinema Survive 25M+ Concurrent

  1. Pre-scale to the predicted peak hours before kickoff. The match schedule is known, and the capacity plan is based on historical matches of similar significance. World Cup semi against NZ? Plan for 30M, even if the baseline is 500K. Reactive isn't in the conversation.
  2. Aggressive CDN + edge caching. Video segments (HLS/DASH chunks) are pre-pushed to edge PoPs; 99%+ of video bytes never touch origin. A spike of 25M viewers is a spike on the CDN, not on the origin. The system the user hits is the fridge, not the kitchen.
  3. Client-side jitter and exponential backoff. Mobile apps stagger their manifest refreshes with random jitter, and retries use exponential backoff with a cap. Without this, 15M clients would all hammer /manifest at exactly the same second. With it, the load is smooth.
  4. The "Panic Button" — graceful degradation as a feature. Kill switches for heavy features: comments, stats overlays, personalisation. When load crosses thresholds, non-critical services auto-shed. The stream keeps playing even if "trending contests" disappears.
  5. Over-provision — it's cheaper than an outage. For a known marquee match, Hotstar would provision for 2x the predicted peak and eat the cost. The brand damage from one bad match day far exceeds the cost of idle servers for a few hours. Reactive cannot buy you that safety.
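Move 3 is the easiest to show in code. The sketch below captures the client-side pattern the talks describe: a random initial delay plus capped exponential backoff on retries. The specific numbers (a 30-second jitter window, 1-second base, 32-second cap) are illustrative assumptions, not Hotstar's actual values.

```python
import random
import time

def jittered_start_delay(max_jitter_s: float = 30.0) -> float:
    """Each client waits a random slice of the window before its first request,
    so millions of clients don't all hit /manifest in the same second."""
    return random.uniform(0, max_jitter_s)

def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 32.0) -> float:
    """Capped exponential backoff with full jitter for retries after a failure."""
    return random.uniform(0, min(cap_s, base_s * (2 ** attempt)))

def fetch_manifest(do_request, max_attempts: int = 5):
    time.sleep(jittered_start_delay())
    for attempt in range(max_attempts):
        try:
            return do_request()
        except Exception:
            time.sleep(backoff_delay(attempt))
    raise RuntimeError("manifest fetch failed after retries")
```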

Where Reactive Scaling Fits in This Picture

It doesn't, really. HPA on the Hotstar origin layer exists — but it's there to handle the drift between the pre-scaled baseline and actual traffic, not to absorb the spike. If HPA has to do real work during a match, something has already gone wrong with the plan.

The pattern to remember: read-heavy spikes (Hotstar, JioCinema, news sites, viral content) are solved with CDN + pre-scaling + client jitter. Write-heavy spikes (Dream11, payments, ticketing) are solved with pre-scaling + sharding + queues. In both cases, reactive scaling is a safety net, not the plan.
Sources — Hotstar & JioCinema Engineering
  • Hotstar's widely-shared engineering post "How Hotstar Scales During IPL" — available on their tech blog and Medium, covers the 25.3M concurrent milestone, the "Panic Mode" shedding system, and their CDN strategy.
  • Hotstar engineers have spoken at QCon, InfoQ, and Scale by the Bay — search for "Hotstar scale 25 million concurrent" on YouTube for the original talks with architecture diagrams.
  • JioCinema's engineering team has published on Medium and spoken at industry events about hitting ~32M concurrent during IPL 2023 — search "JioCinema scaling IPL 2023" for the deep-dives on their CDN pre-positioning and client-side jitter.
  • CDN and cloud-provider post-event reports (Akamai, AWS reInvent sessions) often reference the JioCinema IPL peak traffic numbers — useful for independent verification.

Case Study 3: The OTT Launch That Cratered in 90 Seconds

Spike shape: Step (hard launch time)  |  What broke: Pure reactive HPA — metric lag + cold JVM starts + cluster autoscaler = 8 min to full capacity, spike took 17 seconds  |  What solved it: Scheduled pre-scale 15 min before launch + aggressive edge cache. Reactive now only handles drift.

A streaming platform was launching a tentpole show. Their capacity plan was "HPA target 60% CPU, min 40 replicas, max 400." They load-tested at 3x baseline. Launch went out at 8:00pm IST. The home feed service died at 8:00:47pm.

What happened?

[Figure: Case Study 3 — OTT launch spike (anonymised), from T−5m to T+10m. Actual traffic jumps from 60k to 780k rps, a 13x spike in 17 seconds. HPA ramps too late — fully ramped only at T+8m — leaving a 3-minute outage window of 5xx plus 5 minutes of degraded service. The fix (pre-scale + edge cache) meant no outage on the next launch.]

The post-mortem showed three independent reasons HPA couldn't save this:

  1. Metric lag. CloudWatch custom metrics were on a 60s resolution. By the time HPA saw CPU at 90%, traffic had already been there for 45 seconds.
  2. Cold starts. The service was JVM-based. P50 warm-up was 55s, P99 was 2m10s (classloader + JIT + connection pool). Pods got scheduled fast; they weren't serving fast.
  3. Cluster autoscaler. They hit node capacity at T+40s. Cluster autoscaler kicked in, but EC2 instance boot + kubelet registration + image pull took another 3 minutes.

HPA was doing its job correctly. Its job just took 8 minutes. The launch took 17 seconds.

What Actually Fixed It

They did not tune HPA harder. They changed the architecture:

The Fix — Three Layers Before HPA Matters

  1. Scheduled pre-scale: CronHPA scales to 250 replicas 15 minutes before launch time. You pay for idle capacity; it absorbs the step.
  2. Edge cache / CDN: the home feed is cached at the edge with a 30s TTL + stale-while-revalidate. 85% of requests never hit origin.
  3. HPA: handles drift from the pre-scaled baseline. It's now operating on a wave, not a step.

Result: the next launch peaked at 1.1M rps with zero 5xx. The cost of the pre-scale: ~12% above baseline, only for ~30 minutes per launch.

Key idea: if the event is on a calendar, scaling shouldn't wait for a metric. Reactive is the wrong trigger for known events. Scheduled pre-scale, combined with a cache layer that flattens the step into a trickle, turns this from a capacity problem into a cache problem — and caches are much easier to scale.
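The stale-while-revalidate behaviour in layer 2 is what flattens the step: an expired entry is still served immediately while a single background refresh runs, so origin sees roughly one request per key per TTL window instead of one per user. A minimal in-process sketch of the idea (the real thing lives at the CDN edge; the class, names, and TTL here are illustrative):

```python
import threading
import time

class SwrCache:
    """Tiny stale-while-revalidate cache: serve stale data instantly,
    refresh in the background, and let only one refresh per key reach origin."""

    def __init__(self, ttl_s: float = 30.0):
        self.ttl_s = ttl_s
        self._entries: dict[str, tuple[float, object]] = {}   # key -> (fetched_at, value)
        self._refreshing: set[str] = set()
        self._lock = threading.Lock()

    def get(self, key: str, fetch_from_origin):
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None:
                fetched_at, value = entry
                if time.time() - fetched_at > self.ttl_s and key not in self._refreshing:
                    self._refreshing.add(key)
                    threading.Thread(target=self._refresh, args=(key, fetch_from_origin),
                                     daemon=True).start()
                return value                      # fresh or stale, the user never waits
        # First request for this key: the only case that blocks on origin.
        value = fetch_from_origin()
        with self._lock:
            self._entries[key] = (time.time(), value)
        return value

    def _refresh(self, key, fetch_from_origin):
        value = fetch_from_origin()
        with self._lock:
            self._entries[key] = (time.time(), value)
            self._refreshing.discard(key)

home_feed_cache = SwrCache(ttl_s=30.0)
print(home_feed_cache.get("home:v2:IN", fetch_from_origin=lambda: {"rows": ["..."]}))
```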

Case Study 4: Notification Fan-Out — The Spike You Create Yourself

Spike shape: Thundering herd (seconds-long peak)  |  What broke: Reactive HPA scaled stateless tier but DB connection pool saturated; client retries amplified load  |  What solved it: Client jitter + staged push cohorts + queue in front of DB + edge cache. The fix is not "scale faster" — it's "never let the peak happen."

A ticketing platform sent a push notification to 18 million users at 10:00:00 AM sharp announcing ticket drops. Their app opened a screen that hit three APIs: /events, /availability, /recommendations. HPA kicked in and actually scaled up. That wasn't the problem. The problem was that by 10:00:12, every downstream service was dead — including the database.

[Figure: Case Study 4 — notification thundering herd. Push sent to 18M devices at T=10:00:00; 14M apps open within 8 seconds, each making 3 API calls, for a 42M rps peak. The API gateway saturates and returns 503s; the backend and DB follow — HPA is adding pods, but DB connections are exhausted, and app timeouts trigger auto-retries that add more load. Timeline: T+0s push sent, T+3s 42M rps hits the API, T+12s DB connection pool dead, T+90s HPA ready (too late).]

Three things went wrong even though scaling "worked":

  1. Scaling the stateless tier doesn't help if the stateful tier can't keep up. HPA added API pods. Each new pod opened 30 DB connections from its HikariCP pool. The DB had a hard cap of 2,000 connections. Adding pods made the DB worse (see the sketch after this list).
  2. Client retry policies are load amplifiers. The mobile SDK retried failed calls up to 3 times. A 5xx became 4 requests. The spike was multiplied by the clients trying to escape it.
  3. The spike was over before HPA finished. By T+90s, the app opens were done. But the cluster was now running at 4x its needed size with a wrecked downstream DB.
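The connection-pool arithmetic behind point 1 is worth doing explicitly, because it puts a hard ceiling on how far scaling the stateless tier can help. The 2,000-connection cap and 30-connection pools come from the incident; the small reserve for other clients is an assumption.

```python
DB_MAX_CONNECTIONS = 2000       # hard cap on the database
POOL_SIZE_PER_POD = 30          # HikariCP connections opened by each API pod

def max_useful_pods(db_max_conns: int, pool_per_pod: int, reserve: int = 100) -> int:
    """Past this replica count, new pods can't get DB connections;
    scaling the stateless tier further only makes the stateful tier worse."""
    return (db_max_conns - reserve) // pool_per_pod

print(max_useful_pods(DB_MAX_CONNECTIONS, POOL_SIZE_PER_POD))   # -> 63 pods
# HPA happily scaled well past that; every pod above the ceiling just
# fought the others for connections and added load to a dying DB.
```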

The Fix: Don't Scale, Smooth

The right architecture for thundering herds is the opposite of "scale faster." It's dampening:

Fix — Smooth the Spike, Don't Chase It

  1. Client jitter: the app waits a random 0–30s before its first fetch. 42M rps becomes ~1.4M average.
  2. Staged push: send to the 18M devices in cohorts over 5 minutes. The step becomes a gentle wave.
  3. Queue + shed: SQS/Kafka in front of the DB, and 429 + backoff above a threshold. The DB never exceeds its capacity.
  4. Cache: the response is cached at the edge with a 5s TTL. >90% hit rate.

Before: a 42M rps peak. After: a 2.4M rps peak stretched into a 5-minute wave.

The counterintuitive takeaway: on a thundering herd, your job isn't to scale to the peak. It's to prevent the peak from happening. Client jitter, staged fan-out, queues, and edge caching reshape a step into a wave — and waves are what reactive scaling was designed for.
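Staged fan-out is cheap to implement on the push side. A sketch of the cohorting logic: split the audience into evenly spaced cohorts across a send window and add per-device jitter on top. The cohort count, window length, and jitter values are illustrative assumptions.

```python
import random

def schedule_push(device_tokens: list[str],
                  window_s: int = 300,            # spread sends over 5 minutes
                  cohorts: int = 20,
                  per_device_jitter_s: float = 15.0) -> list[tuple[float, str]]:
    """Return (send_offset_seconds, token) pairs instead of blasting everyone at T=0."""
    schedule = []
    for i, token in enumerate(device_tokens):
        cohort = i % cohorts
        offset = cohort * (window_s / cohorts) + random.uniform(0, per_device_jitter_s)
        schedule.append((offset, token))
    return schedule

# 18M tokens become 20 cohorts of ~900k, each landing ~15s apart:
# the step becomes a wave that reactive scaling (and the DB) can actually follow.
print(schedule_push(["tok-a", "tok-b", "tok-c"])[:3])
```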

Case Study 5: Sawtooth — The HPA That Never Stopped Flapping

Spike shape: Sawtooth (periodic bursts)  |  What broke: Reactive HPA constantly adding/removing pods in rhythm with cron — pods flap, costs 2.3× higher, p99 doubles  |  What solved it: Scheduled floor via CronHPA + cron-worker jitter + queue to decouple workers from API. Fight the forcing function, not the HPA.

A data platform ran ETL cron jobs every 5 minutes; each run triggered a burst of API calls from workers to a metadata service. HPA saw CPU > 70% and scaled from 10 → 40 pods. Ninety seconds later the burst was over. HPA's scale-down stabilization window was 5 minutes, so just as it finally started removing pods, the next cron fired and needed them again.

Result: pods constantly being added/removed, readiness probes flapping, tail-latency p99 doubled, and a spend 2.3x higher than necessary. No outage. Just constant pain.

[Figure: Sawtooth — HPA flapping against a cron pattern. Traffic pulses every 5 minutes (red); HPA replicas (blue) chase it forever. HPA is a feedback loop and cron is a periodic forcing function — feedback loops oscillate when the period of the forcing function matches their own.]

Fixes, in descending order of how often they actually work:

  1. Schedule the capacity — a cron-based autoscaler (KEDA's cron scaler, or the kubernetes-cronhpa-controller) keeps a floor of 25 replicas for the 90 seconds around each cron tick, effectively pausing HPA's reactive component during that window (see the sketch after this list).
  2. Smooth the source — spread the cron workers' start over 30 seconds with jitter so they don't bang the API simultaneously.
  3. Right-size the stabilization window — bump scale-down stabilization so HPA doesn't remove pods during the trough. Accept slightly higher steady-state cost in return for no flapping.
  4. Use a queue — decouple workers from metadata service with a queue. Now the metadata service sees near-constant load regardless of worker bursts.
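Fix 1 boils down to a time-windowed replica floor. Here's a sketch of the rule a cron-based scaler would apply in this scenario; the function is my illustration (not KEDA's implementation), and the period, window, and floor values come from this example.

```python
def min_replicas_at(seconds_since_midnight: int,
                    cron_period_s: int = 300,      # ETL fires every 5 minutes
                    window_s: int = 90,            # burst lasts ~90s after each tick
                    burst_floor: int = 25,
                    idle_floor: int = 10) -> int:
    """Hold a higher replica floor around each cron tick so HPA never has to chase the burst."""
    seconds_into_cycle = seconds_since_midnight % cron_period_s
    return burst_floor if seconds_into_cycle < window_s else idle_floor

print(min_replicas_at(10 * 3600 + 30))    # 30s after a tick  -> 25 (floor held up)
print(min_replicas_at(10 * 3600 + 200))   # 200s after a tick -> 10 (back to idle floor)
```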

The Two Weapons Nobody Talks About: ML Capacity Forecasting + P0 Service Tiers

If you read the engineering blogs from Dream11, Hotstar, and JioCinema carefully, you'll notice two patterns that aren't in any Kubernetes tutorial — and these are what really let them survive events like the IPL final.

Weapon #1: Their AI/ML Team Tells the Infra Team How Much to Scale

Here's something most engineers don't realise. Dream11, Hotstar, and JioCinema don't just "pre-scale for IPL." They pre-scale for this specific match, based on numbers a machine-learning team generates days in advance.

The ML team looks at things like:

  • Historical match data — how much traffic did IND vs PAK bring last year? How about IND vs Afghanistan? Not every match is equal.
  • Social media sentiment — how much is this match being talked about on Twitter/X, Reddit, YouTube in the days leading up?
  • User behaviour signals — app opens, contest joins, reminders set, push-notification opt-ins for this match.
  • External factors — weather (rain = fewer streams), time of day, weekday vs weekend, holiday calendar.

Out of this, the model outputs one simple number: "expected peak concurrent users for this match = 28 million." The infra team scales to ~1.5× that number an hour before kickoff. Done. No metric-watching, no guessing.

How the ML Team Feeds the Infra Team a Number

  • Inputs: historical matches (IND v PAK = 28M), social sentiment (Twitter, Reddit, YouTube), user signals (reminders, contest joins), external factors (weather, holiday, weekday), and marketing pushes (ad spend, notification schedule).
  • Model: an ML forecast model owned by the data science team — regression / gradient boosting, retrained after every match.
  • Output: a peak estimate of ≈28M users, delivered to the infra team 48 hours before the match.
  • Action: infra pre-scales to 1.5× = capacity for 42M — pods, DB replicas, CDN contracts, edge PoPs.

This is predictive scaling done properly. Not the half-baked "let the autoscaler look at last 15 minutes" version. The input is humans and ML forecasting days ahead, not a controller watching CPU in real time. The infra team treats it like a weather forecast: you don't react to rain, you carry an umbrella because the forecast said so.
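The hand-off on the infra side is almost boring arithmetic, which is exactly the point. A sketch of turning the forecast into a pre-scale target: the 28M forecast and 1.5× headroom come from the text, while the per-user request rate and per-pod throughput are made-up assumptions.

```python
import math

def pods_for_forecast(peak_concurrent_users: int,
                      headroom: float = 1.5,          # scale to 1.5x the forecast
                      rps_per_user: float = 0.2,      # assumption: avg request rate per user
                      rps_per_pod: float = 500.0) -> int:
    """Turn the ML team's peak estimate into a pre-scale target for the API tier."""
    planned_users = peak_concurrent_users * headroom
    planned_rps = planned_users * rps_per_user
    return math.ceil(planned_rps / rps_per_pod)

print(pods_for_forecast(28_000_000))   # forecast "28M peak" -> pre-scale to 16,800 pods
```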

Weapon #2: P0 / P1 / P2 Service Tiers — Turn Off What Doesn't Matter

The second technique is even simpler and more powerful: not every feature in the app has the same importance. When traffic spikes, you keep the critical stuff alive and turn off everything else.

Every service is tagged with a priority:

  • P0 — cannot fail. Stream playback. Contest join. Payment. Login. If this breaks, the business breaks.
  • P1 — important, but sheddable. Recommendations, leaderboards, match stats overlay, comments.
  • P2 — nice-to-have. Personalised home feed, "friends are watching," social share counts, trending contests carousel.

When load crosses a threshold (or the "Panic Button" is pressed by an SRE), P1 and P2 services get shed automatically. The recommendation API returns a cached default. The comments section shows "comments temporarily unavailable." The personalised feed becomes a generic feed. The stream keeps playing. The contest keeps accepting submissions.
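A sketch of what that shedding looks like in code: a tier map plus a panic flag, with cached or empty fallbacks for everything below P0. The service names, fallbacks, and trigger are illustrative; this shows the pattern, not any platform's actual panic-mode implementation.

```python
from enum import IntEnum

class Tier(IntEnum):
    P0 = 0   # stream playback, contest join, payment, login: cannot fail
    P1 = 1   # recommendations, leaderboards, stats overlay, comments
    P2 = 2   # personalised feed, "friends are watching", trending carousel

SERVICE_TIER = {
    "playback": Tier.P0, "contest_join": Tier.P0, "payment": Tier.P0,
    "recommendations": Tier.P1, "comments": Tier.P1,
    "personalised_feed": Tier.P2, "trending": Tier.P2,
}

FALLBACKS = {
    "recommendations": {"items": []},                 # cached/generic default
    "comments": {"message": "comments temporarily unavailable"},
    "personalised_feed": {"feed": "generic"},
    "trending": None,                                 # feature simply disappears
}

def handle(service: str, call_service, panic_mode: bool, max_allowed_tier: Tier = Tier.P0):
    """During panic mode, only services at or above the allowed tier do real work;
    everything else returns its fallback instantly and frees capacity for P0."""
    if panic_mode and SERVICE_TIER[service] > max_allowed_tier:
        return FALLBACKS[service]
    return call_service()

# Normal day: everything runs.   Panic mode: P0 gets the CPU, P1/P2 get fallbacks.
print(handle("comments", lambda: {"comments": ["nice catch!"]}, panic_mode=True))
```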

Service Tiering — What Stays On, What Gets Killed

  • Normal day — every service running. P0 (stream, payment, login): cannot fail, ever. P1 (recommendations, stats, comments): important but sheddable. P2 (personalised feed, trending carousel): nice-to-have, first to go.
  • Spike hits → panic mode triggers.
  • IPL final / big spike — only P0 survives, on purpose. P0 (stream, payment, login) gets all the capacity. P1 is degraded: a cached fallback is returned ("stats unavailable"). P2 is turned off: the feature flag is killed and a generic feed is shown instead.

Why this works so well: instead of scaling every service to survive the spike, you scale only P0 and shed P1 and P2. The same infra handles 4–5× more P0 traffic, because P1/P2 aren't competing for CPU, DB, or bandwidth. The user sees "stream works, contest works" and is happy. Nobody notices the comments are off during a wicket.

This is what Hotstar calls their "Panic Mode". It's a feature, not a bug. They designed the system to degrade gracefully under load. When an SRE presses the panic button (or it auto-triggers above a threshold), the whole fleet sheds P1/P2 services within seconds and focuses every CPU cycle on P0.

The takeaway: on a spiky day, the question isn't "how do I scale everything?" — it's "what's the smallest set of services that absolutely must survive, and can I turn off everything else?" Dream11 does this. Hotstar does this. JioCinema does this. So do Stripe, Shopify on Black Friday, and every payment platform during holiday sales.

Putting It Together: How a Real IPL Day Actually Runs

Here's the real playbook, end-to-end, as documented in the engineering blogs:

IPL Day — A Minute-by-Minute Reality Check

  • T–48h: the ML model outputs "expected peak: 28M" → infra plans for 1.5× = 42M worth of capacity.
  • T–2h: CronHPA scales the API tier 40→400, the worker tier 20→200, and adds 8 DB read replicas (scheduled, not reactive).
  • T–30m: match metadata, contest lists, and player stats are pushed to Redis + edge PoPs. 95% of reads never hit origin.
  • T=0 (toss): the 20× spike arrives. The queue absorbs writes. HPA handles drift from the pre-scaled baseline — not the spike itself.
  • Wicket: 2M → 15M in 4 minutes. Panic mode auto-trips, P1/P2 are shed, and only P0 (stream + contest + payment) is kept alive.

Notice: reactive scaling never saved the day. It was predictions + schedules + priorities all the way down.

Where Reactive Scaling Actually Works

Reactive isn't broken. It's just oversold. Here's where it genuinely shines:

When Reactive Scaling Is Genuinely the Right Tool

  1. Diurnal / slow-growth patterns. Morning ramp-up, evening wind-down. Your loop has minutes, not seconds.
  2. Queue-backed workers (KEDA). Scale on queue depth. Work is buffered; latency is elastic by design.
  3. Serverless (Lambda, Cloud Run). Provisioning lag is measured in milliseconds, not seconds. Reactive becomes fast enough for spikes.
  4. Behind an aggressive cache. The cache absorbs the spike; the origin sees a muted signal that reactive can handle.

The common thread: something upstream absorbs or slows the change. Reactive doesn't fail on spikes; it fails on spikes that reach it unattenuated.

The Playbook — What to Do Instead

There is no one scaling strategy. There is a ladder. For any workload, walk it in this order.

The Spiky-Traffic Playbook

  1. Know the shape of your spike. Step, herd, wave, sawtooth — look at the last 30 days of traffic at 1-second resolution, not 1-minute averages.
  2. If the spike is on a calendar, schedule capacity. Ticket drops, OTT launches, cron jobs — CronHPA / scheduled ASG actions. Don't wait for a metric you already predicted.
  3. If you cause the spike, dampen it at the source. Client jitter, staged push, cohorted rollouts. The cheapest spike is the one you didn't send.
  4. Put a buffer between client and origin. Edge cache for reads, queue for writes. Both convert spikes into waves, and both are cheaper than over-provisioning.
  5. Shed load with grace. An explicit 429 + Retry-After beats silent queueing that turns into timeouts. Failing fast protects the system you already have.

Only after all of the above — then let HPA handle the residual drift.
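Step 5 of the playbook is small enough to show. A sketch of concurrency-capped shedding: admit a bounded number of in-flight requests and reject the rest immediately with a 429 and a Retry-After hint. The limit and retry value are illustrative; in practice this usually lives in the gateway or a middleware layer.

```python
import threading

class LoadShedder:
    """Admit up to max_in_flight requests; shed the rest fast with 429 + Retry-After."""

    def __init__(self, max_in_flight: int = 200, retry_after_s: int = 3):
        self._slots = threading.Semaphore(max_in_flight)
        self.retry_after_s = retry_after_s

    def handle(self, do_work):
        if not self._slots.acquire(blocking=False):
            # Failing fast protects the capacity you already have.
            return 429, {"Retry-After": str(self.retry_after_s)}, "try again shortly"
        try:
            return 200, {}, do_work()
        finally:
            self._slots.release()

shedder = LoadShedder(max_in_flight=200)
print(shedder.handle(lambda: "team saved"))
```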

A Note on Databases

Everything above is about stateless tiers. Databases are different. You don't "reactively scale" a primary RDS instance at 10:00:05. You pre-provision, you add read replicas ahead of time, you shard before the shard is hot. The one form of reactive scaling databases support well is storage autoscaling, and even that has a cooldown.

For databases, the rule is simpler: the fix is always upstream. Cache reads (Redis, CDN, materialized views). Queue writes. Batch. Deduplicate. If your DB is the thing you're trying to scale reactively, you've already lost.
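"Batch and deduplicate" is concrete enough to sketch. The snippet below collapses repeated writes to the same key inside a small window and flushes them as one batched DB write; it illustrates the upstream-fix idea, and the window length and batch size are made-up assumptions.

```python
import time

def batch_and_dedupe(writes, window_s: float = 1.0, max_batch: int = 500):
    """Group incoming (key, value) writes into batches; keep only the latest
    value per key in each batch so the DB sees one write per key per window."""
    batch: dict[str, object] = {}
    window_started = time.monotonic()
    for key, value in writes:
        batch[key] = value                                   # later write wins
        window_full = len(batch) >= max_batch
        window_closed = time.monotonic() - window_started >= window_s
        if window_full or window_closed:
            yield list(batch.items())                        # one batched DB write
            batch, window_started = {}, time.monotonic()
    if batch:
        yield list(batch.items())

# Five rapid team edits by the same user become a single row write downstream.
edits = [("user-1", "team-v1"), ("user-1", "team-v2"), ("user-2", "team-v1"),
         ("user-1", "team-v3")]
for db_batch in batch_and_dedupe(edits):
    print(db_batch)   # -> [('user-1', 'team-v3'), ('user-2', 'team-v1')]
```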

Rule of thumb: stateless tiers scale horizontally; stateful tiers scale vertically until they can't, then they shard. Nothing about that requires HPA. HPA is an optimization on the stateless tier for a specific traffic shape.

Key Takeaways

  1. Scaling has two axes: direction and trigger. Most outages are wrong-trigger, not wrong-direction. Pick each independently.

  2. Reactive scaling takes 90 – 500 seconds. Spikes peak in 10 – 30. The math doesn't work — not for any amount of HPA tuning.

  3. Know your spike shape. Step, thundering herd, wave, sawtooth. Each needs a different strategy. Wave is the only one reactive handles cleanly.

  4. If it's on a calendar, schedule it. CronHPA, scheduled ASG actions, pre-warming. Reactive is the wrong trigger for known events.

  5. If you caused the spike, dampen it. Client jitter, cohorted push, queues. The cheapest spike is the one you never created.

  6. Buffer aggressively. Edge caches absorb reads. Queues absorb writes. Both convert steps into waves — and HPA is fine on waves.

  7. Shed with 429, not silence. Explicit rejection with Retry-After protects the system. Silent queueing turns into timeouts turns into retry storms.

  8. Databases don't scale reactively. Stop trying. Fix it upstream with caches, queues, and pre-provisioned replicas.

  9. HPA is an optimization, not a strategy. It handles drift on a well-shaped traffic curve. It is not a plan for surviving spikes.

The best scaling architecture is the one where your metric never has to catch up — because the spike was flattened before it reached the thing that scales.