Scale Out or Size Right? A Guide to Horizontal Scaling and Pod Right-Sizing

Say what you will about the countless software orchestration frameworks out there, but if you’re indexing on raw flexibility, Kubernetes is the king of containers. If your goal is to run serious software at scale, there’s practically nothing it can’t do.

But that impressive flexibility comes at a cost: impressive complexity. Particularly in terms of workload autoscaling.

At Thoras we’ve spent a lot of time thinking about workload autoscaling. And we know that understanding when and how to autoscale your workloads can feel overwhelming, so we’ve put together this guide to demystify some of that complexity. We hope you find something useful!

35k Foot Overview

Generally speaking, there are two high-level approaches when it comes to autoscaling workloads running in Kubernetes. And it’s important to note that while there is some overlap, they’re usually solving different problems.

We're talking about the Horizontal Pod Autoscaler (HPA) and pod right-sizing.

Which approach should you use? When? Why? And most importantly, how the hell do you avoid getting paged at 3am? 😅

Let's break it down y’all!

So… What's the Deal with HPA?

The Horizontal Pod Autoscaler is probably what comes to mind when most people think of Kubernetes autoscaling. HPA's job is simple: add or remove pods based on observed metrics like CPU, memory usage, or your own custom metrics.

HPA watches your metrics and says "hey, we're getting hammered right now, let's spin up more pods!" or "things are quiet, let's scale down to save money." It's all about pod quantity. Spin up more pods when needed, spin them down when not.

When HPA is the bee's knees:

  • When a workload is sensitive to sudden, unpredictable traffic spikes– think product launches, Black Friday, or that one time your startup got mentioned on Hacker News
  • Stateless workloads that can easily handle running multiple copies in parallel
  • When a workload is a better fit for wide, fanout throughput
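
To make this concrete, here’s a minimal HPA sketch for a hypothetical stateless Deployment named `web` (the names and targets are illustrative, not a prescription):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web            # hypothetical stateless Deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # add pods to keep average CPU near 70%
```

HPA adjusts the replica count between minReplicas and maxReplicas to keep the observed metric near the target.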

Some HPA Drawbacks:

  • It's reactive. By the time HPA notices you need more capacity and spins up pods, your users might already be experiencing degraded performance. And if your pods take 30 seconds to come online, that's 30 seconds of potential pain (not to mention your node autoscaler may have to spin up nodes as well, which has its own delay).
  • HPA optimizes replica counts, not resource efficiency. HPA never asks “do the pods I’m adding have sensible resource allocation settings?” Scaling to meet demand is much less impressive when you’re needlessly lighting your AWS bill on fire to achieve it.

HPA Tips

1. Don't just scale on CPU. It’s super common to set up HPA on CPU utilization and call it a day. 

But we’ve found that, due to both its volatility and its compressibility, CPU isn’t always the best signal for scaling pods! If you're running a web service, requests per second might be a more effective signal. Running workers? Queue depth is probably a great metric to look at too.

The beauty of HPA is you can scale on pretty much any metric you expose. Got Prometheus? Use the Prometheus Adapter to scale on business metrics that have strong correlation to usage. Your future self will thank you when replica count graphs are flatter and lower.
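
As a sketch, assuming the Prometheus Adapter already exposes a per-pod `http_requests_per_second` metric (the metric name and target value here are hypothetical), the HPA's metrics section could look like:

```yaml
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second   # hypothetical metric exposed via Prometheus Adapter
    target:
      type: AverageValue
      averageValue: "100"              # aim for roughly 100 req/s per pod
```

With a Pods-type metric, HPA averages the value across all pods and scales the replica count to hold that average near the target.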

2. Tune your scale-up and scale-down behaviors

For a long time now (since Kubernetes 1.18), HPA has come with knobs you can turn to control how aggressively it scales, via spec.behavior.

The defaults are actually pretty reasonable, but understanding what's happening under the hood helps you tune them to match your workload's needs.

Here's what HPA does by default:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300 # Waits 5 minutes before scaling down
    policies:
    - type: Percent
      value: 100 # Can remove 100% of pods (down to minReplicas)
      periodSeconds: 15
  scaleUp:
    stabilizationWindowSeconds: 0 # Scales up immediately
    policies:
    - type: Percent
      value: 100 # Can double your pods
      periodSeconds: 15 # Every 15 seconds
    - type: Pods
      value: 4 # Or add 4 pods
      periodSeconds: 15
    selectPolicy: Max # Uses whichever allows more aggressive scaling

The 5-minute scale-down stabilization window is there to prevent thrashing. You don't want to scale down during a brief lull only to immediately scale back up when traffic returns. But if you know your traffic patterns are spiky and unpredictable, you might want to be more conservative:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 600  # Wait 10 minutes instead of 5
    policies:
    - type: Pods
      value: 2  # Remove max 2 pods at a time
      periodSeconds: 60  # Every minute
Pro tip: if your workload is expensive to churn (lots of state to warm up, costly initialization), lean conservative on scale-down. If you're running stateless services with fast startup times and variable traffic, you can be more aggressive on both ends.
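
Going the other way, a fast-starting stateless service might tolerate a much shorter scale-down window (the values below are illustrative, so test them against your own traffic patterns):

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 60   # only wait 1 minute before scaling down
    policies:
    - type: Percent
      value: 50                      # remove up to half the pods
      periodSeconds: 30              # every 30 seconds
```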

3. Scale on usage predictions

In our opinion, the most painful HPA drawback is its reactive nature, which can be too slow, risking service degradation. Add glacial pod and node startup times into the mix and you’re forced to seriously overprovision to compensate. Wasting compute and AWS credits shouldn’t be the solution to autoscaling in 2026.

At Thoras, we believe the solution is scaling on your usage predictions in parallel with your real-time usage, whichever is higher. When pods and nodes are online before they’re even needed, warmup time becomes a problem of the past. Achieving higher reliability and cost efficiency no longer needs to be an either/or scenario.

Interested to try out predictive scaling? Don’t hesitate to reach out to info@thoras.ai and we can get you started right away!

Ok… What About Pod Right-Sizing?

Right-sizing is about ensuring workload pods allocate exactly the resources they need, so you’re not paying for idle capacity or accidentally starving critical paths.

Unlike HPA, pod right-sizing focuses on the size of each pod, not the quantity of pods, optimizing for efficiency rather than elasticity.
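
For a sense of what “exact resources” means in practice, right-sizing ultimately comes down to tuning each container's requests and limits (the numbers below are illustrative):

```yaml
resources:
  requests:
    cpu: 250m        # what the scheduler reserves for this container
    memory: 512Mi
  limits:
    memory: 512Mi    # hard cap; exceeding it gets the container OOM-killed
    # CPU limit intentionally omitted: CPU is compressible, so throttling
    # a bursting container is often worse than just letting it burst
```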

When pod right-sizing shines

  • Addressing chronic over-provisioning. Real talk: most workloads are over-allocated by default. Teams pad CPU and memory “just to be safe,” and those values rarely get revisited! Sound familiar? 😅 We’ve all done it.
  • Stateful workloads where changing replica counts would straight-up break the system, so HPA is out of the question.
  • Workloads that aren’t highly sensitive to (or likely to encounter) severe, unpredictable spikiness.

Right-sizing drawbacks

  • Just like HPA, traditional right-sizing methods are reactive, which may not work fast enough
  • Rolling out right-sizing configurations across many workloads can take a lot of coordination, especially if your delivery workflows are fragmented.

Right-sizing tips

1. Set sane min and max bounds per workload. Give right-sizing software guardrails. You can always widen to a more aggressive window as you gain confidence in your tool.

resourcePolicy:
  containerPolicies:
  - containerName: "server"
    minAllowed:
      cpu: 128m
      memory: 256Mi
    maxAllowed:
      cpu: 1
      memory: 2Gi

2. Start with recommendations only

Set your initial VPA objects to Off mode rather than Auto. This lets you see what VPA recommends without it actually modifying your pods. Review the recommendations for a few days to understand the patterns before enabling auto-updates.
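
A minimal sketch of that recommendation-only setup, targeting a hypothetical Deployment named `server`:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: server-vpa       # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: server         # hypothetical workload
  updatePolicy:
    updateMode: "Off"    # recommend only; never evict or modify pods
```

You can then inspect what VPA would do with `kubectl describe vpa server-vpa` and compare its recommendations against your current requests.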

3. Enable safer bin-packing by right-sizing based on future usage

Just like with its HPA cousin, in our not-so-humble opinion, the fundamental reactive nature of the traditional VPA controller is its core limitation. Right-sizing is slower and riskier when you’re only reacting to what’s already happened.

When you’re able to allocate pod resources based on your future usage, it’s much safer since resources will be provisioned before they’re needed, not after.

The end result is a more reliable and cost-efficient cluster, both at the same time!

Interested in trying out predictive right-sizing? Hit us up at info@thoras.ai, we’ll have you scaling in just a few minutes! ⚡

By the way, did we mention Thoras supports in-place pod rightsizing? :)
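
If you want to experiment with upstream Kubernetes’ own in-place resize (gated behind the InPlacePodVerticalScaling feature gate, introduced as alpha in 1.27, so availability depends on your cluster version), you can declare per-resource resize behavior on a container. The image name below is hypothetical:

```yaml
spec:
  containers:
  - name: server
    image: example/server:latest      # hypothetical image
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired      # CPU can change without restarting the container
    - resourceName: memory
      restartPolicy: RestartContainer # memory changes restart just this container
    resources:
      requests:
        cpu: 250m
        memory: 512Mi
```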

In Summary

Like anything in engineering, when exploring autoscaling workloads in Kubernetes, we should understand the problems we’re solving and pick the best tool for the job.

HPA shines at scaling workloads rapidly when traffic is unpredictable. It’s great at helping stateless workloads ride out volatile bursts of load, but it won’t do much for inefficiently configured workloads.

Right-sizing, on the other hand, emphasizes efficiency. It keeps workloads appropriately sized and reduces waste, but it’s less ideal for bursty spikes.

The real win comes from understanding the tradeoffs and using each tool intentionally. Combine the two thoughtfully and you get a system that’s both resilient and cost-effective, without lighting your cloud bill on fire.

That’s the sweet spot!

Interested to give predictive HPA or pod right-sizing a shot? Don’t hesitate to reach out to info@thoras.ai, we’d love to chat!

Trial Thoras for free, and start scaling immediately.
