0

When is horizontal scaling likely to solve your scaling problems?

Let's say you have a single API node (no DB) and a goal of sustaining 10k RPS over 5 minutes with a p95 latency below x ms. Requests come in and you start to see the p95 rise above x. If no metric clearly points to a resource-starved application (> 75% CPU, > 75% RAM, etc.), is it safe to assume that horizontal scaling is likely the solution?

At first I thought the answer was "yes", but then I saw this article. Vertically scaling a Node application from a large to an xlarge AWS instance allowed it to go from 10k RPS to 25k RPS. How is that possible? CPU utilization in the 10k test was only around 10%. It's possible it's memory, but that seems unlikely. Am I missing something? Or is horizontal scaling just cheaper than vertical scaling, with the additional benefit of resiliency?

stk1234
  • 99
  • 2

1 Answer

2

Generally speaking, scaling up is a safe bet for solving issues caused by increased load. Even the crappiest single-threaded application will usually benefit from a faster CPU, anything disk-bound will almost always benefit from more and faster storage, and even if more RAM doesn't immediately benefit the application itself, offloading I/O to memory (for example through a larger cache) usually helps too.

Scaling out is harder to predict: you need quite an intimate understanding of how an application functions (under load) to know beforehand whether it will behave properly when you scale horizontally, let alone whether doing so will actually address any of the bottlenecks that made you consider scaling out.

For both, it really helps to have proper performance metrics and to run load and stress tests. That is the only way to find your real bottlenecks, to see whether tuning and configuration changes make a difference, and to judge whether more or better hardware is the most cost-efficient upgrade.
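As a rough illustration of turning a load-test run into the numbers from the question (achieved RPS and p95 against a target), here is a minimal sketch in Python. It assumes you already have one latency sample in milliseconds per completed request; the names `percentile` and `summarize` are made up for this example, and the nearest-rank percentile is just one common definition, so your load-testing tool may report slightly different values.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so pct=0 returns the minimum.
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


def summarize(latencies_ms, duration_s, p95_target_ms):
    """Summarize one load-test run (hypothetical helper).

    latencies_ms: one latency sample per completed request
    duration_s: wall-clock length of the run in seconds
    p95_target_ms: the "x ms" goal from the question
    """
    p95 = percentile(latencies_ms, 95)
    return {
        "rps": len(latencies_ms) / duration_s,
        "p95_ms": p95,
        "meets_target": p95 < p95_target_ms,
    }
```

Running `summarize` on the same test at increasing request rates, next to CPU, memory, disk and network metrics for each run, shows at which load the p95 starts to climb and which resource, if any, moves with it.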

IMHO, pointing out where the application hits its bottleneck and insisting that the developers fix whatever crap they released to production is often a much better and more resilient solution than scaling up or out.

Rob
  • 1,175
  • 1
  • 7
  • Do you have resources you could point me to that lay out what "proper" load and stress tests look like, including how to interpret the results? – stk1234 Jul 15 '22 at 03:32
  • https://en.wikipedia.org/wiki/List_of_performance_analysis_tools and for example https://www.gartner.com/reviews/market/it-infrastructure-monitoring-tools – Rob Jul 16 '22 at 08:31