
I work at a company that is entirely on the cloud, and we have started a stress testing project. The idea is to replicate everything in production into a new environment and run stress tests against it to find the total capacity of the system and where the bottlenecks are.

Now, I remember a time when we stress tested physical servers as well as private clouds, and I remember that it was almost impossible to get a complete copy of production and all its moving parts. Also, even with stress testing tools like sysbench, JMeter and ab, you could never simulate traffic exactly like production.

We would usually monitor and profile production as much as we could, identify an issue, and then try to fix that specific issue by reproducing it in the stress testing environment.

To estimate capacity, we used to (and some still do) extrapolate from measured trends to predict when capacity will be exhausted or when response times will degrade below satisfactory levels.
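To illustrate, the kind of calculation I mean is roughly this - a sketch only, with made-up numbers and a made-up 500 ms threshold: fit a trend to the response times you are already monitoring and solve for the load at which the threshold is crossed.

    # A rough sketch of trend-based capacity prediction; the sample
    # numbers and the 500 ms threshold are hypothetical.
    import numpy as np

    # Load (requests/sec) and mean response time (ms) from monitoring.
    rps = np.array([100, 200, 300, 400, 500])
    resp_ms = np.array([120, 150, 190, 240, 310])

    # Fit a linear trend: resp_ms ~= slope * rps + intercept.
    slope, intercept = np.polyfit(rps, resp_ms, 1)

    SLA_MS = 500  # response time considered unsatisfactory beyond this
    predicted_capacity = (SLA_MS - intercept) / slope
    print(f"~{predicted_capacity:.0f} req/s before the threshold is crossed")

Real response curves turn sharply upward near saturation because of queueing, so a linear fit is optimistic - which is part of why these predictions were always rough.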

Considering that the project to recreate production and stress test it is quite time- and resource-consuming, is this the best way to find bottlenecks in the system and measure capacity, or is the "old" way better?

Jonathan
  • They both have their advantages and will find different bugs. Unfortunately, this is a really broad question that is likely more art (and opinion) than science, and you may not draw good answers in the Stack Exchange format. – Matthew Wetmore Aug 11 '17 at 21:22
  • But, for what it's worth: I always start a project with individual unit/component tests, and never finish a project until it has focused customer scenario validation and benchmarking. – Matthew Wetmore Aug 11 '17 at 21:24
  • I am just saying that stress testing your system is inaccurate, because you make a lot of assumptions as you write your tests. For example: a user will randomly click a random button on the web page every 1-10 seconds. Then multiply that user by 200 concurrent users and see how your system reacts. Usually with stress testing done this way, you compare a before and after: you get a baseline, you make a change and you test it. I don't see how this large recreation of production can be an accurate test for real-life traffic and future capacity. – Jonathan Aug 11 '17 at 22:33
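A minimal sketch of the user model Jonathan describes in the last comment, using Locust; the endpoint paths are hypothetical placeholders:

    # Sketch of the "random click every 1-10 seconds" user model.
    # Endpoint paths are hypothetical placeholders.
    import random
    from locust import HttpUser, task, between

    class RandomClicker(HttpUser):
        # Each simulated user waits 1-10 seconds between actions.
        wait_time = between(1, 10)

        @task
        def click_random_button(self):
            path = random.choice(["/", "/search", "/cart", "/checkout"])
            self.client.get(path)

Running it with something like locust -f clicker.py --users 200 gives the 200 concurrent users from the comment - and, as the comment says, every line here encodes an assumption about real traffic.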

1 Answer


It is always better to stress the entire system and use a production-like environment (or even production).

First of all, check out the question "Can a proportionately scaled down testing environment find performance load issues?" and its answers.

An application's underlying infrastructure is built from many different components, such as caches, web servers, application servers and disks (I/O). Bandwidth and CDNs also play a role in its function and therefore have to be taken into consideration during scaling. Each component behaves differently depending on how it is configured and scaled, and the tiered structure makes it difficult to calculate how each should be tested and scaled.
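To make the tiering point concrete, here is a toy model (all capacities are hypothetical): for requests that traverse every tier, end-to-end throughput is capped by the slowest tier, no matter how the others are scaled.

    # Toy bottleneck model for a tiered system; capacities are
    # hypothetical illustrations, not measurements.
    tiers = {
        "CDN/cache": 50_000,  # sustainable requests/sec per tier
        "web server": 8_000,
        "app server": 3_500,
        "disk I/O": 2_000,
    }

    # Requests that hit every tier are limited by the slowest one,
    # regardless of how the other tiers are scaled.
    bottleneck = min(tiers, key=tiers.get)
    print(f"Capacity ~{tiers[bottleneck]} req/s, limited by {bottleneck}")

In reality, cache hit ratios mean only a fraction of traffic ever reaches the lower tiers, which is exactly why the arithmetic gets hard and whole-system tests become attractive.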

So, if possible, always go for system testing under real conditions. If that is not possible, you can still run load tests against a scaled-down environment; however, don't expect to be able to extrapolate the results exactly, as in "this machine has 10 GB of RAM and can survive 1,000 RPS, so that machine with 20 GB of RAM will handle 2,000 RPS" - it won't work that way.
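One way to see why the "twice the RAM, twice the RPS" extrapolation fails is the Universal Scalability Law, which models throughput as sublinear in added resources because of contention and coherency costs. The coefficients below are purely illustrative and would have to be fitted from your own measurements:

    # Universal Scalability Law sketch: doubling a resource does not
    # double throughput. Coefficients lam/alpha/beta are hypothetical.
    def usl_throughput(n, lam=100.0, alpha=0.05, beta=0.001):
        # lam:   throughput of a single unit (req/s)
        # alpha: contention (serialization) penalty
        # beta:  coherency (crosstalk) penalty
        return lam * n / (1 + alpha * (n - 1) + beta * n * (n - 1))

    for n in (1, 2, 4, 8):
        print(f"{n} units -> ~{usl_throughput(n):.0f} req/s")

With these made-up coefficients, 2 units give roughly 190 req/s rather than 200, and the gap widens as you keep scaling.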

Dmitri T
  • My question was different: I am asking why it is better to stress test the entire system rather than monitor the system and only stress test specific bottlenecks when they occur. From your example: you said you have different components, different caches, CDNs, and you don't know how they will behave when scaled. But you do know from your monitoring when they are reaching their capacity, and you may be able to replicate a stress test more accurately for a single component than for the entire system. – Jonathan Aug 14 '17 at 09:36