We auto-scale our Elastic Beanstalk Java application based on average response time: when the average exceeds 3 seconds we add 2 instances to our environment, and once we are back under 1.5 seconds we remove 1 instance, with a 300-second cooldown on the policy.
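(For reference, that trigger is roughly equivalent to setting these Elastic Beanstalk options, shown here via the AWS SDK for Java; the environment name is just a placeholder:)

    import com.amazonaws.services.elasticbeanstalk.AWSElasticBeanstalk;
    import com.amazonaws.services.elasticbeanstalk.AWSElasticBeanstalkClientBuilder;
    import com.amazonaws.services.elasticbeanstalk.model.ConfigurationOptionSetting;
    import com.amazonaws.services.elasticbeanstalk.model.UpdateEnvironmentRequest;

    public class ConfigureLatencyTrigger {
        public static void main(String[] args) {
            AWSElasticBeanstalk eb = AWSElasticBeanstalkClientBuilder.defaultClient();

            // Express the latency trigger described above as EB option settings.
            eb.updateEnvironment(new UpdateEnvironmentRequest()
                    .withEnvironmentName("my-env") // placeholder environment name
                    .withOptionSettings(
                            opt("aws:autoscaling:trigger", "MeasureName", "Latency"),
                            opt("aws:autoscaling:trigger", "Statistic", "Average"),
                            opt("aws:autoscaling:trigger", "Unit", "Seconds"),
                            opt("aws:autoscaling:trigger", "UpperThreshold", "3"),
                            opt("aws:autoscaling:trigger", "UpperBreachScaleIncrement", "2"),  // add 2 instances
                            opt("aws:autoscaling:trigger", "LowerThreshold", "1.5"),
                            opt("aws:autoscaling:trigger", "LowerBreachScaleIncrement", "-1"), // remove 1 instance
                            opt("aws:autoscaling:asg", "Cooldown", "300")));                   // 300 s cooldown
        }

        private static ConfigurationOptionSetting opt(String namespace, String name, String value) {
            return new ConfigurationOptionSetting(namespace, name, value);
        }
    }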
Our new endpoint is expected to take around 60 seconds to respond, which breaks our auto-scaling model because the averages will now be heavily skewed.
Our original objective was to detect when endpoints encountered latency (we call through to third-party APIs and proxy their results, so any delays mean a third party is timing out or taking longer than planned). To date, the auto-scaling has worked a treat.
What options are available to us when we introduce long-running requests?
Should we look at programmatically increasing and decreasing the number of instances based on the latency of a subset of requests, e.g. an average threshold of 3 seconds for endpoint-a and endpoint-b, but 70 seconds for endpoint-c?
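For illustration, what I have in mind is publishing a custom CloudWatch latency metric per endpoint, roughly like this (the namespace and dimension name are just placeholders):

    import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
    import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
    import com.amazonaws.services.cloudwatch.model.Dimension;
    import com.amazonaws.services.cloudwatch.model.MetricDatum;
    import com.amazonaws.services.cloudwatch.model.PutMetricDataRequest;
    import com.amazonaws.services.cloudwatch.model.StandardUnit;

    public class EndpointLatencyReporter {

        private final AmazonCloudWatch cloudWatch = AmazonCloudWatchClientBuilder.defaultClient();

        /** Publish one latency sample, dimensioned by endpoint, so each
         *  endpoint gets its own metric stream and can have its own threshold. */
        public void recordLatency(String endpoint, double seconds) {
            cloudWatch.putMetricData(new PutMetricDataRequest()
                    .withNamespace("MyApp/Endpoints") // placeholder namespace
                    .withMetricData(new MetricDatum()
                            .withMetricName("Latency")
                            .withUnit(StandardUnit.Seconds)
                            .withValue(seconds)
                            .withDimensions(new Dimension()
                                    .withName("Endpoint")
                                    .withValue(endpoint))));
        }
    }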
We could assume that if 10% of users hit the 60-second endpoint and the other 90% hit the 1-2 second endpoints, then we could set the average threshold higher as a compromise (the weighted average would be roughly 0.9 × 1.5 s + 0.1 × 60 s ≈ 7.4 s). However, I fear this means we won't scale up early enough for the fast endpoints.
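If we went per-endpoint instead, I imagine wiring a CloudWatch alarm per endpoint to the scale-up policy, each with its own threshold, along these lines (the policy ARN matches nothing real, and the namespace is the placeholder from above):

    import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
    import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
    import com.amazonaws.services.cloudwatch.model.ComparisonOperator;
    import com.amazonaws.services.cloudwatch.model.Dimension;
    import com.amazonaws.services.cloudwatch.model.PutMetricAlarmRequest;
    import com.amazonaws.services.cloudwatch.model.Statistic;

    public class EndpointAlarms {
        public static void main(String[] args) {
            AmazonCloudWatch cloudWatch = AmazonCloudWatchClientBuilder.defaultClient();

            // Scale-up policy ARN from the environment's Auto Scaling group
            // (placeholder value here).
            String scaleUpPolicyArn = "arn:aws:autoscaling:...:policy/scale-up";

            createAlarm(cloudWatch, scaleUpPolicyArn, "endpoint-a", 3.0);
            createAlarm(cloudWatch, scaleUpPolicyArn, "endpoint-b", 3.0);
            createAlarm(cloudWatch, scaleUpPolicyArn, "endpoint-c", 70.0); // long-running endpoint
        }

        static void createAlarm(AmazonCloudWatch cw, String policyArn,
                                String endpoint, double thresholdSeconds) {
            cw.putMetricAlarm(new PutMetricAlarmRequest()
                    .withAlarmName("latency-" + endpoint)
                    .withNamespace("MyApp/Endpoints")   // matches the custom metric above
                    .withMetricName("Latency")
                    .withDimensions(new Dimension().withName("Endpoint").withValue(endpoint))
                    .withStatistic(Statistic.Average)
                    .withPeriod(60)                     // evaluate 1-minute averages
                    .withEvaluationPeriods(1)
                    .withThreshold(thresholdSeconds)
                    .withComparisonOperator(ComparisonOperator.GreaterThanThreshold)
                    .withAlarmActions(policyArn));      // fire the scale-up policy
        }
    }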
Thanks,
Rob.