Setting both `min_pending_latency` and `max_pending_latency` sends mixed messages to the autoscaler. More generally, you can tweak the autoscaler either to contain your costs (set a low value for `max_idle_instances` and/or a high one for `min_pending_latency`), or to improve your scalability -- that is, keep latency low for surges of traffic (set a high value for `min_idle_instances` and/or a low one for `max_pending_latency`).
Don't mix the two kinds of tweaks -- in my experience, such "mixed messages" never improve either costs or latency during a surge.
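To make the two stances concrete, here is a sketch of the `automatic_scaling` stanza in `app.yaml` -- the specific values are illustrative only, not recommendations, and you'd pick one stance or the other, never both:

```yaml
# Cost-containment leaning: few spare instances, tolerate some queueing.
automatic_scaling:
  max_idle_instances: 1       # low: don't pay for spare idle instances
  min_pending_latency: 500ms  # high: let requests wait before spinning up new instances

# Scalability leaning (the alternative stance -- don't combine with the above):
# automatic_scaling:
#   min_idle_instances: 3      # high: keep warm instances ready for surges
#   max_pending_latency: 30ms  # low: spin up quickly as soon as requests queue
```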
And yes, I am working to have this fundamental bit of information become part of Google Cloud Platform's official docs -- it's just taking longer than I hoped, which is why, meanwhile, I am posting this answer.
A more advanced alternative, if you're very certain about your patterns of traffic over time, possibilities of surges, and so forth, is to switch from auto-scaled modules to basic-scaled or even manual-scaled ones, writing your own code to start and terminate instances via the Modules API.
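As a sketch of that approach: the first-generation Python Modules API (`google.appengine.api.modules`) only works inside a deployed App Engine app, so below I've separated the scaling *policy* (pure Python, sized from a backlog measure) from the API calls that apply it. All the constants and function names here are my own hypothetical choices, not anything the platform prescribes:

```python
# Hypothetical policy: one instance per 100 queued items, clamped.
QUEUE_DEPTH_PER_INSTANCE = 100
MIN_INSTANCES = 1
MAX_INSTANCES = 10

def target_instances(queue_depth):
    """Pick an instance count proportional to backlog, clamped to bounds."""
    wanted = -(-queue_depth // QUEUE_DEPTH_PER_INSTANCE)  # ceiling division
    return max(MIN_INSTANCES, min(MAX_INSTANCES, wanted))

def apply_scaling(queue_depth, module='worker'):
    """Resize a manual-scaled module; runs only inside App Engine."""
    from google.appengine.api import modules
    current = modules.get_num_instances(module=module)
    wanted = target_instances(queue_depth)
    if wanted != current:
        modules.set_num_instances(wanted, module=module)
```

You'd call something like `apply_scaling` from a cron job or task-queue handler that measures the backlog -- which is exactly where, as I say next, my own predictions of traffic tended to fall short.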
Although, I have to admit, this never worked optimally for me for modules dedicated to serving user traffic (as opposed to task-queue or cron-based "backend" work) -- my users' surges and time patterns never turned out to be as predictable going forward as analyzing past records tantalizingly suggested. So, in the end, I always went back (for user-traffic serving) to good old auto-scaling, perhaps with the modest tweaks, either to reduce costs or to improve scalability, that I recommend above.