We are using Microsoft Azure Bot Service with the QnA Maker cognitive service to develop a QnA bot for a client. Functionality-wise, the bot is ready, but we've been running load tests and noticed some unexpected behaviour.
With a single instance of the QnA Maker app service (S1 plan, Always On enabled), if there are no requests for a while, the application seems to go into an idle state and the memory working set drops to around 30 MB. When new requests start hitting the endpoint, it takes a long time for the application to reinitialize (the KB has 263 QnA pairs, some of which include metadata, prompts and alternate questions).
It can take as long as a minute and a half until the first response is returned. The memory working set metric shows the application loading around 550 MB into memory in that timeframe, and only then does it start processing queries and sending responses. This would be tolerable if it were a one-off warm-up, but it seems to happen every time QnA Maker goes without queries for a while.
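One workaround we are experimenting with is a periodic keep-alive ping that sends a trivial question to the generateAnswer endpoint so the service never idles long enough to unload the KB. A minimal sketch is below; the endpoint URL, KB id and endpoint key are placeholders you would take from the Azure portal, and the `ping` function would be scheduled externally (cron, an Azure WebJob, a Logic App, etc.):

```python
import json
import urllib.request

# Placeholders -- substitute the real values from the Azure portal.
ENDPOINT = "https://<your-qnamaker-app>.azurewebsites.net/qnamaker/knowledgebases/<kb-id>/generateAnswer"
ENDPOINT_KEY = "<endpoint-key>"


def build_request(question):
    """Build a generateAnswer POST request that doubles as a warm-up ping."""
    body = json.dumps({"question": question, "top": 1}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={
            "Authorization": "EndpointKey " + ENDPOINT_KEY,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def ping():
    """Fire one lightweight query; schedule this every few minutes
    to keep the knowledge base loaded in memory."""
    with urllib.request.urlopen(build_request("ping"), timeout=120) as resp:
        return resp.status
```

This obviously just masks the cold-start cost rather than fixing it, so we'd still like to understand the underlying cause.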
When this happens, many of the initial requests from the load test (JMeter, BotServiceStressToolkit) fail with timeouts.
Additionally, even if we set up autoscaling to increase the instance count, each new instance goes through the same process, causing the requests routed to it by the load balancer to fail until the instance has fully initialized. Meanwhile, the existing instances can occasionally fail if they get overloaded with requests.
The end result is roughly a 15% failure rate during the load test.
The same thing happens with a fixed instance count (e.g. 5): if one of the instances idles and then gets hit with a request, it starts initializing again and causes the same issue.
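Another mitigation we've been looking at is IIS Application Initialization, which App Service can use to warm up an instance before it joins the load-balancer rotation. A sketch of the web.config fragment is below; the warm-up path is a placeholder, and we haven't verified whether a plain GET is actually enough to trigger the full knowledge-base load in QnA Maker:

```xml
<configuration>
  <system.webServer>
    <applicationInitialization>
      <!-- IIS issues a warm-up GET to this path before the instance
           receives real traffic. Placeholder path; a plain GET may not
           force QnA Maker to load the KB into memory. -->
      <add initializationPage="/" />
    </applicationInitialization>
  </system.webServer>
</configuration>
```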
Does anyone have suggestions on how this can be handled, or what might be causing the problem? Let me know if any additional data/information is needed; I will check this question often and edit it accordingly.