0

I have a web app running in Azure that was working fine for a couple of years. It's been connecting to a MS SQL Server instance that runs on a VM, mainly because part of it is a legacy app that needed features that Azure SQL didn't support. We've removed some of those old dependencies, and Azure now has SQL Server Managed Instances, so I decided to give that a try.

For some reason though, now many of the web requests that need to access the db have one or more 14-second pauses in the trace. For example:

14-second stalls between some db calls

Each of those SQL lines represents another SQL call to the db. For some of these traces, the whole thing takes around 250ms and the SQL calls are all back-to-back. For some like these, you see nothing for 14 seconds, then one of the SQL calls, then another 14-second gap, and the rest happen in quick succession. Sometimes it's 14 seconds and then they all get called quickly. Sometimes it's 14 seconds, then the first call, then another 14 seconds, then the second call, then another 14 seconds, and then the rest back-to-back.

The histogram of timings is just very odd:

histogram of request timings

I cropped out the sub-300ms responses, which make up around half of the total. The rest are shown here. The chart doesn't line up completely, but the first tall bar is at 14.3 seconds, the second bar is 28.7 seconds, and the third is at 43.2 seconds.

Of course if I switch the connection string back to the VM, everything goes back to running very quickly. Then I go back to the SQL Managed Instance. The code isn't changing at all. The other thing that's odd is that the first SQL line in that timing chart is a completely different db that isn't being changed, so somehow having the main db use the Managed Instance can also mess up the connection to Azure SQL? I don't think it's strictly caching or anything since I can refresh the page many times in a row and sometimes it's quick, sometimes it's 14-28 seconds. I've also tried changing up the conn string to MI, using private and public hostnames.

Bottom line--is there some reason why one or more calls to SQL Server would have 14-second gaps?

mherzig
  • 1,528
  • 14
  • 26
  • These "slow-queries" happen in sequence? like, after the first one, all of them are running slowly? – Leonardo Jan 07 '20 at 20:10
  • More like, there _may_ be 14 seconds before the first query. If so, then there _may_ be another 14 seconds before the second query. If so, then there _may_ be another 14 seconds before the third. At some point they just all follow through and complete like they should. – mherzig Jan 07 '20 at 20:14
  • Are you using Azure SQL Database instance, ElasticPool or Single Database? Are you hitting the allocated capacity? Have you tried using a bigger tier? – Leonardo Jan 07 '20 at 20:22
  • Do all these "slow" queries only come from your application requests? In other words, if you run queries while connecting from SSMS, is there the same "sometimes" 14 second delay? – LaraRaraBoBara Jan 07 '20 at 20:27
  • @Leonardo The Azure SQL db (which we haven't had an issue with before) is in an elastic pool. The SQL Managed Instance is set up with a failover group. I haven't tried connecting directly to the primary of the failover group, although both hostnames resolve to the same IP so I'm not sure the issue is specifically with that. Might be something to try though. – mherzig Jan 07 '20 at 20:32
  • @FembotDBA Ah I forgot to mention that--I've been using DataGrip to run queries and have _not_ noticed any stall. – mherzig Jan 07 '20 at 20:32
  • @mherzig any updates on the resource utilisation? are you being capped? – Leonardo Jan 07 '20 at 20:56
  • I can't find anything on the MI about memory utilization, but the CPU utilization has _peaked_ at **1%**. Seems not to be a resource issue. And the MI is provisioned similarly to the VM it's replacing. – mherzig Jan 07 '20 at 20:57
  • peaked at 1% while you were experiencing 14sec delays? – Leonardo Jan 07 '20 at 21:55
  • Just during usage, period. I'm seeing one peak up to 3% CPU max. – mherzig Jan 07 '20 at 22:00
  • Have you tried to look at Query Store in managed instance and see whether some query now actually takes 14 seconds to be executed, due to a plan regression? There might be some differences in database engine settings between SQL on VM and SQL MI, like compat levels, trace flags, system and database-scoped configurations. Check "Database engine settings" section in this blog post: https://azure.microsoft.com/en-us/blog/key-causes-of-performance-differences-between-sql-managed-instance-and-sql-server/ – Mladen Andzic - Msft Feb 04 '20 at 13:57
  • @mherzig Did you find any solution? I am facing similar problem. Some database calls from Azure Functions app have long delay before actually connecting to database. – Asiful Nobel Nov 20 '21 at 19:37
  • @asiful Not really, we ended up eventually doing two things—moving to Azure SQL instead of MI, and also we found a few places where the code path wasn’t async. After deploying those changes it no longer stalled. If your stalls are multiples of 14.3 seconds then maybe that’s something to look at. If not, then it’s probably a different problem and I would recommend doing some diags and posting a new question. This one is unlikely to get new eyes on it at this point. – mherzig Nov 21 '21 at 03:47

0 Answers0