The DBs corresponding to the different environments (e.g. production, staging) of our web app are on the same Azure SQL database server. While I read everywhere that an Azure SQL server is just a logical container (and the DBs on it may not even be on the same physical machine) we see signs of DBs behaving as noisy neighbors, i.e. operations made on one of them affecting the performance of others.
We've seen the below operations and metrics happening correlate between DBs. Let's call one DB "prod" and the other "stage"; in all cases stage was created by copying prod with the Start-AzureWebAppSqlDatabaseCopy
PowerShell commandlet.
- Scaling up stage correlates with Data IO spike on prod.
- Running performance-heavy operations on stage (deleting thousands of tables, updating around 10 thousand rows) correlates with SQL connection timeouts ("The timeout period elapsed prior to completion of the operation or the server is not responding.") and Data IO spikes on prod.
With both DBs we use separate DB-level user accounts (on why, see this SO post), but the prod and stage user accounts both exist under both DBs (i.e. we use the stage user to connect to the stage DB, but the stage user also exists under the prod DB, and the prod user also exists under the stage DB). We dropped the stage user from the prod DB to see if that makes a difference, but it didn't.
It may be worth noting that when the Web/Business Azure SQL tiers were phase out these DBs were migrated from Web to their current S1 tier. We see the same issue with DBs on another server too. The DBs are NOT part of an elastic pool.
Our findings are inconclusive and these events don't correlate 100% of the time either. We're out of ideas what to investigate, as we are sure that the stage app doesn't connect to the prod DB. We tried to find evidence of the stage app somehow affecting the prod DB but we couldn't. Any input would be appreciated.
Update 1
Using Grant's sys.dm_os_wait_stats
tip, as well as sys.dm_os_performance_counters
it is evident that yes, if you make a copy of the database on the same logical server it will be created on the same physical SQL Server too. Server name in object_name
is the same, wait values are exactly the same.
This doesn't explain however, why operations on the copy affect the original DB. Since it seems that the noisy neighbor effect doesn't happen all the time (the scale up does affect the original DB most of the time, the perf-heavy operations less so, but the correlation is still pronounced) it might be some random Azure problem.
We'll see if using a different logical server fixes the issue. What's sure is that in that case the physical server will also be different, we've checked that.
Update 2
We're monitoring the situation but whether this indeed solves the issue will most possibly be apparent only after several months. For now we have put all DBs on separate servers.
We did notice timeouts on the prod DB always in the same time interval after all operations on the stage DB completed. These timeouts however seem to only happen for table creations. It's like after copying the prod DB to the stage DB the prod DB is somewhat "locked" for a period of time (about 45-60 minutes) and you can't create tables (but you can drop them, those work). Funnily enough this didn't happen today, so maybe it has resolved itself...