Azure SQL databases on same server affect each other's performance

Question

The DBs corresponding to the different environments (e.g. production, staging) of our web app are on the same Azure SQL database server. While I read everywhere that an Azure SQL server is just a logical container (and the DBs on it may not even be on the same physical machine) we see signs of DBs behaving as noisy neighbors, i.e. operations made on one of them affecting the performance of others.

We've seen the below operations and metrics happening correlate between DBs. Let's call one DB "prod" and the other "stage"; in all cases stage was created by copying prod with the Start-AzureWebAppSqlDatabaseCopy PowerShell commandlet.

Scaling up stage correlates with Data IO spike on prod.
Running performance-heavy operations on stage (deleting thousands of tables, updating around 10 thousand rows) correlates with SQL connection timeouts ("The timeout period elapsed prior to completion of the operation or the server is not responding.") and Data IO spikes on prod.

With both DBs we use separate DB-level user accounts (on why, see this SO post), but the prod and stage user accounts both exist under both DBs (i.e. we use the stage user to connect to the stage DB, but the stage user also exists under the prod DB, and the prod user also exists under the stage DB). We dropped the stage user from the prod DB to see if that makes a difference, but it didn't.

It may be worth noting that when the Web/Business Azure SQL tiers were phase out these DBs were migrated from Web to their current S1 tier. We see the same issue with DBs on another server too. The DBs are NOT part of an elastic pool.

Our findings are inconclusive and these events don't correlate 100% of the time either. We're out of ideas what to investigate, as we are sure that the stage app doesn't connect to the prod DB. We tried to find evidence of the stage app somehow affecting the prod DB but we couldn't. Any input would be appreciated.

Update 1

Using Grant's sys.dm_os_wait_stats tip, as well as sys.dm_os_performance_counters it is evident that yes, if you make a copy of the database on the same logical server it will be created on the same physical SQL Server too. Server name in object_name is the same, wait values are exactly the same.

This doesn't explain however, why operations on the copy affect the original DB. Since it seems that the noisy neighbor effect doesn't happen all the time (the scale up does affect the original DB most of the time, the perf-heavy operations less so, but the correlation is still pronounced) it might be some random Azure problem.

We'll see if using a different logical server fixes the issue. What's sure is that in that case the physical server will also be different, we've checked that.

Update 2

We're monitoring the situation but whether this indeed solves the issue will most possibly be apparent only after several months. For now we have put all DBs on separate servers.

We did notice timeouts on the prod DB always in the same time interval after all operations on the stage DB completed. These timeouts however seem to only happen for table creations. It's like after copying the prod DB to the stage DB the prod DB is somewhat "locked" for a period of time (about 45-60 minutes) and you can't create tables (but you can drop them, those work). Funnily enough this didn't happen today, so maybe it has resolved itself...

score 2 · Answer 1 · answered Jul 31 '18 at 01:48

From the information you provide I suspect the issue is the workload of your databases is occasionally I/O intensive, is hitting the tier limits and Azure SQL starts to throttling. That throttling may be behind those timeouts.

Please monitor resource consumption using below query:

SELECT 
    (COUNT(end_time) - SUM(CASE WHEN avg_cpu_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'CPU Fit Percent'
    ,(COUNT(end_time) - SUM(CASE WHEN avg_log_write_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'Log Write Fit Percent'
    ,(COUNT(end_time) - SUM(CASE WHEN avg_data_io_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'Physical Data Read Fit Percent'
FROM sys.dm_db_resource_stats

Service level objective (SLO) of 99.9% <= go to next tier.

Measure DTU consumption over time. Are you getting timeouts when the following query shows high DTU usage.

SELECT start_time, end_time,   
  (SELECT Max(v)    FROM (VALUES (avg_cpu_percent), (avg_data_io_percent),
   (avg_log_write_percent)) AS value(v)) as [avg_DTU_percent] 
FROM sys.resource_stats where database_name = 'AdventureWorksLT'  order by end_time desc

Compare DTU usage vs DTU limit.

SELECT 
 end_time AS [EndTime]
  , (SELECT Max(v) FROM (VALUES (avg_cpu_percent), (avg_data_io_percent), (avg_log_write_percent)) AS value(v)) AS [AvgDTU_Percent]  
  , ((dtu_limit)*((SELECT Max(v) FROM (VALUES (avg_cpu_percent), (avg_data_io_percent), (avg_log_write_percent)) AS value(v))/100.00)) AS [AvgDTUsUsed]
  , dtu_limit AS [DTULimit]
FROM sys.dm_db_resource_stats

Thanks, I'll check out sys.dm_db_resource_stats next time we experience this issue. BTW as I mentioned it's not that we intensively use a DB and it gets throttled; it's that _another_ DB on the same server sees excessive resource usage. It's like the two DBs share the same pool of resources despite Azure documentation suggesting the contrary (and it's not an elastic pool). What sys.resource_stats shows is in my understanding the same I can see on the Portal, and I can also see spikes as on the Portal's charts. — Piedone, Aug 01 '18 at 19:21
Sometimes the graph in the portal does represent correctly the information contained on sys.dm_db_resource_stats, — Alberto Morillo, Aug 01 '18 at 20:51
Tthough I've seen the resource usage increase clearly in the chart. As mentioned I can see the same spikes in the sys.resource_stats data set. — Piedone, Aug 01 '18 at 22:44

score 1 · Answer 2 · answered Jul 30 '18 at 23:51

1

The way I would go about determining if this was the case would be to use sys.dm_os_wait_stats along with sys.dm_db_wait_stats. The OS wait stats are for the "server" that your database is running on and the db wait stats are for the database. Gather the db waits for both databases in question and the os waits for both databases. First, straight up compare the os waits. if they're the same (with some margin, I wouldn't expect them to be exactly the same, although, if they are, there's your answer), you may be seeing everything on the same server. If they're not really the same, but sort of similar, then compare the db wait stats for each database to the OS wait stats and see if you can see a direct correlation.

For management purposes only, I would probably separate these on to different servers anyway, even if this wasn't normally an issue. However, if you can find a correlation, then the best bet is probably to break apart the servers. It won't cost you anything. You pay for the database, not the server.

answered Jul 30 '18 at 23:51

Grant Fritchey

2,645
19
21

Will check those metrics, thank you. We also thought about putting the DBs on different servers just to rule that out, and already started doing it, so we'll see if that solves the problem (even if it does, I'll try to understand the causes still). – Piedone Aug 01 '18 at 19:02
Wouldn't sys.dm_os_performance_counters also give a hint on the server being the same? The object_name column's values are prefixed with the server name (like "MSSQL$8E2D1FA3:Buffer Manager". – Piedone Aug 01 '18 at 20:02
It seems indeed the two DBs are on the same machine, see my updates. – Piedone Aug 01 '18 at 22:43
Ah, I see they are the same. Good to know. @alberto morillo has the right track for determining the load on the system. – Grant Fritchey Aug 02 '18 at 18:35

Azure SQL databases on same server affect each other's performance

2 Answers2