Exchange DAG Auto Failover - Reasons

Question

Our 2013 DAG seems to somewhat arbitrarily activate DBs on other servers and move them off the ones they were active on. Looking at metrics, there were no noticeable spikes in RAM/IO/networking/etc., so I'm not sure why they are moving.

I can't find how to audit why databases move and am looking for a log file or powershell cmdlet which may help to troubleshoot this.

For clarification, simplifying things a lot:

Server 1 has DB1 active
Server 2 has DB2 active
Server 3 has DB3 active

Each server has passive copies of the other two databases. Overnight, for no apparent reason things will move and look like this:

Server 1 has DB1 and DB3 active
Server 2 has no DBs active
Server 3 has DB2 active

Thanks for any help!

PS: In case anyone is dealing with this and wants to stop it at the cost of some features (i.e. automatic failover), consider setting the following policy on each server where you want to block automatic activation:

Set-MailboxServer -Identity EXSRV01 -DatabaseCopyAutoActivationPolicy Blocked

Where EXSRV01 is replaced with the name of the Exchange server to stop autoactivation on.
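
To check what each server is currently set to, and to undo the block later, something like the following should work from the Exchange Management Shell. This is a minimal sketch; EXSRV01 is the same placeholder name as above.

```powershell
# Show the current auto-activation policy for every mailbox server.
Get-MailboxServer | Format-Table Name, DatabaseCopyAutoActivationPolicy -AutoSize

# Revert one server to the default behaviour (automatic activation allowed).
# EXSRV01 is a placeholder; substitute your own server name.
Set-MailboxServer -Identity EXSRV01 -DatabaseCopyAutoActivationPolicy Unrestricted
```

The policy also accepts IntrasiteOnly, which may be a gentler option if the real problem is databases activating across sites.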

Nope, updates are all manual every other week with no scheduled reboots — Abraxas, Nov 05 '15 at 17:38
I'm wondering if it's something to do with the backup process? We have two sites with two servers at each and usually they, because of site preference and DNS round robin/load balancing just have both dbs on one server then it'll bounce back to the other. Sometimes though they fail across sites and that usually leads to a server being overwhelmed and having IIS problems. Then things like NTLM auth stop working and everyone gets angry lol :( @joeqwerty — Abraxas, Nov 05 '15 at 17:39

score 3 · Answer 1 · answered Nov 05 '15 at 18:26

If these are VMs, and the backup process involves taking a VMware snapshot, you could be timing out on the allowed DAG heartbeat. You need to set the SameSubnet and CrossSubnet delay and threshold values higher than the defaults.

http://www.veeam.com/blog/how-to-backup-exchange-database-availability-groups-dags-with-veeam-backup-replication.html

cluster /prop SameSubnetDelay=2000:DWORD
cluster /prop CrossSubnetDelay=4000:DWORD
cluster /prop CrossSubnetThreshold=10:DWORD
cluster /prop SameSubnetThreshold=10:DWORD
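
If you would rather do this from PowerShell, the same properties are exposed on the cluster object by the FailoverClusters module. A rough equivalent, assuming it is run in an elevated session on one of the DAG members, with the same values as above:

```powershell
# Assumption: the FailoverClusters module is available on this DAG member.
Import-Module FailoverClusters

# Record the current heartbeat settings before changing anything.
Get-Cluster | Format-List Name, *SubnetDelay, *SubnetThreshold

# Apply the values from the cluster.exe commands above.
$cluster = Get-Cluster
$cluster.SameSubnetDelay      = 2000   # milliseconds between heartbeats
$cluster.CrossSubnetDelay     = 4000
$cluster.SameSubnetThreshold  = 10     # missed heartbeats tolerated
$cluster.CrossSubnetThreshold = 10
```

Noting the original values first makes it easy to roll back if the change does not help.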

mfinni


You should be able to see MSFT Cluster errors to determine this prior to making heartbeat network changes. There will ALWAYS be an error when a DAG fails over, just may have to dig a bit in Event Viewer. – Chase Nov 05 '15 at 21:11
That's good to know. I wasn't sure if it would be in one of the text log files or somewhere in Event Viewer. I'm mostly looking for 'where' to look. – Abraxas Nov 05 '15 at 21:12

score 2 · Accepted Answer · answered Nov 06 '15 at 17:46

I'll add to my comment for a more complete answer. Building on mfinni's response on clustering: if a database fails over, there is always an error. Exchange's default reaction to anything erroneous is to fail a database over to protect against split-brain scenarios (both copies thinking they're active and causing crimes against humanity).

You can have perfectly reasonable CPU/memory usage and seemingly no network blips, but in MSFT Clustering you'll see failures for many reasons. If clustering thinks it's having an issue, it does the fantastic job of RESTARTING the clustering service to ensure everything is working. When that happens, Exchange will fail over ALL databases. This can be caused by issues such as:

High memory usage beyond a mailbox server's already crazy memory allocation (2013 does a better job here)
Network "blips"; don't offend your network admin here, it could literally be a TTL increase on the heartbeat network OR even a reset of a vSwitch for whatever reason
vMotion... but you have that off, correct? Because that's not supported. ;-)

The clustering logs in Event Viewer will give you the time the "failure" occurred, and you can correlate this with the High Availability logs in Event Viewer to figure out whether an issue built up over time or it was a sudden event. I've seen cases where the database itself was just too busy trying to keep up with mail bombs that some departments caused with out-of-control cron jobs. That pushed the transaction logs over the replication threshold limits for database health... boom... failover.
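
One way to line the two sources up is sketched below. It assumes the usual log names on an Exchange 2013 mailbox server (the System log with the Microsoft-Windows-FailoverClustering provider, and Microsoft-Exchange-HighAvailability/Operational); the 24-hour window is arbitrary, so widen or narrow it to cover the night the databases moved.

```powershell
# Sketch: merge cluster and Exchange HA events into a single timeline.
# Assumption: run locally on a DAG member with rights to read the event logs.
$since = (Get-Date).AddHours(-24)

# Failover Clustering events written to the System log.
$clusterEvents = @(Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    StartTime    = $since
} -ErrorAction SilentlyContinue)

# Exchange High Availability operational events.
$haEvents = @(Get-WinEvent -FilterHashtable @{
    LogName   = 'Microsoft-Exchange-HighAvailability/Operational'
    StartTime = $since
} -ErrorAction SilentlyContinue)

# Interleave both sources by time so cause and effect sit next to each other.
$clusterEvents + $haEvents |
    Sort-Object TimeCreated |
    Select-Object TimeCreated, ProviderName, Id, LevelDisplayName, Message |
    Format-Table -AutoSize -Wrap
```

To check the replication-threshold case, Get-MailboxDatabaseCopyStatus on each server shows CopyQueueLength and ReplayQueueLength for every copy.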

If you find anything in those logs, post it (scrub sensitive data) and I can help more. And make sure you're patched current across all Exchange servers. There were a few CU updates that caused similar issues for no reason.
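
A quick way to confirm every server is on the same build, assuming the Exchange Management Shell:

```powershell
# List the installed Exchange build per server; mismatched builds across
# DAG members are worth fixing before digging further.
Get-ExchangeServer | Format-Table Name, ServerRole, AdminDisplayVersion -AutoSize
```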
