I am running two MySQL databases -- one is on an Amazon AWS cloud server, and another is running on a server in my network.
These two databases are replicating normally in a multi-master arrangement seemingly without issue, but then every once in a while -- a few times a day -- I get an error in my application saying "Plugin instructed the server to rollback the current transaction."
The error persists for a few minutes (around least 15 minutes), and then goes back to replicating normally again. In the MySQL Error logs, I don't see any error, but in the normal log file I do see the rollback happening:
2018-09-10T22:50:25.185065Z 4342 Query UPDATE `visit_team` SET `created` = '2018-09-10 12:34:56.306918', `last_updated` = '2018-09-10 22:50:25.183904', `last_changed` = '2018-09-10 22:50:25.183904', `visit_id` = 'J8R2QY', `station_type_id` = 'puffin', `current_state_id` = 680 WHERE `visit_team`.`uuid` = 'S80OSQ'
2018-09-10T22:50:25.185408Z 4342 Query commit
2018-09-10T22:50:25.222304Z 4340 Quit
2018-09-10T22:50:25.226917Z 4341 Query set autocommit=1
2018-09-10T22:50:25.240787Z 4341 Query SELECT `program_nodeconfig`.`id`, `program_nodeconfig`.`program_id`, `program_nodeconfig`.`node_id`, `program_nodeconfig`.`application_id`, `program_nodeconfig`.`bundle_version_id`, `program_nodeconfig`.`arguments`, `program_nodeconfig`.`station_type_id` FROM `program_nodeconfig` INNER JOIN `supervisor_node` ON (`program_nodeconfig`.`node_id` = `supervisor_node`.`id`) WHERE (`program_nodeconfig`.`program_id` = 'rwrs' AND `supervisor_node`.`cluster_id` = 2 AND `program_nodeconfig`.`station_type_id` = 'osprey')
... Six more select statements happen here, but removed for brevity...
2018-09-10T22:50:25.253520Z 4342 Query rollback
2018-09-10T22:50:25.253624Z 4342 Query set autocommit=1
In the log file above, the Query UPDATE that is attempted in the first line gets rolled back even after the commit statement, and at 2018-09-10T22:50:25.254394I received an application error saying that the query was rolled back.
I've seen the error when connecting to both databases -- both the cloud and internal.
Does anyone know what would cause the replication to fail randomly, but on a regular basis, and then go back to working again?