We are running Percona XtraDB Cluster 5.5 (two nodes plus an arbitrator, garbd) with Galera 2 (2.12). HAProxy provides a transparent proxy (via iptables TPROXY) that directs all traffic to one specific node unless that node becomes unavailable, roughly along the lines of the sketch below.
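For context, the HAProxy side looks something like this. This is a simplified sketch, not the exact config; the server names, addresses, and check user are placeholders:

listen mysql-cluster
    bind *:3306 transparent                   # TPROXY-style transparent bind
    mode tcp
    option mysql-check user haproxy_check     # placeholder monitoring user
    source 0.0.0.0 usesrc clientip            # present connections to the backend with the client's IP
    server db01 <db01 ip>:3306 check          # preferred node, all traffic goes here
    server db02 <db02 ip>:3306 check backup   # only used if db01 is down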
Every 10-20 days we run into an issue that looks something like this. The first sign of an issue in the logs is:
[Warning] Too many connections
This will continue for a few minutes, but then we get something like:
TRANSACTION 2B37091B, ACTIVE 1506 sec, thread declared inside InnoDB 499
mysql tables in use 1, locked 1
3 lock struct(s), heap size 1248, 2 row lock(s), undo log entries 1
MySQL thread id 1498250, OS thread handle 0x7efccc658700, query id 14839064 <db02> <db02 ip> <db> wsrep in pre-commit stage
<update query>
*** WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 773 page no 1707 n bits 304 index `date` of table <table> trx id 2B37091B lock_mode X locks rec but not gap
170427 14:03:47 [Note] WSREP: cluster conflict due to high priority abort for threads:
170427 14:03:47 [Note] WSREP: Winning thread:
THD: 4, mode: applier, state: executing, conflict: no conflict, seqno: 38463147
SQL: (null)
170427 14:03:47 [Note] WSREP: Victim thread:
THD: 1498250, mode: local, state: committing, conflict: no conflict, seqno: 38463644
SQL: <update query>
Then we will get a bunch of:
170427 14:03:49 [Note] WSREP: cluster conflict due to certification failure for threads:
170427 14:03:49 [Note] WSREP: Victim thread:
THD: 1498309, mode: local, state: executing, conflict: cert failure, seqno: 38463678
SQL: <insert query>
Once these are done, the cluster is back to normal, but while it's going on the cluster is effectively hosed and end users report DB interruptions. It wouldn't resolve on its own, but once I added this to my config it started to recover from these events within 1-5 minutes:
wsrep_provider_options="gcs.fc_limit=500; gcs.fc_master_slave=YES; gcs.fc_factor=1.0"
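For what it's worth, the flow-control behaviour these settings tune shows up in the standard Galera status counters, so it can be watched with something like:

-- How often/long replication has been paused by flow control,
-- and how deep the local receive queue gets.
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control%';
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue%';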
My DB config:
[client]
socket=/var/lib/mysql/mysql.sock
[mysqld]
server-id=<id>
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
log-bin
log_slave_updates
expire_logs_days=7
symbolic-links=0
wsrep_provider=/usr/lib64/galera2/libgalera_smm.so
wsrep_cluster_address=gcomm://<gcom string>
binlog_format=ROW
default_storage_engine=InnoDB
wsrep_slave_threads= 8
wsrep_log_conflicts
wsrep_cluster_name=<cluster name>
wsrep_node_name=<node name>
wsrep_node_address=<node ip>
wsrep_provider_options="gcs.fc_limit=500; gcs.fc_master_slave=YES; gcs.fc_factor=1.0"
wsrep_sst_method=xtrabackup-v2
wsrep_sst_auth=<redact>
max_connections=300
innodb_buffer_pool_size=20G
innodb_additional_mem_pool_size = 20M
innodb_autoinc_lock_mode = 2
innodb_buffer_pool_instances = 20
innodb_lock_wait_timeout = 120
innodb_log_buffer_size = 8M
innodb_log_file_size = 48M
innodb_log_files_in_group = 3
innodb_max_dirty_pages_pct = 90
innodb_read_io_threads = 8
innodb_thread_concurrency = 16
innodb_write_io_threads = 8
innodb_file_per_table = 1
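The certification failures and high-priority aborts from the log excerpts above should also show up as Galera status counters; assuming the stock wsrep variable names, they can be polled like this:

-- Local transactions killed by certification failures and by
-- high-priority (brute-force) aborts from applier threads.
SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures';
SHOW GLOBAL STATUS LIKE 'wsrep_local_bf_aborts';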
The tables that the conflicting queries always seem to reference look like this:
+-----------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+-------+
| date | date | NO | MUL | NULL | |
| page_name | varchar(100) | YES | MUL | NULL | |
| page_hits | float | NO | | 1 | |
+-----------+--------------+------+-----+---------+-------+
+-----------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+-------------+------+-----+---------+-------+
| ip_hash | varchar(32) | NO | MUL | NULL | |
| timestamp | timestamp | YES | MUL | NULL | |
+-----------+-------------+------+-----+---------+-------+
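Reconstructed from the DESCRIBE output above, the tables are roughly equivalent to the following (table and index names are placeholders and the exact index definitions are a guess; neither table has a primary key):

CREATE TABLE `table_1` (
  `date`      DATE         NOT NULL,
  `page_name` VARCHAR(100) NULL,
  `page_hits` FLOAT        NOT NULL DEFAULT 1,
  KEY `date` (`date`),
  KEY (`page_name`)
) ENGINE=InnoDB;

CREATE TABLE `table_2` (
  `ip_hash`   VARCHAR(32) NOT NULL,
  `timestamp` TIMESTAMP   NULL,
  KEY (`ip_hash`),
  KEY (`timestamp`)
) ENGINE=InnoDB;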
I'm somewhat at wits' end with Galera at this point and ready to move back to standalone MySQL, but any advice would be greatly appreciated.