2

This question is not about how to solve issues with replication; the intention is to find bugs caused by slow replication. For performance reasons, we do not want all queries to be synchronous, only the queries we identify as critical reads.

We sometimes have bugs related to synchronization on our Galera cluster. For example, our web application redirects after writing data but shows an outdated state on the next page. In the development environment we do not have these problems. In production, under some server load, another node is sometimes not yet synchronized when we read data that was written a few milliseconds earlier.

To solve this, we use node pinning for critical reads, so that we read from the same node we wrote to, and we are experimenting with `SET SESSION wsrep_sync_wait=6;` for INSERT/UPDATE/DELETE as described here to reduce that behavior (and now with bit "1", as Rick James mentioned).
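A minimal sketch of that experiment (the table name is only a placeholder; bits 2 and 4 of `wsrep_sync_wait` cover UPDATE/DELETE and INSERT/REPLACE respectively):

SET SESSION wsrep_sync_wait = 6;   -- 2 (UPDATE/DELETE) + 4 (INSERT/REPLACE)
UPDATE orders SET state = 'paid' WHERE id = 42;   -- waits until this node has applied all write-sets received so far
SET SESSION wsrep_sync_wait = 0;   -- back to the default so the rest of the session is not slowed down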

How to test for bugs caused by slow replication?

Our idea is to simulate very slow synchronization to test our application's critical-read behavior. Is there a config option that makes a Galera cluster act as if it were under heavy load? Galera has built-in flow control to slow things down, but I could not find a reliable way to force a cluster into flow control. The solution does not have to rely on MySQL alone; a slow virtual volume combined with something like `innodb_flush_method` might be helpful, too.
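One possible way to do this (only a sketch, and only for a disposable test cluster): stall the applier on the node that should serve the stale reads, so that node falls behind while the other nodes keep accepting writes.

-- On the node that should lag behind:
SET GLOBAL wsrep_desync = ON;      -- let this node fall behind without pausing the whole cluster via flow control
FLUSH TABLES WITH READ LOCK;       -- blocks the applier; incoming write-sets pile up in the receive queue

-- ... write on the other nodes, read here, and check for stale results ...

UNLOCK TABLES;                     -- the applier resumes and the node catches up
SET GLOBAL wsrep_desync = OFF;

-- To force actual flow control instead, lower its threshold on one node:
-- SET GLOBAL wsrep_provider_options = 'gcs.fc_limit=1';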

(Updated to hopefully improve the question)

Trendfischer

2 Answers

2

You need to include 1 because you are reading.

SET SESSION wsrep_sync_wait=1;   -- or any odd number
SELECT ... ;  -- this will wait if necessary
SET SESSION wsrep_sync_wait=0;   -- optional, to avoid slowing down subsequent SQL
Rick James
  • You are right, we cannot avoid this with a 6. But we cannot do this for all SELECTs; we already tried that, with the unexpected effect of bringing down the whole cluster. We seem to have too many write requests. So we still have to identify critical SELECTs and need some way to simulate a slow cluster for testing. Using a value of 6 just slows down write requests under heavier load. – Trendfischer Nov 28 '18 at 08:58
  • @Trendfischer - For a `SELECT`, I think _only_ 1 is needed. Abandon the `6`. – Rick James Nov 28 '18 at 15:56
  • Sorry, I did not explain it in detail: the 6 is for INSERTs/UPDATEs. By waiting for synchronization on writes, we reduce the risk of missing synchronization in a multi-node write environment. Sadly, this is node based, not cluster based. Under ideal circumstances this would delay the COMMIT until all nodes had written the data, but Galera only delays until all nodes confirm that they could write the data. Our intended purpose is to slow down the writes, not the reads with SELECT, which is what a "1" would do, and which did not work for us. – Trendfischer Nov 30 '18 at 08:37
  • Maybe you should use 6 (or maybe 7) in front of writes and 1 in front of reads? – Rick James Nov 30 '18 at 22:03
  • Sorry, but we cannot use 1 in front of all reads. The "1" works similarly to a LOCK TABLES ... READ on every write request. We have around 5,000 read queries per second; imagine all of them waiting for a single slow write query. This is too slow, and that's why I can't mark this as the accepted answer. This question is about how to simulate slow synchronization in order to find critical reads (which should wait). Our problem is finding bugs, not fixing them. – Trendfischer Dec 03 '18 at 09:31
  • Part of the challenge is to make sure Writes are "fast". Let's see one of your slow writes, plus `SHOW CREATE TABLE`; maybe we can help speed it up. – Rick James Dec 03 '18 at 19:53
  • I have around 15 years of experience programming applications with MySQL, and yes, there are a few queries that could still be optimized. But even if I squeeze 5-10% out of one query or another, how would I know whether it is enough? At the moment I only have a tool like GoReplay to simulate load, but it is not very reliable at reproducing synchronization issues. – Trendfischer Dec 04 '18 at 14:47
  • @Trendfischer - How about this: Focus (via a proxy, etc) on keeping each user connected to the same node, then you can probably avoid the setting. – Rick James Dec 04 '18 at 16:50
  • Yes, that is what I was referring to with "node pinning". Accessing the cluster as if it were sharded could work, too. Letting the writing web request poll the other nodes until the written data is there could be another option, although the last two are tasks I would prefer to have as a feature in Galera itself. – Trendfischer Dec 05 '18 at 12:35
0

Set up a geographically dispersed cluster. Perhaps use some cloud service for distant nodes.

If you had each node on a different continent, the round-trip delay would be 100-200 ms for each COMMIT.

Rick James
  • The speed of light alone won't get us much above 20 ms on a round trip. With some unreliable networks we can get the 100-200 ms, but in reality it would be more like a big range of 10-300 ms. With that, nodes might get evicted from the cluster. Has anyone successfully done this? – Trendfischer Aug 14 '19 at 16:20
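If injected latency does get nodes suspected or dropped, the group-communication timeouts are the knobs to experiment with. An untested sketch (values are ISO 8601 durations; the defaults are around PT5S and PT15S, and if these `evs.*` options are not dynamic in your Galera version, put them into `wsrep_provider_options` in the server configuration instead):

-- Raise the timeouts so a 10-300 ms (or worse) round trip does not get a node declared inactive.
SET GLOBAL wsrep_provider_options = 'evs.suspect_timeout=PT30S; evs.inactive_timeout=PT1M';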