We have two instances of the same Galera cluster — identical database schema, with the data differing by about one month — serving respectively as a staging and a production environment.
They run on identical hardware, the same operating system, the same Galera version, and the same three-node configuration. Network latency is a non-issue: each node resides in a separate Docker container on the same machine, and both clusters are regularly backed up.
There is a particularly heavy query that takes up to 2 seconds on staging, while on production it takes up to two and a half minutes. Examining the query plans, the row estimates at each step are not far apart, but the plans diverge in strategy at the following two steps:
staging (2s):

id  select_type  table  type  possible_keys                   key            key_len  ref            rows  Extra
8   DERIVED      b      ref   bay_depend_id,bay_relies_on_id  bay_depend_id  4        rwms.l.bay_id  2     Using where
8   DERIVED      b      ref   bay_depend_id,bay_relies_on_id  bay_depend_id  4        rwms.l.bay_id  2     Using where

production (2.5 mins):

id  select_type  table  type  possible_keys                   key   key_len  ref   rows  Extra
8   DERIVED      b      ALL   bay_depend_id,bay_relies_on_id  NULL  NULL     NULL  1     Using where; Using join buffer (flat, BNL join)
8   DERIVED      b      ALL   bay_depend_id,bay_relies_on_id  NULL  NULL     NULL  1     Using where; Using join buffer (incremental, BNL join)
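For context, the visible difference is that staging performs an indexed `ref` lookup on `bay_depend_id`, while production ignores both candidate indexes and does a full scan (`type ALL`) joined through a block nested loop (BNL) buffer. One common cause for this kind of divergence on otherwise identical setups is stale or missing index statistics, which skew the optimizer's row estimates. A hedged first diagnostic, assuming the table behind the alias `b` is named `bay_table` (a placeholder — the real name does not appear in the plan):

```sql
-- Hypothetical table name: the plan only shows the alias `b`.
-- Compare index cardinality between staging and production;
-- wildly different Cardinality values point at stale statistics:
SHOW INDEX FROM bay_table;

-- Refresh the statistics on production:
ANALYZE TABLE bay_table;

-- MariaDB also supports engine-independent statistics,
-- which the optimizer can prefer over InnoDB's sampled ones:
ANALYZE TABLE bay_table PERSISTENT FOR ALL;
```

If the cardinality figures differ sharply between the two environments despite similar data volumes, a statistics refresh alone may flip production back to the `ref` plan.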
The query plan for all the other steps is the same, differing only in row counts, but at these two steps the evaluation strategy itself changes. I can't wrap my head around why, or how to resolve it, given that every working condition is practically identical.
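If refreshing statistics does not change the plan, two possible workarounds are forcing the index choice or preventing the optimizer from using a BNL join buffer. This is a sketch, not a drop-in fix: `bay_table` and the column list are placeholders, since the actual query text is not shown, and it assumes the derived-table subquery can be edited directly:

```sql
-- Option 1: force the index the staging plan uses
-- (FORCE INDEX only constrains this one table reference):
SELECT /* ... original select list ... */
FROM bay_table b FORCE INDEX (bay_depend_id)
/* ... rest of the original derived subquery ... */;

-- Option 2: disable join buffering for the session, so the
-- optimizer cannot pick a flat/incremental BNL join
-- (MariaDB's join_cache_level defaults to 2; 0 turns it off):
SET SESSION join_cache_level = 0;
```

Option 2 is the blunter instrument: it affects every join in the session, so it is better used to confirm the diagnosis than as a permanent setting.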