I am using MySQL Cluster to handle parallel processing of a large data stream. The workload occasionally requires large joins (millions of rows) and updates/deletes/inserts of thousands of rows at a time. The cluster appears to handle this quite comfortably most of the time.
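To give a sense of the workload, here is a simplified sketch of the kind of statements involved. The table and column names are made up purely for illustration; the real queries join tables of this size and touch thousands of rows per statement.

-- Bulk update driven by a join (illustrative names only)
UPDATE stream_events e
JOIN event_batches b ON b.batch_id = e.batch_id
SET    e.status = 'processed'
WHERE  b.received_at < NOW() - INTERVAL 1 HOUR;

-- Bulk delete of processed rows, thousands at a time
DELETE FROM stream_events
WHERE  status = 'processed'
LIMIT  5000;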
However, I occasionally run into a problem that appears to be caused by exceeding the allowed number of buffered epochs on an SQL node, which causes that node to be disconnected. I cannot find any pattern connecting this error with the type of queries being executed when it occurs.
When this happens I get the following exception from MySQL:
SQLSTATE[HY000]: General error: 1297 Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER
and then hundreds of these errors from subsequent queries:
SQLSTATE[HY000]: General error: 1296 Got error 157 'Unknown error code' from NDBCLUSTER
The MySQL Cluster log reports the following:
INFO -- Node 7: Disconnecting lagging nodes '0000000000000000000000000000000000000000000000000000000001000000',
WARNING -- Node 7: Disconnecting node 24 because it has exceeded MaxBufferedEpochs (150 > 150), epoch 1758675/1
INFO -- Node 7: Communication to Node 24 closed
INFO -- Node 4: Communication to Node 24 closed
ALERT -- Node 7: Node 24 Disconnected
...
According to the MySQL documentation, "The default value of 100 is sufficient for most normal operations". I increased MaxBufferedEpochs to 150, but this has not helped.
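For reference, this is roughly where the parameter sits in my cluster configuration (config.ini). The surrounding values are placeholders rather than my actual settings; only the MaxBufferedEpochs line is relevant here.

[ndbd default]
# Placeholder values for illustration
NoOfReplicas=2
DataMemory=2G
IndexMemory=512M
# Raised from the default of 100; changing this requires a rolling restart of the data nodes
MaxBufferedEpochs=150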
I have seen configurations that set MaxBufferedEpochs as high as 10000, but that seems excessive.
Can anybody explain why I might be getting this error, and whether it is acceptable to increase this value by orders of magnitude? Also, is there anything else I could or should change to stop this from happening?