
Backing up a MongoDB cluster composed of three nodes, running on an on-premise Kubernetes cluster, using Velero and MinIO with Restic triggers the following fatal error on one of the nodes after restoring the backup:

"ERROR","verbose_level_id":-3,"msg":"__wt_block_read_off:226:WiredTigerHS.wt: potential hardware corruption, read checksum error for 4096B block at offset 172032: block header checksum of 0x63755318 doesn't match expected checksum of 0x22b37ec4" "ERROR","verbose_level_id":-3,"msg":"__wt_block_read_off:235:WiredTigerHS.wt: fatal read error","error_str":"WT_ERROR: non-specific WiredTiger error","error_code":-31802 "ERROR","verbose_level_id":-3,"msg":"__wt_block_read_off:235:the process must exit and restart","error_str":"WT_PANIC: WiredTiger library panic","error_code":-31804 Fatal assertion","attr":{"msgid":50853,"file":"src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp","line":712 \n\n***aborting after fassert() failure\n\n Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n

Please note that:

  • we use the same application on Azure and no such error is triggered there; the backup and the restore on Azure work as expected
  • we reproduced it with MongoDB versions 4.4.11 and 6.0.5

We proceeded with the following steps (a sketch of the corresponding commands follows the list):

  • back up the entire namespace of our application (the application is not in use during this time)
  • delete the namespace
  • delete the claimRef of all PVs (so that they become available again)
  • remove all persistent data stored on the Kubernetes nodes
  • restore the entire namespace
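
For reference, the steps above map roughly to the commands below (the backup name "app-backup" and namespace "my-app" are placeholders; we back up volumes file-level through Restic, hence the --default-volumes-to-restic flag):

    # Back up the whole namespace; volume data goes through Restic to MinIO
    velero backup create app-backup --include-namespaces my-app --default-volumes-to-restic

    # Delete the namespace, then clear the claimRef of the released PVs
    # so they become claimable again on restore
    kubectl delete namespace my-app
    for pv in $(kubectl get pv -o name); do
      kubectl patch "$pv" -p '{"spec":{"claimRef":null}}'
    done

    # (the persistent data on the Kubernetes nodes is wiped manually here)

    # Restore the whole namespace from the backup
    velero restore create --from-backup app-backup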

In this namespace we run Cassandra, RabbitMQ and MongoDB. Everything is restored correctly (including two of the MongoDB nodes), except for one MongoDB node, which most of the time ends up in a "Back-off restarting failed container" state (even after we triggered a manual "mongod --repair" on it).
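
The manual repair attempt looks roughly like this (the pod name "mongodb-2" and dbpath "/data/db" are illustrative of our setup, not literal values):

    # Illustrative pod name and dbpath; mongod must not be running
    # against the same dbpath while --repair is executed, so in practice
    # we run this from a container where the server process is stopped
    kubectl exec -it mongodb-2 -n my-app -- mongod --repair --dbpath /data/db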

Do you know what could cause this issue and how we could solve it?
