4

Running CentOS 7 (various versions)

The rpm DB on my servers keeps getting corrupted. It seems like every few weeks I have to do an rpm rebuild on a server or two.

Where should I look to see what could be the culprit? I know how to fix this when it happens, but how can I identify if this is a specific package I'm installing or if something is triggering this?

red888
  • 4,183
  • 18
  • 64
  • 111
  • Do you maybe have one of these corrupted databases? As in one I (we) could see? – Spooler Apr 16 '17 at 03:27
  • hmm I think they are all fixed right now, but if I find one that is currently corrupt, what would I post? Just a dump of the DB? – red888 Apr 16 '17 at 03:28
  • A link to it in a comment or chat would be fine. However, I have no idea what would be causing this issue at this point so I'm just trying to fish for any data that might help. What do these boxen do? Do they share a common workload of some kind beyond base packages? – Spooler Apr 16 '17 at 03:38
  • They serve different purposes, the rpm issue doesn't seem to follow a specific role. What would you check in the DB for any red flags? maybe the last package that was installed/modified before the corruption? – red888 Apr 16 '17 at 03:41
  • 3
    Is `/var/lib/rpm` on a local filesystem so the lock file can work? Is some auto-backup/restore mechanism overwriting the files? Do sysadmins use `kill -9` on rpm, yum, dnf commands? You can use [auditctl](https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Security_Guide/sec-Defining_Audit_Rules_and_Controls.html) to log all accesses to the db files. – meuh Apr 16 '17 at 14:54
  • 1
    What is the exact error message you are getting which causes you to rebuild the RPM database? – fpmurphy Apr 17 '17 at 00:50
  • 1
    Perhaps you have an automated task which updates the yum cache on a lot of your servers at the same time, which overloads whatever repo you have configured? (i.e. 1000 servers that all run "yum makecache" at midnight?) – Nils Feb 06 '18 at 12:26
  • Make sure that /var/cache/yum isn't filing up the filesystem. Depending on the number of repo's, you need at least 3G free. – Bob Apr 11 '18 at 18:53
  • find out what's changing on those servers. "yum history" would be a good place to start. – frontsidebus Jul 25 '18 at 15:51
  • You can apply RHEL7 solution from here: https://access.redhat.com/solutions/3330211 – Hardoman Dec 16 '20 at 09:57
  • Other option is to install and run with cron dcrpm/ https://github.com/facebookincubator/dcrpm/blob/master/README.md But the installation never completed successfully for me because of multiple python dependencies didn't work (psutil failed to install). – Hardoman Dec 16 '20 at 09:59
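Several of the comments above point to concrete diagnostics. A sketch combining them (the `auditctl` watch from meuh's comment, `yum history` from frontsidebus's) could look like the following; the rule key `rpmdb` is just an arbitrary label for searching the audit log:

```shell
#!/bin/sh
# Diagnostics sketch: run as root on an affected host.

# 1. Log every write to the rpm database directory (meuh's suggestion);
#    -w watches the path, -p wa logs writes/attribute changes, -k tags records.
auditctl -w /var/lib/rpm -p wa -k rpmdb

# 2. After the next corruption, see which process touched the DB:
ausearch -k rpmdb --interpret | tail -n 50

# 3. What package operations ran recently (frontsidebus's suggestion):
yum history list all | head -n 20

# 4. Confirm /var/lib/rpm is on a local filesystem so the lock file works:
df -T /var/lib/rpm
```

The audit rule added with `auditctl` is not persistent; to keep it across reboots it would also need to go into `/etc/audit/rules.d/`.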

1 Answer

1

There is a long string of bugs in which the BDB environment gets corrupted. Some of these were genuine BDB bugs (several found just in the last couple of years) that have been patched in the Fedora/RHEL libdb but are still present in upstream BDB 5.x; I don't know about 6.x, but there you run into the licensing problem. This is a well-known issue with no permanent solution.

Root Cause:

If rpm or yum does not exit cleanly, its lock files are left behind: the region files `__db.001` through `__db.005` remain in /var/lib/rpm, and they record the pid of the process that created them. The problem is usually that no logging or auditing is configured to show what actually killed that process. The most common cause is an automation tool timing out and ending the process abruptly, without letting rpm clear its lock files.
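A hedged sketch of how to inspect a suspect environment before rebuilding (db_stat ships in the libdb-utils package on CentOS 7; run as root):

```shell
#!/bin/sh
# Inspect a possibly stale BDB environment in /var/lib/rpm.

ls -l /var/lib/rpm/__db.*      # leftover region files, if any
fuser -v /var/lib/rpm/__db.*   # is a live process still holding them?

# Dump the lock region, including the pids that own stale lockers:
db_stat -Cl -h /var/lib/rpm

# Only once nothing holds the files, the usual recovery:
# rm -f /var/lib/rpm/__db.*
# rpm --rebuilddb
```

If `fuser` shows a live pid, killing or waiting out that process first avoids rebuilding the database out from under it.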

One possible workaround is to force the use of a private BDB environment. That means practically no locking, but at least queries will not corrupt anything (although a query could itself return garbage if it runs in the middle of a write operation). This is what happens when you run queries as a non-privileged user; and since you can control permissions with sandboxing, you can achieve the same effect by disallowing opening of /var/lib/rpm/.dbenv.lock, which makes rpm fall back to a private environment - meaning it won't open, much less write to, those __db.* files.
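Two hedged ways to get that private-environment behaviour in practice (the systemd unit property is standard; the rest is illustrative):

```shell
#!/bin/sh
# A query run as a non-privileged user cannot open
# /var/lib/rpm/.dbenv.lock for writing, so rpm falls back to a private
# environment and never touches the shared __db.* files:
sudo -u nobody rpm -qa >/dev/null

# The same effect under a systemd sandbox, by making the lock file
# inaccessible to the command:
systemd-run --wait -p InaccessiblePaths=/var/lib/rpm/.dbenv.lock rpm -qa
```

Either way this only protects read-only queries; actual installs still need the shared environment.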

The developers' statement is that it won't be fixed completely:

"Making BDB more reliable would require using transactions there, but this would be an incompatible change, which is the last thing we want to do at this point when we're basically just about to deprecate BDB. Which means we cannot do anything about this, on Berkeley DB backend, unfortunately."

They suggest using the dcrpm utility instead.

dcrpm ("detect and correct rpm") is a tool to detect and correct common issues around RPM database corruption. It attempts a query against your RPM database and runs db4's db_recover if it's hung or otherwise seems broken. It then kills any jobs which had the RPM db open previously since they will be stuck in infinite loops within libdb and can't recover cleanly.

You can download it from the Git repo; the official guide is available in the same place.

Here is what you need to do for installation:

# git clone https://github.com/facebookincubator/dcrpm.git
# cd dcrpm
# python setup.py install

After the installation you can run the tool and add it to cron:

# dcrpm
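An illustrative way to schedule it via cron (the 03:30 time, user and log path are arbitrary choices, not part of the dcrpm docs):

```shell
# Run dcrpm nightly as root and keep its output in a log file:
cat > /etc/cron.d/dcrpm <<'EOF'
30 3 * * * root /usr/bin/dcrpm >> /var/log/dcrpm.log 2>&1
EOF
```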

Unfortunately, the installation always failed for me on CentOS 7 because the Python dependencies never installed properly.

error: Setup script exited with error in psutil setup command: 'extras_require' must be a dictionary whose values are strings or lists of strings containing valid project/version requirement specifiers.

This happened even though psutil itself installed successfully. But other people have reported that dcrpm worked well for them, so give it a try.

Instead, I used another official solution from Red Hat (for RHEL 7):

# curl https://people.redhat.com/kwalker/repos/rpm-deathwatch/rhel7/rpm-deathwatch-rhel-7.repo -o /etc/yum.repos.d/rpm-deathwatch.repo
# yum install -y kernel-{devel,headers}-$(uname -r) systemtap && debuginfo-install -y kernel
# yum install rpm-deathwatch
# systemctl start rpm-deathwatch
# systemctl status rpm-deathwatch
Hardoman
  • 255
  • 1
  • 7