
My mail server setup worked for years. Recently I've started experiencing the following problem:

Mail setup: sendmail+dovecot+procmail

Host file server: CentOS 6.8, NFS exports mail directories to...

Mail server: CentOS 7.3, running as guest VM on host via libvirtd/qemu, NFS mounts /var/spool/mail from host.

Symptoms: Both dovecot and procmail have issued errors (details below) that seem to indicate they don't have permission to write to /var/spool/mail. However, /var/spool/mail has the most general permissions I know how to give, on both the NFS file server and the mail NFS client.

On the mail server (NFS client):

 $ ls -lhd /var/spool/mail
 drwxrwxrwt 5 root mail 6.8M Mar 29 12:37 /var/spool/mail

In mailserver:/etc/fstab:

 filehost:/mail/inbox      /var/spool/mail         nfs     defaults        0 0

On the NFS host:

 $ ls -lhd /mail/inbox
 drwxrwxrwt. 5 root mail 6.8M Mar 29 12:41 /mail/inbox

In filehost:/etc/exports:

 /mail/inbox          mailserver(rw,no_root_squash,async,nohide)

Neither system is running SELinux or iptables (I rely on our site's firewall).

The kinds of things I see:

  • Files with names like BOGUS.normaluser.hex-string. The corresponding log message is

    Mar 29 12:14:34 mailserver procmail[20922]: Renamed bogus "/var/spool/mail/normaluser.lock" into "/var/spool/mail/BOGUS.normaluser.xGAs"

    This can be exceptionally annoying, since there have been times when it's not just the lockfile that's declared bogus, but normaluser's inbox. From normaluser's perspective, their inbox vanishes as they're reading their mail.

  • Files with names beginning with underscores, e.g., _2-E,eu92YB.mailserver.domain.

    There are no corresponding log messages. The contents of these files (which are always 1 byte or 31-33 bytes) suggest that these are lockfiles. A web page I saw yesterday described someone using strace to identify that procmail is writing these files, but I don't know how to use strace to confirm this for myself (and I can't find the page today).

    When I list the files, I see that they're chmod 400, which may be why they're not being deleted:

    -r-------- 1 normaluser    mail 1 Mar 29 12:30 _uZF%kE-2YB.mailserver.domain
    -r-------- 1 normaluser    mail 1 Mar 29 12:30 _uZF+kE-2YB.mailserver.domain
    -r-------- 1 normaluser    mail 1 Mar 29 12:31 _uZF,kF-2YB.mailserver.domain
    -r-------- 1 normaluser    mail 1 Mar 29 12:31 _uZF.kF-2YB.mailserver.domain
    -r-------- 1 normaluser    mail 1 Mar 29 12:31 _uZF+kF-2YB.mailserver.domain

  • Lockfiles that don't go away. Typical mail log entries:

    Mar 29 12:31:01 mailserver dovecot: imap(normaluser): Error: unlink(/var/spool/mail/normaluser.lock) failed: Operation not permitted

    Mar 29 12:31:01 mailserver dovecot: imap(normaluser): Error: file_dotlock_create() failed with mbox file /var/spool/mail/normaluser: Operation not permitted

    For the users, a lockfile that doesn't go away means that all their mail processing halts until I manually delete the lockfile. The permissions seem normal:

    -rw------- 1 normaluser    theirgroup 33 Mar 29 12:30 normaluser.lock

I've played a bit with the dovecot options, based on the dovecot wiki, hoping that I've made a mistake somewhere. The current relevant values are:

 mmap_disable = yes
 dotlock_use_excl = yes
 mail_fsync = optimized
 mail_nfs_storage = no
 mail_nfs_index = no
 mail_privileged_group=mail

Setting mail_nfs_storage=yes doesn't seem to change anything. That makes sense: according to the dovecot wiki, that parameter matters when multiple mail servers access the same directory via NFS, which is not the case here.

I've googled and fiddled, and I can't track down the issue. I'm asking for anything I've overlooked, or for suggestions for additional diagnostics I could run.

Later:

I'm getting closer to a solution. On the client mailserver:

 $ cd /var/spool/mail
 $ sudo -u normaluser touch test
 $ sudo -u normaluser rm test

No problem.

 $ sudo -u dovenull touch test
 $ sudo -u dovenull rm test
 rm: cannot remove ‘test’: Operation not permitted
 $ ls -lh test
 -rw-r--r-- 1 nobody nobody 0 Mar 31 12:03 test

Aha! Files that the dovenull account creates in the NFS-mounted directory show up as owned by nobody, so it can't delete them afterwards. I tried adding a dovenull account to the NFS server (with the same uid/gid), but that hasn't solved the problem:

 $ sudo -u dovenull rm test
 rm: cannot remove ‘test’: Operation not permitted
 $ ls -lh test
 -rw-r--r-- 1 dovenull dovenull 0 Mar 31 12:03 test

This feels like an idmap issue. Here are the only uncommented lines in idmapd.conf on both the client and the server:

[General]
Domain = mydomain.com
[Mapping]
Nobody-User = nobody
Nobody-Group = nobody
[Translation]
Method = nsswitch

I'm close... I can feel it...

Yet later:

I can feel all I want, but that doesn't mean I have the answer. I got the dovenull account to be able to both create and delete in /var/spool/mail (it had to do with looking carefully at /etc/nsswitch.conf and realizing I had to restart NIS), but that did not solve my problem. It turns out the dovenull account doesn't write to /var/spool/mail in the first place.

I used auditctl:

auditctl -w /var/spool/mail -p war -k mail-inbox
ausearch -k mail-inbox > mail-inbox.txt

and verified that the extra .lock files and BOGUS files were being created by dovecot, and the "_" underscore files were being created by procmail. I won't bother posting the audit logs unless someone wants to see them; what they show is that the files are being created with the correct credentials (uid, gid, euid, etc.) and that the deletes fail even though the delete call is made with those same credentials.

So what could cause a file to be created, but be unable to be deleted?
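
One way to reproduce this outside the mail software is to create a file, check which owner the NFS client reports for it, and then delete it. The following is only a sketch, assuming GNU `mktemp` and `stat`; `check_owner_roundtrip` is a hypothetical helper name, not part of any tool mentioned above.

```shell
# Hypothetical helper: create a file in DIR, compare the owner this client
# reports for it against our own uid, then try to delete it again.
check_owner_roundtrip() {
    local dir="$1"
    local f seen_uid
    f=$(mktemp "$dir/idtest.XXXXXX") || return 2
    seen_uid=$(stat -c %u "$f")      # uid of the new file as this client sees it
    if [ "$seen_uid" != "$(id -u)" ]; then
        echo "owner mismatch: file shows uid $seen_uid, we are $(id -u)"
        rm -f "$f"
        return 1
    fi
    rm "$f" || { echo "unlink failed for $f"; return 1; }
    echo "ok"
}
```

Run as the affected user in a loop (for example 100 iterations against /var/spool/mail), an intermittent problem would be expected to show up as an occasional mismatch or unlink failure among the "ok" lines.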

  • Moving to something like `maildrop` for the mail delivery agent might make sense, given how long `procmail` has been unmaintained (e.g. open unfixed CVE). – thrig Mar 29 '17 at 19:00
  • My problem appears to be an NFS-related permissions issue, not a limitation of procmail. Unless I were convinced that switching to maildrop would solve the permissions issue, I wouldn't switch, at least not until this serious issue was solved. – William Seligman Mar 29 '17 at 20:03
  • It's been a long time since I was on NFS (and then only as a user) but I too think I smell an NFS glitch. – tripleee Mar 31 '17 at 04:55

1 Answer


I managed to solve this problem, though it revealed another (less crucial) issue.

The clue was that occasionally, when I would list /var/spool/mail on the NFSv4 client, I would see something like this:

-r-------- 1 4294967294    mail 1 Mar 29 12:30 _uZF%kE-2YB.mailserver.domain
-r-------- 1 4294967294    mail 1 Mar 29 12:30 _uZF+kE-2YB.mailserver.domain
-rw------- 1 normaluser    mail 1 Mar 29 12:31 normaluser

Then when I'd do an "ls -lh" immediately afterwards, I'd see:

-r-------- 1 normaluser    mail 1 Mar 29 12:30 _uZF%kE-2YB.mailserver.domain
-r-------- 1 normaluser    mail 1 Mar 29 12:30 _uZF+kE-2YB.mailserver.domain
-rw------- 1 normaluser    mail 1 Mar 29 12:31 normaluser

That number, 4294967294, is -2 reinterpreted as an unsigned 32-bit integer (2^32 - 2), and is often the UID assigned to the nfsnobody account. This suggested to me that there might be transient idmapd problems. That would be consistent with what I observed: sometimes the mail server would act as if it didn't have rwx permissions via NFS on a file, even one it had just created. Since NFSv4 is the only NFS version that uses idmapd, I switched to NFSv3 by changing a line in /etc/fstab on the mailserver NFS client:
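
The arithmetic behind that number can be checked in the shell. This is a quick sketch, assuming a shell with 64-bit arithmetic (bash on CentOS qualifies); `to_u32` is a hypothetical helper name.

```shell
# Reinterpret a signed value as an unsigned 32-bit integer by masking
# it down to the low 32 bits (4294967295 is 2^32 - 1).
to_u32() {
    echo $(( $1 & 4294967295 ))
}

to_u32 -2    # prints 4294967294
```

As a rough companion check, `find /var/spool/mail -maxdepth 1 -nouser` should list entries whose owner currently has no passwd entry on the client, which is the situation in which `ls` falls back to printing the raw number.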

filehost:/mail/inbox      /var/spool/mail         nfs     defaults,vers=3   0 0

Then I rebooted the mail server, and voila! The NFS problems disappeared. For the record, I'd rebooted the mail server several times while diagnosing the problem, so this is not a case of "fixed by simple reboot."
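
To confirm that the mount really came up as NFSv3, the mount table can be inspected. The sketch below assumes the kernel reports a `vers=` option in /proc/mounts, as Linux NFS clients normally do; `nfs_vers_of` is a hypothetical helper name.

```shell
# Hypothetical helper: read mount-table lines (the /proc/mounts format) on
# stdin and print the NFS "vers=" value for the given mount point.
# Usage: nfs_vers_of /var/spool/mail < /proc/mounts
nfs_vers_of() {
    awk -v mp="$1" '$2 == mp && $3 ~ /^nfs/ {
        n = split($4, opts, ",")          # mount options are comma-separated
        for (i = 1; i <= n; i++)
            if (opts[i] ~ /^vers=/) {
                sub(/^vers=/, "", opts[i])
                print opts[i]
            }
    }'
}
```

`nfsstat -m` shows the same per-mount options in a more readable form, if nfs-utils is installed on the client.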

Of course, this raises the issue of why idmapd has problems. Anyone curious can look at my idmapd.conf configuration above. But that's a separate question, and one of much lower priority for me. I may post that question on serverfault someday.

Later:

A quick web search gave me this: Partially incorrect uid mapping with nfs4/idmapd/ldap-auth

A fix was implemented in kernel 3.13, but the current CentOS 7 kernel is 3.10. I don't know if Red Hat has backported the fix into the current CentOS 7 kernel.

That gives me a clue as to what caused the problem: I'm constantly adding new active users to our cluster environment. At some point the number of users in /var/spool/mail must have grown past whatever threshold triggers the idmapd bug.

  • Sounds like a coherent analysis to me. You should probably mark this answer as accepted so that this question no longer comes up as unresolved. Thanks for following up! – tripleee Apr 21 '17 at 13:32