
The last few days, our hosting provider has had occasional connectivity issues with one of their upstreams. Each time this happens, we end up with a heap of outbound emails that Sendmail gives up on.

550 5.1.2 <*********@hotmail.com>... Host unknown (Name server: hotmail-com.olc.protection.outlook.com.: host not found)

Final-Recipient: RFC822; *********@hotmail.com
Action: failed
Status: 5.1.2
Remote-MTA: DNS; hotmail-com.olc.protection.outlook.com
Last-Attempt-Date: Tue, 17 Nov 2020 04:44:51 +0100

I first thought that maybe we were getting NXDOMAIN from our upstream DNS server, and this made Sendmail treat the error as permanent. So, I changed to use 9.9.9.9 instead of the hosting provider DNS servers. Overnight, they had another brief outage as they restarted a BGP session, and the same thing happened again.

Does anyone know what's going on here, and what's the expected behaviour? I've tried searching, but maybe I haven't come up with the right things to search for.

It seems to me that the sensible thing to do when there's a DNS connectivity issue would be to put the emails in the queue for retrying later, just as when there's a problem talking to the remote server (a temporary error or a connectivity issue). This also seems to be what RFC 5321 specifies.

So, the way I understand it: If the domain does not exist (NXDOMAIN), then treat as a permanent failure and give up. If there is no response from DNS, or the DNS server fails (SERVFAIL), then re-queue.
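To make that rule concrete, here is a minimal Python sketch of the decision I'd expect. It is not Sendmail's actual logic; the outcome labels and the names `Action` and `classify_dns_outcome` are made up for illustration.

```python
from enum import Enum


class Action(Enum):
    BOUNCE = "bounce"    # permanent failure: return the message to the sender
    REQUEUE = "requeue"  # temporary failure: keep the message and retry later


# Hypothetical outcome labels for a single DNS lookup attempt.
# NXDOMAIN is an authoritative "this name does not exist";
# SERVFAIL and TIMEOUT only say that no answer could be obtained right now.
_ACTIONS = {
    "NXDOMAIN": Action.BOUNCE,
    "SERVFAIL": Action.REQUEUE,
    "TIMEOUT": Action.REQUEUE,
}


def classify_dns_outcome(outcome: str) -> Action:
    """Map a DNS lookup outcome to the delivery action described above."""
    try:
        return _ACTIONS[outcome.upper()]
    except KeyError:
        raise ValueError(f"unhandled DNS outcome: {outcome!r}") from None
```

Under this rule, a brief upstream outage would only ever produce `SERVFAIL` or `TIMEOUT`, so every affected message would stay in the queue.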

I'm not sure if this is really a DNS issue or a Sendmail issue. I can't find any relevant resolver settings, so I'm guessing that it's Sendmail that would need to be configured to retry when it can't find a host, if this is not the default.

The server in question runs sendmail-8.14.7-6.el7.x86_64 on CentOS 7.9.2009

Any idea what's going on?

Although the majority of our users use Gmail, these issues only seem to affect recipient domains hosted with charter.net or Microsoft.

The number at the beginning of each line below is the number of failures for that domain.

 73 Host unknown (Name server: hotmail-com.olc.protection.outlook.com.: host not found)
 10 Host unknown (Name server: pkvw-mx.msg.pkvw.co.charter.net.: host not found)
  8 Host unknown (Name server: msn-com.olc.protection.outlook.com.: host not found)
  6 Host unknown (Name server: live-com.olc.protection.outlook.com.: host not found)
  4 Host unknown (Name server: outlook-com.olc.protection.outlook.com.: host not found)

Full log of an example:

Nov 17 04:46:26 llama sendmail[19358]: 0AH3kQNO019355: to=<*********@hotmail.com>, delay=00:00:00, xdelay=00:00:00, mailer=esmtp, pri=133505, relay=hotmail-com.olc.protection.outlook.com., dsn=5.1.2, stat=Host unknown (Name server: hotmail-com.olc.protection.outlook.com.: host not found)
Nov 17 04:46:26 llama sendmail[19358]: 0AH3kQNO019355: 0AH3kQNN019358: DSN: Host unknown (Name server: hotmail-com.olc.protection.outlook.com.: host not found)
Hungrig
  • could you include the relevant sendmail log entries? – AnFi Nov 17 '20 at 13:04
  • @AnFi I'd pasted in what I figured was relevant, since the rest is basically just timestamps, message IDs, hostname and PID. Sendmail is running with standard log level, as these problems weren't expected beforehand. – Hungrig Nov 17 '20 at 14:02

1 Answer


RFC 3463 indicates that this particular situation is a permanent failure:

      X.1.2   Bad destination system address

         The destination system specified in the address does not exist
         or is incapable of accepting mail.  For Internet mail names,
         this means the address portion to the right of the "@" is
         invalid for mail.  This code is only useful for permanent
         failures.

Indeed, the mail server has no way to know that the failure of DNS resolution is temporary and would succeed if retried after some interval, rather than the user making a typo, by far the more common case. Should a user have to wait five days to find out they misspelled the domain name? Moreover, such a problem with the DNS ought not to be hidden; rather it should be investigated and (if it actually is a problem) fixed as soon as possible.

Michael Hampton
  • Surely, the mail server knows this, from the DNS response? If there's no data, and it's flagged with NXDOMAIN, then the destination is invalid. However, if there's a DNS timeout, or a SERVFAIL, then the problem should be assumed to be recoverable. The way I read the RFCs, and the way I always understood this, the system was designed with these issues in mind. – Hungrig Nov 17 '20 at 08:39
  • And in response to waiting for 5 days - most well behaved MTAs will generate a warning after 4 hours if they can't deliver the email. This is exactly for the reasons you state. – Hungrig Nov 17 '20 at 08:43
  • @Hungrig Depends on the implementation. That level of detail might not be available to the MTA or it might ignore it. You'll have to check its source and possibly the resolver to be sure. – Michael Hampton Nov 17 '20 at 08:53
  • I'm surprised that I can't find anything when searching for information about this, but the more I think about it, the more convinced I am that this can't be the expected behaviour. I've been running Sendmail for over 20 years, and don't think I've ever seen a permanent delivery failure in response to temporary network issues before. – Hungrig Nov 17 '20 at 09:58
  • Back in the day, connectivity was expected to be patchy, and systems were designed to cope with this. Even today, networks and DNS aren't 100% reliable, and on a high-volume system, permanently rejecting everything when there's a network outage lasting a few minutes would be really bad. The server in question only sends 5,000-6,000 emails per day, and in total has rejected a bit over 100 auto-generated emails. Imagine a busy ISP server bouncing thousands of emails back to its customers each time there's a brief network glitch. – Hungrig Nov 17 '20 at 10:00
  • 20 years ago is about when I gave up on sendmail for good, so I dunno... – Michael Hampton Nov 17 '20 at 10:08
  • Haha, fair enough :-) I've never had any trouble that's been serious enough to motivate switching, and my Sendmail configs tend to look pretty much the same today as they did back then. – Hungrig Nov 17 '20 at 10:32
  • Here's another theory. The search list (as defined by resolv.conf) contains our own domain. After failing to look up an A record for outlook-com.olc.protection.outlook.com, did it then attempt outlook-com.olc.protection.outlook.com.example.com (with example.com being our domain)? If there were partial connectivity issues rather than a full outage, then this could also explain why it only affected some domains. Still seems weird though, as if this was the case, then I'd expect the same thing to happen more frequently, so I still suspect that something isn't behaving as it should. – Hungrig Nov 17 '20 at 14:24
  • A similar issue, as described by Cloudflare: https://blog.cloudflare.com/debugging-war-story-the-mystery-of-nxdomain/ – Hungrig Nov 17 '20 at 17:48
  • @Hungrig That could be. Current best practice is to not set the search domain and use FQDNs for everything. – Michael Hampton Nov 17 '20 at 21:38
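
The search-list theory from the comments above can be illustrated with a rough Python sketch of how a stub resolver typically expands a name using the `search` and `ndots` settings from resolv.conf. This is a simplified model based on the behaviour described in resolv.conf(5), not a reproduction of the glibc or Sendmail code, and `candidate_names` is a made-up helper name.

```python
def candidate_names(name, search, ndots=1):
    """Return the fully qualified names a stub resolver might try, in order.

    Simplified model (assumption): a trailing dot means the name is
    absolute and the search list is skipped; otherwise the name is tried
    as-is first if it contains at least `ndots` dots, and after the
    search-list candidates if it does not.
    """
    if name.endswith("."):
        return [name]

    absolute = name + "."
    searched = [f"{name}.{domain.rstrip('.')}." for domain in search]

    if name.count(".") >= ndots:
        return [absolute] + searched
    return searched + [absolute]


if __name__ == "__main__":
    # example.com stands in for the local search domain, as in the comment above.
    for fqdn in candidate_names("outlook-com.olc.protection.outlook.com", ["example.com"]):
        print(fqdn)
```

In this model, if the first lookup fails for a temporary reason and the resolver moves on to the search-list candidate, the second lookup can come back with a genuine NXDOMAIN for a name that was never meant to exist. A name with a trailing dot yields only itself, which matches the advice to use FQDNs and avoid search domains.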