8

I know this must come down to a gap in my understanding, but here's the problem.

We recently changed DNS servers from 192.168.1.1 to 192.168.1.2, so I went around to all 8 of my Linux servers and changed /etc/resolv.conf to reflect the change. Note that they all use static addressing; there's no DHCP involved.

After making the change I could immediately test the results using nslookup and dig, and it all looked good. I ran /etc/init.d/networking restart to restart the networking subsystem, and restarted Apache and Postfix on each of the servers, just to be sure.

A few days later I got a report stating that one of our websites wasn't sending emails anymore. Perusing the logs, I found that the mod_php process couldn't resolve DNS entries to send mail. After beating my head on it for about 30 minutes I rebooted the server and everything returned to normal.

The next day, on a different server (running CentOS rather than our normal Ubuntu), I got a report stating that emails weren't going through, and sure enough the logs indicated that Postfix couldn't resolve names. I rebooted and it almost instantly delivered all the queued mail.

So what am I missing here? What portion of this process did I fail to understand correctly?

Gray

4 Answers

11

You probably got bitten by nscd: http://linux.die.net/man/8/nscd

Cheers

HTTP500
  • Thanks! It seems highly possible this is what was giving me problems. I wasn't even aware local DNS caching was part of a typical Linux system. – Gray Aug 31 '09 at 20:51
  • Did you actually test? Jason's hypothesis is possible but not certain. – bortzmeyer Sep 01 '09 at 08:01
  • @bortzmeyer - yes, I agree. Your own answer is the same as I would have given (and indeed have to two related questions recently). It's much more likely to be cached res_init() state than nscd. – Alnitak Sep 03 '09 at 12:39
8

Most applications initialize the resolver once, at startup (with res_init), and never do it again afterwards. That is not a problem for short-lived applications like ping, but it is more serious for long-running daemons, which keep using the name servers they read at startup.

The Apache process (which runs mod_php) was probably in that situation. Restarting Apache would have sufficed.
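One way to find daemons that might still hold old resolver state is to list the processes that started before /etc/resolv.conf was last modified. The following is a minimal Python sketch, assuming a Linux /proc filesystem; the function names are hypothetical and made up for illustration. A process it reports is only a candidate: it may never use the resolver, or it may re-read the file on its own.

```python
import os


def starttime_ticks(stat_line):
    """Return field 22 (starttime, in clock ticks since boot) of /proc/<pid>/stat."""
    # The command name (field 2) is parenthesised and may contain spaces
    # or parentheses, so cut at the last ')' before splitting on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[19])  # rest[0] is field 3, so field 22 is rest[19]


def boot_time(proc_stat_text):
    """Return the boot time (seconds since the epoch) from /proc/stat contents."""
    for line in proc_stat_text.splitlines():
        if line.startswith("btime "):
            return int(line.split()[1])
    raise ValueError("no btime line found")


def process_start_epoch(stat_line, proc_stat_text, ticks_per_second):
    """Convert a process start time to seconds since the epoch."""
    return boot_time(proc_stat_text) + starttime_ticks(stat_line) / ticks_per_second


def predates_config(start_epoch, config_mtime):
    """True if the process started before the config file was last modified."""
    return start_epoch < config_mtime


def stale_candidates(config_path="/etc/resolv.conf"):
    """Return PIDs of processes older than the last change to config_path."""
    config_mtime = os.stat(config_path).st_mtime
    ticks = os.sysconf("SC_CLK_TCK")
    with open("/proc/stat") as f:
        proc_stat_text = f.read()
    candidates = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                stat_line = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        start = process_start_epoch(stat_line, proc_stat_text, ticks)
        if predates_config(start, config_mtime):
            candidates.append(int(entry))
    return candidates
```

Calling stale_candidates() on an affected server returns a list of PIDs (kernel threads included) that can be matched against the daemons you care about, such as Apache or Postfix, to decide which ones to restart.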

bortzmeyer
3

resolv.conf directs resolvers on where to look for names. In most cases, this is going to be the libc resolver, but there may be other cases such as vPostMaster which uses the Python DNS resolver library for SPF lookups.

So, it COULD be that the resolver is caching the resolv.conf information for long-running processes, but it sounded like you restarted postfix, which should have caused it to start using a fresh resolv.conf file.
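Before suspecting a cached copy, it can help to confirm what the file on disk actually lists. Here is a small Python sketch, assuming the usual resolv.conf layout of one "nameserver ADDRESS" entry per line; the function name is hypothetical.

```python
def nameservers(resolv_conf_text):
    """Return the addresses on 'nameserver' lines, in file order."""
    servers = []
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        # Lines beginning with '#' or ';' are comments in resolv.conf.
        if not line or line.startswith(("#", ";")):
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers
```

For example, passing it the contents of /etc/resolv.conf (read with open("/etc/resolv.conf").read()) shows what a freshly started process would pick up, which can then be compared with what a long-running daemon appears to be querying.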

Check your /etc/nsswitch.conf to see if it specifies anything special happening for "hosts". For example, the default Fedora 11 line on my laptop is:

hosts: files mdns4_minimal [NOTFOUND=return] dns

So in this case it uses mdns as well as /etc/hosts and DNS. With a setup like this, if DNS changes weren't being picked up, I'd wonder whether mdns was causing it.
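To check this on several servers at once, the lookup order can be pulled out of the file programmatically. This is a rough Python sketch, assuming the standard "hosts: source [STATUS=action] source ..." syntax; the function name is hypothetical.

```python
import re


def hosts_sources(nsswitch_text):
    """Return the lookup sources on the 'hosts' line, in the order they are tried."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        match = re.match(r"hosts\s*:(.*)", line)
        if match:
            # Remove bracketed action items such as [NOTFOUND=return].
            without_actions = re.sub(r"\[[^\]]*\]", " ", match.group(1))
            return without_actions.split()
    return []
```

Applied to the Fedora line above, this yields files, mdns4_minimal and dns, in that order. Anything other than files and dns in the result is worth a closer look when name resolution behaves unexpectedly.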

Sean

Sean Reifschneider
1

Probably some caching going on. We had a similar problem with sendmail and just restarting the service fixed it.

Sometimes it's easier to just reboot the server and clear all those caches anywhere in the system than spend all that time identifying which service is caching too long. On the other hand, it can turn out to be an investment when it happens again and you know which service to restart.

jldugger
  • I agree, rebooting is the easiest way out, but if the server is critical, finding time to reboot can be difficult. Thanks for your help! – Gray Aug 31 '09 at 20:52