
I have a small Java program that loops calling InetAddress.getByName("example.com") every second. When I run it on a CentOS 6.4 box using 'strace -f' I see that /etc/resolv.conf is opened and read once:

$ grep /etc/resolv.conf strace.out
[pid 24810] open("/etc/resolv.conf", O_RDONLY) = 6

When I run it on Debian 7 I see that /etc/resolv.conf is repeatedly opened or stat()'d:

$ grep  /etc/resolv.conf strace.out
[pid 41821] open("/etc/resolv.conf", O_RDONLY) = 10
[pid 41821] stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=92, ...}) = 0
[pid 41821] open("/etc/resolv.conf", O_RDONLY) = 10
[pid 41821] stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=92, ...}) = 0
[pid 41821] stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=92, ...}) = 0

Both systems have /etc/nsswitch.conf configured with

hosts: files dns

Neither system has a name caching daemon running.

I used the same version of the Oracle HotSpot JVM on both machines to rule out any Java differences.

The CentOS 6.4 box has glibc 2.12 installed. The Debian 7 box has glibc 2.13 installed.

What accounts for the different behavior between the two operating systems with regard to opening and reading /etc/resolv.conf?

user1311618

2 Answers


The Red Hat glibc developers do not consider some bugs in their software to be bugs. One of them is that resolv.conf is not re-read after it changes. glibc treats re-reading as the responsibility of the application, so each and every application has to implement its own logic for this.

Because this is absolutely bonkers, the eglibc developers have fixed this issue. So on non-eglibc systems your application will need to have its own logic for reinitializing nss_dns, or else it will need to be restarted after a resolv.conf change. On eglibc systems (Debian and things based on Debian), you get a less buggy libc.

We found this out the hard way after changing resolv.conf, decommissioning old DNS servers and then having to restart 1200+ MySQL servers. Needless to say, this is not fun.
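The application-side logic mentioned above can be sketched roughly as follows. This is only an illustration, not code from glibc or from any particular application: the helper name `reload_resolver_if_changed` and its return convention are hypothetical, and it assumes a glibc system where `res_init(3)` re-reads the system resolver configuration. As far as I know, upstream glibc 2.26 and later reload /etc/resolv.conf on their own, so a workaround like this would only matter on older releases such as the 2.12 discussed here.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <netinet/in.h>
#include <arpa/nameser.h>
#include <resolv.h>
#include <time.h>

/*
 * Hypothetical helper (the name and return values are assumptions made
 * for this sketch): re-initialise the resolver when the watched file's
 * modification time differs from the one recorded in *last_mtime.
 *
 * Note that res_init() always re-reads the system resolver
 * configuration; `path` only selects which file is watched for changes.
 *
 * Returns 1 if the resolver was re-initialised, 0 if the file is
 * unchanged, and -1 if stat() or res_init() failed.
 */
int reload_resolver_if_changed(const char *path, time_t *last_mtime)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;                  /* file missing or not accessible */

    if (st.st_mtime == *last_mtime)
        return 0;                   /* nothing changed since last check */

    if (res_init() != 0)
        return -1;                  /* resolver could not be re-initialised */

    *last_mtime = st.st_mtime;      /* remember the version now loaded */
    return 1;
}
```

A long-running daemon could keep a `time_t` initialised to 0 and call `reload_resolver_if_changed("/etc/resolv.conf", &last)` before each batch of lookups. Because `st_mtime` has one-second granularity, two edits within the same second may be seen as one.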

Dennis Kaarsemaker
  • Why is this considered "absolutely bonkers"? And why did glibc do it this way? – Michael Hampton Dec 24 '13 at 16:06
  • Because instead of fixing glibc, they place the burden on *every application* out there... As for why they do it? I don't know. I can't read Drepper's mind, and I'm not sure I want to know what goes on in there... – Dennis Kaarsemaker Dec 24 '13 at 16:26
  • The thing is: I'm not sure that glibc is actually broken. Why must `/etc/resolv.conf` be re-read at every DNS lookup? Is it really expected to change that frequently? Now if the behavior was _undocumented_ then I could understand... – Michael Hampton Dec 24 '13 at 17:18
  • It's not reread at every lookup, that would be broken as well :) The behaviour is undocumented and really counterintuitive: glibc takes the responsibility for initializing the nss_dns library, but subsequently makes the application responsible for reinitializing it, even though those applications don't know anything about nss and how it works. How is that not bonkers? – Dennis Kaarsemaker Dec 24 '13 at 17:44
  • Dennis is right, gai in EL6 is intentionally broken because the buggy behaviour has become the "expected behaviour": https://access.redhat.com/site/solutions/541163 – suprjami Dec 24 '13 at 22:09
  • I'm inclined to agree with Dennis here. We were recently bitten by this in production as well, and none of the manpages for `getaddrinfo(3)`, `gethostbyname(3)`, etc. so much as reference `res_init(3)`, let alone the fact that programs are expected to call it in order to pick up an `/etc/resolv.conf` change. `/etc/hosts` *is* reread between DNS queries, but `/etc/resolv.conf` is not...the larger picture is very counter-intuitive from both a troubleshooting and development standpoint. – Andrew B Jun 27 '15 at 18:39

Not only are the C library versions different, but CentOS uses the GNU C library (glibc) whereas Debian uses Embedded GLIBC (eglibc), so the actual implementation of the name lookup functions can differ.

That would probably account for different system call behaviour between these two distributions.

I assume InetAddress.getByName translates into getaddrinfo(), which is a C library function rather than a system call. You could start by reading the source of that function in the relevant C library implementation and version.

Make sure you read the source from the actual package versions you are using. The packages in EL 6.4 have had over two years of improvements applied on top of their original upstream versions. I assume the same is true of the Debian packages.
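To take the JVM out of the picture while comparing the two libraries, a small C reproduction of the lookup loop might help. The sketch below assumes, as above, that the Java call ends up in `getaddrinfo()`; the function names `resolve_once` and `lookup_loop` are made up for this example.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Resolve `host` once. Returns getaddrinfo()'s result code (0 on success). */
int resolve_once(const char *host)
{
    struct addrinfo hints;
    struct addrinfo *result = NULL;
    int rc;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;        /* accept IPv4 and IPv6 answers */
    hints.ai_socktype = SOCK_STREAM;    /* avoid one entry per socket type */

    rc = getaddrinfo(host, NULL, &hints, &result);
    if (rc == 0)
        freeaddrinfo(result);           /* only the side effects matter here */
    return rc;
}

/*
 * Resolve `host` `count` times, sleeping `delay_seconds` between attempts.
 * Returns the number of successful lookups.
 */
int lookup_loop(const char *host, int count, unsigned int delay_seconds)
{
    int ok = 0;

    for (int i = 0; i < count; i++) {
        if (resolve_once(host) == 0)
            ok++;
        if (i + 1 < count)
            sleep(delay_seconds);
    }
    return ok;
}
```

Calling `lookup_loop("example.com", 60, 1)` from a `main()` and running the binary under `strace -f` on both machines should show whether the repeated open() and stat() calls on /etc/resolv.conf come from the C library itself.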

suprjami