
Background

We have a small headless box running Linux kernel 2.6.35 and a variant of the OpenEmbedded distribution on ARM hardware.

As far as we can tell we are using glibc 2.10.1.

The box has an unconnected Ethernet port and a serial-attached GSM/3G modem. We have PPP configured to maintain a continuous connection to the internet. This part works without problems.

We have a program written in C (actually C++) that makes some connections using sockets. The program is heavily multi-threaded using pthreads.

To look up the IP address to connect to, we use gethostbyname().

When there is no connection to the internet, e.g. during initial boot or when the SIM card is removed from the modem, gethostbyname() returns NULL as it should.

The symptom

But occasionally gethostbyname() keeps returning NULL even though the internet connection is up and running.

The error code from getaddrinfo(), when we use that instead, is EAI_NONAME ~ "Name or service not known". We do not have the error code from gethostbyname() at hand, but it was the equivalent.

Our analysis

We have verified (via a serial console) that the internet connection is OK by:

  • looking through /var/log/messages and confirming that pppd reports everything is OK
  • pinging the hostname (it resolves to an IP and replies OK)
  • connecting to the box via ssh on the public IP

We have two threads in the process that use gethostbyname() for the same host. They have slightly different code paths and functions but use common code for the socket functions, including the part that calls gethostbyname().

In the situations where gethostbyname() keeps returning NULL this is usually only true for ONE of the threads and not the same one every time. The other makes the lookup perfectly.

Furthermore, the thread with the failing gethostbyname() can easily be brought back to a working state by a controlled stop of that thread and a restart of the function, which results in a new thread pthread-wise.

In total we are convinced that DNS translation and internet connection are functioning fine at the OS level.

To rule out threading problems we have re-implemented the lookup code using getaddrinfo(), which is reentrant according to the man page. The result is exactly the same.
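For reference, a getaddrinfo()-based lookup of the kind described above might look roughly like the following. This is a minimal sketch, not our actual code: the helper name `resolve_ipv4`, the restriction to IPv4, and the logging to stderr are all assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper (name and signature are illustrative only):
 * resolve `host` to its first IPv4 address and write it as text into `out`.
 * Returns 0 on success, or the EAI_* error code on failure. */
int resolve_ipv4(const char *host, char *out, size_t outlen)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int rc;

    assert(out != NULL && outlen > 0);

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;       /* assumption: IPv4 only */
    hints.ai_socktype = SOCK_STREAM; /* avoid one result per socket type */

    rc = getaddrinfo(host, NULL, &hints, &res);
    if (rc != 0) {
        /* Log both the numeric code and its text, e.g. EAI_NONAME. */
        fprintf(stderr, "getaddrinfo(%s): %s (%d)\n",
                host, gai_strerror(rc), rc);
        return rc;
    }

    /* Convert the first returned address to its numeric text form. */
    rc = getnameinfo(res->ai_addr, res->ai_addrlen,
                     out, (socklen_t)outlen, NULL, 0, NI_NUMERICHOST);
    freeaddrinfo(res);
    return rc;
}
```

Each thread would call something like `resolve_ipv4(hostname, buf, sizeof buf)` before connecting. Logging the return value through gai_strerror() in every thread makes it possible to tell a hard failure such as EAI_NONAME from a temporary one such as EAI_AGAIN.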

To us it seems that exit of the thread results in some kind of cleanup that affects the ability of gethostbyname()/getaddrinfo() to do lookups.

A workaround would of course be to enforce exit of the failing thread, but this would mean a major change in application structure and is not really an option.

The Question

So the question is: Do you have any pointers where to look for a solution or where the real problem might be?

Nicolai Henriksen
  • Is it failing on the same hostname lookup, or on 'random' names? DNS uses UDP by default, which can silently lose packets. But DNS can also use TCP if the response would be too large for a single UDP packet, and if DNS/TCP is firewalled, you'll get no response and a failed lookup as well. – Marc B Feb 03 '14 at 16:45
  • If you do a DNS query from the command line using nslookup (e.g. nslookup google.com), does it resolve? If so, this might point to a problem internal to your program. If not, this might point to some underlying DNS problem on your platform. – mti2935 Feb 03 '14 at 17:12
  • We only have one specific hostname and have not tried any other, as this is the particular host we need to connect to. But as mentioned, it does not fail consistently. A different thread in the same process can do the lookup with no problems, and a restart of the thread recovers the situation. – Nicolai Henriksen Feb 03 '14 at 17:15
  • @mti2935 We are very sure it is an internal problem. Lookups from the command line work. – Nicolai Henriksen Feb 03 '14 at 17:18
  • What kind of libc (glibc/eglibc/uclibc/etc.) and version are you using? Also, what kind of error is reported (with errno or the return value of getaddrinfo)? – nos Feb 03 '14 at 18:27
  • @nos The error code from getaddrinfo() is EAI_NONAME ~ "Name or service not known". The error from gethostbyname() was equivalent but I forgot to record it. – Nicolai Henriksen Feb 04 '14 at 08:20
  • @nos I think we use glibc version 2.10.1, but the bitbake build environment is complicated enough for me to be in doubt. ldd reports a dependency on libc.so.6. – Nicolai Henriksen Feb 04 '14 at 08:21

1 Answer

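    /* Needs <netdb.h> for gethostbyname() and <unistd.h> for sleep(). */
    const char *hostname = "www.example.com";
    struct hostent *a_server = gethostbyname(hostname);

    /* Retry once per second until the lookup succeeds. */
    while (a_server == NULL) {
            sleep(1);
            a_server = gethostbyname(hostname);
    }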
  • Although this code may help to solve the problem, it doesn't explain _why_ and/or _how_ it answers the question. Providing this additional context would significantly improve its long-term educational value. Please edit your answer to add explanation, including what limitations and assumptions apply. – Toby Speight Aug 23 '16 at 10:04