Background
We have a small headless box running Linux kernel 2.6.35 and a variant of the OpenEmbedded distribution on ARM hardware.
As far as we can tell we are using glibc 2.10.1.
The box has an unconnected Ethernet port and a serial-attached GSM/3G modem. We have PPP configured to ensure a continuous connection to the internet. This part works without problems.
We have a program written in C (actually C++) that makes some connections using sockets. The program is heavily multi-threaded using pthreads.
To look up the IP address to connect to, we use gethostbyname().
When there is no connection to the internet, e.g. during initial boot or when the SIM card is removed from the modem, gethostbyname() returns NULL as it should.
The symptom
But occasionally gethostbyname() keeps returning NULL, even though the internet connection is up and running.
When we use getaddrinfo() instead, the error code is EAI_NONAME ("Name or service not known"). We do not have the error code from gethostbyname() at hand, but it was the equivalent.
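Since we do not have the gethostbyname() error code at hand, here is a minimal sketch of how it could be captured on the next failure. The helper names h_errno_name() and lookup_and_log() are hypothetical and not part of our actual code; the h_errno values are the ones defined in <netdb.h>.

```c
#include <assert.h>
#include <netdb.h>
#include <stdio.h>

/* Map an h_errno value to its symbolic name (values from <netdb.h>). */
const char *h_errno_name(int err)
{
    switch (err) {
    case HOST_NOT_FOUND: return "HOST_NOT_FOUND";
    case TRY_AGAIN:      return "TRY_AGAIN";
    case NO_RECOVERY:    return "NO_RECOVERY";
    case NO_DATA:        return "NO_DATA";
    default:             return "UNKNOWN";
    }
}

/* Hypothetical wrapper: look up `host` and log h_errno if the lookup fails. */
struct hostent *lookup_and_log(const char *host)
{
    assert(host != NULL);

    struct hostent *he = gethostbyname(host);
    if (he == NULL) {
        int err = h_errno; /* save it before any other resolver call */
        fprintf(stderr, "gethostbyname(%s) failed: %s (%d)\n",
                host, h_errno_name(err), err);
    }
    return he;
}
```

Calling lookup_and_log() from the common socket code instead of calling gethostbyname() directly would record which h_errno value the failing thread sees.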
Our analysis
Via a serial console, we have verified that the internet connection is OK by:
- Looking through /var/log/messages and confirming that pppd reports everything is OK
- Pinging the hostname (it resolves to an IP address and replies OK)
- Connecting to the box via ssh on its public IP
We have two threads in the process that use gethostbyname() for the same host. They have slightly different code paths and functions but share common code for the socket functions, including the part that calls gethostbyname().
In the situations where gethostbyname() keeps returning NULL, this is usually true for only ONE of the threads, and not the same one every time. The other thread performs the lookup without problems.
Furthermore, the thread with the failing gethostbyname() can easily be brought back to working order by a controlled stop of that thread followed by a restart of its function, which results in a new thread pthread-wise.
All in all, we are convinced that DNS resolution and the internet connection are functioning fine at the OS level.
To rule out threading problems, we have re-implemented the lookup code using getaddrinfo(), which is reentrant according to the man page. The result is exactly the same.
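For reference, a minimal sketch of what such a getaddrinfo()-based lookup can look like. This is not our actual code: the function name resolve_ipv4() and the restriction to IPv4/TCP are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: resolve `host` to an IPv4 address.
 * Returns 0 on success, or the EAI_* code from getaddrinfo() on failure. */
int resolve_ipv4(const char *host, struct sockaddr_in *out)
{
    assert(host != NULL && out != NULL);

    struct addrinfo hints;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;       /* IPv4 only (assumption) */
    hints.ai_socktype = SOCK_STREAM; /* TCP (assumption) */

    struct addrinfo *res = NULL;
    int rc = getaddrinfo(host, NULL, &hints, &res);
    if (rc != 0) {
        /* EAI_NONAME is the code we see in the failing thread. */
        fprintf(stderr, "getaddrinfo(%s) failed: %s (%d)\n",
                host, gai_strerror(rc), rc);
        return rc;
    }

    /* Use the first result; for AF_INET it is a struct sockaddr_in. */
    memcpy(out, res->ai_addr, sizeof *out);
    freeaddrinfo(res);
    return 0;
}
```

Each call keeps its state in caller-provided storage, so there is no shared static buffer as with gethostbyname(). Logging the return code together with a thread identifier would show which thread gets EAI_NONAME.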
To us it seems that the exit of the thread results in some kind of cleanup that affects the ability of gethostbyname()/getaddrinfo() to do lookups.
A workaround would of course be to force the failing thread to exit, but this would mean a major change in the application structure and is not really an option.
The Question
So the question is: Do you have any pointers on where to look for a solution, or on where the real problem might be?