At some point, our JVM (in fact a YARN NodeManager) started to report UnknownHostException. The exception comes from this call:
return InetAddress.getByName(host);
For more than two days afterwards, the exception kept occurring. While the error was being reported, I ran the following tests:
- While the error was happening, ping succeeded and returned the IP address (very weird);
- While the error was happening, I wrote a simple test case to check hostname resolution, and it also succeeded;
- After we restarted the JVM, the error was gone.
This is the code I used for the test:
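import java.net.InetSocketAddress;
import org.apache.hadoop.net.NetUtils;

public class Main {
    public static void main(String[] args) {
        // Prints false when the hostname resolves, true when it does not
        InetSocketAddress addr = NetUtils.createSocketAddr("host-name:8020");
        System.out.println(addr.isUnresolved());
    }
}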
NetUtils is a Hadoop class (org.apache.hadoop.net.NetUtils) which essentially just calls InetAddress.getByName():
public static InetSocketAddress createSocketAddrForHost(String host, int port) {
    String staticHost = getStaticResolution(host);
    String resolveHost = (staticHost != null) ? staticHost : host;
    InetSocketAddress addr;
    try {
        InetAddress iaddr = SecurityUtil.getByName(resolveHost);
        // if there is a static entry for the host, make the returned
        // address look like the original given host
        if (staticHost != null) {
            iaddr = InetAddress.getByAddress(host, iaddr.getAddress());
        }
        addr = new InetSocketAddress(iaddr, port);
    } catch (UnknownHostException e) {
        addr = InetSocketAddress.createUnresolved(host, port);
    }
    return addr;
}
We haven't changed /etc/hosts for a long time.
Environment:
JDK: java version "1.8.0_121"
OS:
Distributor ID: Ubuntu
Description: Ubuntu 14.04.5 LTS
Release: 14.04
Codename: trusty
I believe that at the time the error started, the network really did have a problem. What is weird is:
- Why could the JVM not recover after the network came back (for example, by the time I noticed the error and ran my tests and ping)? The network problem lasted only about 30 minutes, but the JVM kept reporting the error.
- Why did the problem go away after I restarted the JVM?
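Since the test case above ran in a freshly started JVM while the NodeManager had been running for a long time, one way to narrow this down might be to keep a single JVM alive and resolve the name repeatedly, so that a failure and a later recovery (or lack of one) show up in the same process. The following is only a minimal sketch: the class name ResolveProbe, its arguments, and the default host, attempt count, and pause are all made up for illustration and are not part of Hadoop.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.time.Instant;

// Hypothetical helper for illustration; not part of Hadoop or the JDK.
class ResolveProbe {

    // Resolves the host once and reports the outcome as a single line.
    static String probe(String host) {
        try {
            InetAddress addr = InetAddress.getByName(host);
            return "OK " + host + " -> " + addr.getHostAddress();
        } catch (UnknownHostException e) {
            return "FAIL " + host + " (" + e.getMessage() + ")";
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed arguments: host, number of attempts, pause in milliseconds.
        String host = args.length > 0 ? args[0] : "localhost";
        int attempts = args.length > 1 ? Integer.parseInt(args[1]) : 3;
        long pauseMillis = args.length > 2 ? Long.parseLong(args[2]) : 1000L;

        for (int i = 0; i < attempts; i++) {
            // Timestamp each attempt so results can be lined up with network events.
            System.out.println(Instant.now() + " " + probe(host));
            Thread.sleep(pauseMillis);
        }
    }
}
```

Run with something like `java ResolveProbe host-name 10000 30000` to keep it going across a network interruption. If this process recovers while the NodeManager does not, the difference is more likely in the NodeManager's own state than in the OS resolver.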
I checked the JVM configuration: networkaddress.cache.ttl and networkaddress.cache.negative.ttl both have their default values. So after a hostname fails to resolve, a retry should succeed once the network is back.
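To double-check what the running JVM actually sees for these two settings, the security properties can be read back at runtime. This is a small sketch; the class name CacheTtlCheck and the "<not set>" wording are my own. A null result only means the property is not set in the java.security file, in which case the JVM's built-in default applies.

```java
import java.security.Security;

// Hypothetical helper for illustration; not part of Hadoop or the JDK.
class CacheTtlCheck {

    // Returns "name = value", or a marker when the security property is not set.
    static String describe(String name) {
        String value = Security.getProperty(name);
        return name + " = " + (value != null ? value : "<not set, JVM default applies>");
    }

    public static void main(String[] args) {
        System.out.println(describe("networkaddress.cache.ttl"));
        System.out.println(describe("networkaddress.cache.negative.ttl"));
    }
}
```

Running this with the same JDK and the same java.security file as the NodeManager would show whether the effective values match what I expect.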