I work for some administration. We are responsible for app development, I'm personnaly responsible for the software application servers (Glassfish) and there is a team which manages the infrastructure (network, load balancer, oracle db, physical servers (Solaris on x86 machines)).
Now, sometimes, things get wrong and our application stops working. I try to take a lot of traces whenever there is a problem:
- from the jvm running the appserver: jstack -l
- from the system: prstat, pstack, vmstat, netstat
But, as it often happens in production, I don't have too much time and have to bounce the server asap.
Now, some demo server which isn't too much used crashed (saw it this morning, could be a couple of days ago). It seems to have the same symptoms as our unexplained app crash:
- app server seems to wait for something (in production, it seems to be related to some db connection)
- dba doesn't see anything on db side
- cpu levels stay low, other things just work (web admin console, ...)
The oracle db connection uses an ldap lookup. On our demo server, it seems to be stuck waiting for an ldap connection to do something. Network team doesn't see any connection on the other side. Now, in netstat (from the zone I'm root), I definitely can see the connection as ESTABLISHED.
My question is: how can a connection become ESTABLISHED, sitting there, waiting for something, and why can't they see anything from the other side.
I'm guessing that, if this can happen for an ldap connection, it could happen to anything (db connection, ...).
The server is still in this state, so I can experiment on it (and give some more data) for some time (not too long).