My Java process's file descriptors are going "bad" and I have no idea why

Question

I have a Java webapp, built with Lucene, and I keep getting various "file already closed" exceptions; which one I get depends on which Directory implementation I use. I've been able to get "java.io.IOException Bad File Descriptor" and "java.nio.channels.ClosedChannelException" out of Lucene, usually wrapped around an AlreadyClosedException for the IndexReader.

The funny thing is, I haven't closed the IndexReader, and it seems the file descriptors are going stale on their own. I'm using the latest version of Lucene 3.0 (I haven't had time to upgrade out of the 3.0 series), the latest version of Oracle's JDK6, the latest version of Tomcat 6 and the latest version of CentOS. I can replicate the bug with the same software on other Linux systems, but not on Windows systems, and I don't have an OS X machine to test with. The Linux servers are virtualized with QEMU, if that could matter at all.

This also seems to be load related: how often it happens tracks the number of requests per second that Tomcat is serving to this particular webapp. For example, on one server every request completes as expected until the load reaches ~2 reqs/sec; then about 10% of requests start having their file descriptors closed from under them, mid-request (the code checks for a valid IndexReader object at the beginning of each request and creates one if needed). Once the load gets to about 3 reqs/sec, all of the requests start failing with bad file descriptors.

My best guess is that somehow there's resource starvation at an OS level and the OS is cleaning up fds... but that's simply because I've eliminated every other idea I've had. I've already checked the ulimits and the filesystem fd limits and the number of open descriptors is well below either limit (example output from sysctl fs.file-nr: 1020 0 203404, ulimit -n: 10240).
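
The same two numbers can also be read from inside the running JVM, which helps rule out a mismatch between the shell's limits and the limits the Tomcat process actually runs under. A minimal sketch, assuming a Sun/Oracle-style JDK on a Unix-like system where the platform bean implements com.sun.management.UnixOperatingSystemMXBean; the class name FdUsage is only illustrative:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class FdUsage {

    /**
     * Returns {open, max} file descriptor counts for this JVM process,
     * or null when the JVM does not expose them (for example on Windows).
     */
    public static long[] snapshot() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return null;
    }

    public static void main(String[] args) {
        long[] s = snapshot();
        System.out.println(s == null
                ? "fd counts are not available on this JVM"
                : "open=" + s[0] + " max=" + s[1]);
    }
}
```

Logging snapshot() at the start of each request while the load ramps up would show whether the open count is anywhere near the limit when the failures begin.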

I'm almost completely out of things to test and I'm no closer to solving this than the day that I found out about it. Has anyone experienced anything similar?

EDIT 07/12/2011: I found an OS X machine to use for some testing and have confirmed that this happens on OS X. I've also done testing on physical Linux boxes and replicated the issue, so the only OS on which I've been unable to replicate it is Windows. I'm guessing this has something to do with POSIX handling of file descriptors, because that seems to be the only relevant difference between the test systems (JDK version, Tomcat version and webapp were all identical across all platforms).

No. The filesystem on the virtualized server is ext3, as is the filesystem on the host machine. There's no NFS anywhere involved. — oorza, Jul 11 '11 at 19:51
Given that I've now replicated this in OSX (HFS) and on a physical linux box using ext4, I think it's probably safe to rule out the filesystem as a culprit. — oorza, Jul 12 '11 at 20:16
I was going to post this as an answer, then decided against it: No. I have not experienced anything similar. — MirroredFate, Jul 12 '11 at 20:39
We had a very similar problem a couple years back with Lucene. The issue was race conditions between user threads calling close and the finalizer thread double closing the file descriptors. It doesn't sound like your environment is at all the same but the following may be of interest: http://256.com/gray/docs/misc/java_bad_file_descriptor_close_bug.shtml — Gray, Jul 12 '11 at 20:48
Do you use IndexReader.reopen? If so, can you try swapping it with a new IndexReader? — mindas, Jul 12 '11 at 21:19

score 2 · Answer 1 · answered Jul 18 '11 at 06:32

The reason you probably don't see this happening on Windows might be that FSDirectory.open defaults to SimpleFSDirectory there.

Check out the warnings at the top of FSDirectory and NIOFSDirectory (the text in red at http://lucene.apache.org/java/3_3_0/api/core/org/apache/lucene/store/NIOFSDirectory.html):

NOTE: Accessing this class either directly or indirectly from a thread while it's interrupted can close the underlying file descriptor immediately if at the same time the thread is blocked on IO. The file descriptor will remain closed and subsequent access to NIOFSDirectory will throw a ClosedChannelException. If your application uses either Thread.interrupt() or Future.cancel(boolean) you should use SimpleFSDirectory in favor of NIOFSDirectory

https://issues.apache.org/jira/browse/LUCENE-2239
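
The behavior that warning describes comes from java.nio itself and can be seen without Lucene. The sketch below uses only the JDK; the class name InterruptClosesChannel and the temp-file prefix are made up for illustration. It sets the current thread's interrupt flag, reads from a FileChannel, and then reads again after the flag has been cleared:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.FileChannel;

public class InterruptClosesChannel {

    /**
     * Returns three strings: the exception seen by the read made while the
     * thread was interrupted, the exception seen by a later read, and
     * whether the channel is still open afterwards.
     */
    public static String[] demo() throws IOException {
        File tmp = File.createTempFile("interrupt-demo", ".bin");
        tmp.deleteOnExit();
        RandomAccessFile raf = new RandomAccessFile(tmp, "rw");
        try {
            raf.write(new byte[] {1, 2, 3, 4});
            FileChannel channel = raf.getChannel();
            String first = "none";
            String second = "none";

            // Simulate a Thread.interrupt() or Future.cancel(true) arriving
            // before the I/O call: the interrupt flag is set on this thread.
            Thread.currentThread().interrupt();
            try {
                channel.read(ByteBuffer.allocate(4), 0);
            } catch (ClosedByInterruptException e) {
                first = "ClosedByInterruptException";
            } finally {
                Thread.interrupted(); // clear the flag again
            }

            // The thread is no longer interrupted, but the channel stays closed.
            try {
                channel.read(ByteBuffer.allocate(4), 0);
            } catch (ClosedChannelException e) {
                second = "ClosedChannelException";
            }
            return new String[] {first, second, String.valueOf(channel.isOpen())};
        } finally {
            raf.close();
        }
    }

    public static void main(String[] args) throws IOException {
        String[] r = demo();
        System.out.println("interrupted read: " + r[0]);
        System.out.println("later read:       " + r[1]);
        System.out.println("channel open:     " + r[2]);
    }
}
```

The point of the second read is that one interrupted request is enough to break the shared channel for every later caller, which would look like descriptors going stale on their own. Interrupts do not have to come from the webapp's own code, so searching for Thread.interrupt() calls may not be sufficient to rule this out.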

I can reproduce the problem with NIOFSDirectory, SimpleFSDirectory, and MMapDirectory. I've talked to my boss and we've decided that the best solution seems to be to take the webapp rewrite that was planned for the distant future and do it now. For curiosity's sake though, I went through and removed all close() calls, there are no Thread.interrupt() calls or Future.cancel() calls. There's a comment on the original post that describes a similar condition caused by a JDK bug and because I can't reproduce this in all operating systems, I'm inclined to believe that's what it is. — oorza, Jul 19 '11 at 18:16
