I'm in charge of a rapidly growing Hadoop cluster for my employer, currently built on release 0.21.0 with CentOS as the OS for each worker and master node. I've worked through most of the standard configuration issues (load balancing, I/O planning for HDFS, ensuring enough disk space is available for spill operations, and so forth), but have found no good documentation on managing the number of file descriptors required by each task tracker, data node, mapper, or reducer.
The documentation I've read so far (across Hadoop and HBase) vaguely points to the spill operation consuming a large number of descriptors simultaneously when it attempts to write to disk. This documentation of course provides no breakdown of the scope or expected lifetime of said descriptors. The only suggestion given has been to raise the system limit, which is plausible as a workaround and spurious as a strategy for long-term planning.
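For reference, the only mitigation I've applied so far is that blunt system-limit increase, done through PAM limits for the account that runs the Hadoop daemons (the user name and the value below are just what I picked, not a recommendation):

    # /etc/security/limits.conf -- raise the per-process open-file limit
    # for the account the Hadoop daemons run as (mine is "hadoop")
    hadoop  soft  nofile  32768
    hadoop  hard  nofile  32768

    # confirm from a fresh login, then restart the daemons so they pick it up
    su - hadoop -c 'ulimit -n'

That keeps the exceptions at bay for now, but the value is pure guesswork rather than the kind of planning I'm after.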
I have no information on what assumptions Hadoop makes regarding the number of file descriptors it requires. As a result, a configuration-dependent calculation of the total number of file descriptors required per mapper, reducer, task tracker, and data node over the lifetime of a normal job (that is, one not dependent on MultipleOutputs) would be extremely useful.
Does such a calculation currently exist, and if so, can I use it to make reasonable estimates of what my descriptor limits should be for an arbitrary number of concurrently running jobs?
(To increase the likelihood that this question will be found by others experiencing this issue: Hadoop will happily throw java.io.EOFException and java.io.IOException (pointing to a Bad File Descriptor) when the pool of available descriptors has been exhausted. This took me several hours to track down, as the messages included with these exceptions are extremely generic.)
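In case it helps anyone confirm they're hitting the same wall, this is roughly how I watched live descriptor usage on a worker node while tracking it down. It assumes jps is on the PATH and that you can read /proc/<pid>/fd for the daemon processes:

    # Count open descriptors for the TaskTracker and DataNode JVMs on this node.
    # jps ships with the JDK and prints "<pid> <main class>" for each JVM.
    jps | egrep 'TaskTracker|DataNode' | while read pid name; do
        echo "$name ($pid): $(ls /proc/$pid/fd | wc -l) open descriptors"
    done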