
I'm currently in charge of a rapidly growing Hadoop cluster for my employer, built on release 0.21.0 with CentOS as the OS for each worker and master node. I've worked through most of the standard configuration issues (load balancing, IO planning for HDFS, ensuring enough disk space is available for spill operations, and so forth), but have found no good documentation on managing the number of file descriptors required by each task tracker, data node, mapper, or reducer.

The documentation I've read so far (across Hadoop and HBase) vaguely points to the spill operation consuming a large number of descriptors simultaneously when it attempts to write to disk. This documentation, of course, provides no breakdown of the scope or expected lifetime of those descriptors. The only suggestion given has been to raise the system limit, which is plausible as a workaround but spurious as a strategy for long-term planning.
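In the meantime, the best I can do is monitor how close each daemon runs to its ceiling. Below is a minimal sketch of the approach I've been taking, assuming a Sun/Oracle JVM on Linux where the OperatingSystemMXBean can be cast to com.sun.management.UnixOperatingSystemMXBean; it's monitoring, not planning, but it at least turns the ulimit guesswork into something observable:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

import com.sun.management.UnixOperatingSystemMXBean;

/**
 * Minimal sketch: report current vs. maximum file descriptors for this JVM.
 * Assumes a Sun/Oracle JVM on Linux/Unix, where the OperatingSystemMXBean
 * can be cast to com.sun.management.UnixOperatingSystemMXBean.
 */
public class FdUsage {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            long open = unix.getOpenFileDescriptorCount();
            long max  = unix.getMaxFileDescriptorCount();
            System.out.printf("open=%d max=%d (%.1f%% of the pool in use)%n",
                              open, max, 100.0 * open / max);
        } else {
            System.err.println("Descriptor counts unavailable on this platform/JVM.");
        }
    }
}
```

Pointing something like this (or just lsof -p <pid> | wc -l from cron) at the task tracker and data node processes shows how quickly the pool drains under load, but it still doesn't tell me what limit I should be provisioning for.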

I have no information about what assumptions Hadoop makes regarding the number of file descriptors it requires. As a result, a configuration-dependent calculation of the total number of file descriptors required per mapper, reducer, task tracker, and data node over the lifetime of a normal job (that is, one not dependent on MultipleOutputs) would be extremely useful.
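To make the request concrete, here is the shape of the calculation I'm imagining. I want to stress that the per-task multipliers below are my own placeholder assumptions, not figures from the Hadoop source; the configuration keys (io.sort.factor, mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum, mapred.reduce.parallel.copies) are simply the knobs I'd expect an authoritative formula to depend on:

```java
/**
 * Rough sketch of a per-node descriptor estimate. The multipliers below are
 * placeholder assumptions, NOT values taken from the Hadoop source; the real
 * per-mapper and per-reducer costs are exactly what I'm asking for.
 */
public class FdEstimate {

    public static long estimatePerWorkerNode(int mapSlots,        // mapred.tasktracker.map.tasks.maximum
                                              int reduceSlots,     // mapred.tasktracker.reduce.tasks.maximum
                                              int ioSortFactor,    // io.sort.factor (streams merged at once)
                                              int parallelCopies,  // mapred.reduce.parallel.copies
                                              int daemonBaseline) { // assumed fds held by the TT + DN daemons
        // Assumption: each mapper may have up to io.sort.factor spill segments
        // open at once during a merge, plus a handful for its own input/output.
        long perMapper = ioSortFactor + 4;

        // Assumption: each reducer holds descriptors for its parallel shuffle
        // fetches plus a merge of up to io.sort.factor segments.
        long perReducer = parallelCopies + ioSortFactor + 4;

        return daemonBaseline
                + (long) mapSlots * perMapper
                + (long) reduceSlots * perReducer;
    }

    public static void main(String[] args) {
        // Example numbers only: 8 map slots, 4 reduce slots, io.sort.factor=100,
        // 5 parallel copies, and an assumed 500-descriptor daemon baseline.
        System.out.println(estimatePerWorkerNode(8, 4, 100, 5, 500));
    }
}
```

Even a sanctioned version of something this crude, with the real per-mapper and per-reducer costs filled in, would let me derive sensible ulimits from the cluster configuration instead of reacting to exceptions after the fact.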

Does such a calculation currently exist, and if so, can I use it to make reasonable estimates of what my limits should be for an arbitrary number of jobs as defined above?

(To increase the likelihood that this question will be found by others experiencing this issue: Hadoop will happily throw java.io.EOFException and java.io.IOException (pointing to a Bad File Descriptor) when the pool of available descriptors has been exhausted. This took me multiple hours to track down, as the messages included with these exceptions are extremely generic.)

MrGomez

1 Answer


This is a major source of problems in the Hadoop ecosystem, and AFAIK there isn't a good answer to comprehensive planning for this kind of resource. Overall, there isn't an enterprise-quality Hadoop distribution that will support the laudable level of diligence that you are applying to your system.

I am pretty sure that there will be one in the next few months, however.

Ted Dunning
  • Implied +1. That seems to be my general read of the situation as well. I'm considering throwing the question at StackOverflow, simply to see if anyone knows what finite operations cause file descriptor overkill. MultipleOutputs is one of the more obvious ones. – MrGomez Dec 04 '10 at 20:38