We are hitting "No space left on device" errors with Spark jobs running on our YARN cluster.
This has two bad consequences. First, the Spark jobs slow down or fail. Second, once a disk fills up, the YARN NodeManager marks the node unhealthy and it is removed from the pool.
Is there a way to configure the maximum disk space that jobs are allowed to use on each NodeManager?
I'm hoping to be able to say something like "I have a disk of 1TB, you can use up to 900GB for jobs" and have YARN manage those resources in such a way that the disk never fills up.
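For context, the closest thing I've found so far is the NodeManager disk health checker threshold in yarn-site.xml. If I understand it correctly, it only controls when a disk is considered full enough to mark the node unhealthy, not how much space jobs are allowed to use, so it doesn't give me the "900GB budget" behaviour I'm after (the 90 below is just an illustrative value):

    <!-- yarn-site.xml (sketch, not a working solution) -->
    <property>
      <!-- Mark a local dir as bad once it is more than 90% full; as far as
           I can tell this changes when the node goes unhealthy, not how
           much space jobs may consume. -->
      <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
      <value>90.0</value>
    </property>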
Alternatively, how can I make sure that YARN keeps removing old data from its local directories so the disk doesn't fill up? I don't care if that causes jobs to fail; that's inevitable when you overuse resources.
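In case it helps, this is roughly what I've been experimenting with for cleanup in yarn-site.xml, assuming I've understood these properties correctly. As far as I can tell they only cover the localized-resource cache and finished containers, not the shuffle/spill data of still-running applications, which is why I'm still seeing the disks fill up:

    <!-- yarn-site.xml (cleanup-related settings I've tried) -->
    <property>
      <!-- Target size of the NodeManager's localized-resource cache, in MB. -->
      <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
      <value>10240</value>
    </property>
    <property>
      <!-- How often the cache cleanup runs, in milliseconds. -->
      <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
      <value>600000</value>
    </property>
    <property>
      <!-- Kept at 0 so container directories are deleted as soon as the
           application finishes rather than being retained for debugging. -->
      <name>yarn.nodemanager.delete.debug-delay-sec</name>
      <value>0</value>
    </property>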