We are hitting "No space left on device" errors with Spark jobs running on our YARN cluster.
This has two bad consequences. First, the Spark jobs slow down or fail. Second, once a disk fills up, the YARN NodeManager marks the node unhealthy and it is removed from the pool.
Is there a way to configure the maximum disk space that jobs are allowed to use on each NodeManager?
I'm hoping to be able to say something like "I have a disk of 1TB, you can use up to 900GB for jobs" and have YARN manage those resources in such a way that the disk never fills up.
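For context, the closest thing I've found so far is the NodeManager disk health checker threshold in yarn-site.xml. If I understand it correctly, it only controls when a disk is considered full enough to mark the node unhealthy, not how much space jobs are allowed to use, so it doesn't give me the "900GB budget" behaviour I'm after (the 90 below is just an illustrative value):

    <!-- yarn-site.xml (sketch, not a working solution) -->
    <property>
      <!-- Mark a local dir as bad once it is more than 90% full; as far as
           I can tell this changes when the node goes unhealthy, not how
           much space jobs may consume. -->
      <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
      <value>90.0</value>
    </property>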
Alternatively, how can I make sure that YARN keeps removing old data from its local directories so the disk doesn't fill up? I don't care if that causes jobs to fail; that's inevitable when you overuse resources.
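In case it helps, this is roughly what I've been experimenting with for cleanup in yarn-site.xml, assuming I've understood these properties correctly. As far as I can tell they only cover the localized-resource cache and finished containers, not the shuffle/spill data of still-running applications, which is why I'm still seeing the disks fill up:

    <!-- yarn-site.xml (cleanup-related settings I've tried) -->
    <property>
      <!-- Target size of the NodeManager's localized-resource cache, in MB. -->
      <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
      <value>10240</value>
    </property>
    <property>
      <!-- How often the cache cleanup runs, in milliseconds. -->
      <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
      <value>600000</value>
    </property>
    <property>
      <!-- Kept at 0 so container directories are deleted as soon as the
           application finishes rather than being retained for debugging. -->
      <name>yarn.nodemanager.delete.debug-delay-sec</name>
      <value>0</value>
    </property>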