This is something I haven't been able to find an answer to anywhere.
I have a YARN cluster with some slaves. When a slave fails (chaos monkey, scale-down, etc.), the ResourceManager doesn't notice: even a `rmadmin -refreshNodes` doesn't fix it, and the ResourceManager keeps listing the failed nodes as RUNNING. How do I get the ResourceManager to check the slaves' health and remove them when they fail?
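For reference, here is a minimal sketch of what I'm doing to check and refresh the node list (standard YARN CLI commands; hostnames and any cluster-specific paths are omitted, and my exact invocation may differ slightly):

```sh
# List every NodeManager the ResourceManager knows about, with its state.
yarn node -list -all

# Ask the ResourceManager to re-read its include/exclude host files.
yarn rmadmin -refreshNodes

# Listing again still shows the dead slave as RUNNING.
yarn node -list -all
```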