I spun up a Ray cluster on AWS to run a grid search for my tf.keras model. For almost all jobs (around 500), the workers throw a RayOutOfMemoryError after some number of iterations (between 6 and 53), and I cannot figure out where the issue is, nor can I detect any pattern. Each job takes less than 1 GiB of memory when I run it locally. If a job itself did not fit into memory, I would expect the workers to raise the error right in the first iteration.
The cluster consists of:
Head node: 1 x m4.large (2 vCPU, 8 GiB memory)
Workers: 50 x m4.xlarge (4 vCPU, 16 GiB memory)
I set the resources for each worker to {'cpu': 2, 'memory': 4 * 1024 ** 3}.
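For completeness, here is a heavily simplified sketch of how I launch the trials; the real KerasTrainable, the model, and the hyperparameter grid are of course different, the placeholders below only show the structure and the resource settings:

import numpy as np
import tensorflow as tf

import ray
from ray import tune


class KerasTrainable(tune.Trainable):
    # Heavily simplified stand-in for my real trainable.

    def _setup(self, config):
        # Tiny placeholder model; the real one is larger but still < 1 GiB locally.
        self.model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
            tf.keras.layers.Dense(1),
        ])
        self.model.compile(optimizer=tf.keras.optimizers.Adam(config["lr"]), loss="mse")
        self.x = np.random.rand(1024, 32)
        self.y = np.random.rand(1024, 1)

    def _train(self):
        # One "iteration" = one epoch; the OOM hits somewhere between iteration 6 and 53.
        history = self.model.fit(self.x, self.y, epochs=1, verbose=0)
        return {"loss": history.history["loss"][-1]}


ray.init(address="auto")  # connect to the running cluster from the head node

tune.run(
    KerasTrainable,
    config={"lr": tune.grid_search([1e-2, 1e-3, 1e-4])},      # placeholder grid
    resources_per_trial={"cpu": 2, "memory": 4 * 1024 ** 3},  # 2 CPUs, 4 GiB heap limit per trial
)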
I get the following error output:
ray.exceptions.RayTaskError: ray_worker (pid=2170, host=ip-xxx-xxx-xxx-xxx)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/memory_monitor.py", line 141, in raise_if_low_memory
round(self.heap_limit / (1024**3), 4)))
ray.memory_monitor.RayOutOfMemoryError: Heap memory usage for ray_KerasTrainable_2170 is 4.2515 / 4.0039 GiB limit
Things I have observed: when I log into a worker node that raised the error, I cannot see anything suspicious in top/htop. Jobs scheduled after the error was raised can also run a couple of iterations before the error occurs, so the memory pressure seems to be temporary.
I understand from the documentation that part of each worker's heap memory is used by the shared object store. Could the shared object store across the 50 workers grow until it eats up all of the workers' memory at some point after a couple of iterations?
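If that is what is happening, I assume I could cap the object store explicitly. On a single node I guess that would look roughly like the sketch below; for the autoscaled cluster I suppose the same limit would have to go into the ray start command on the workers via --object-store-memory (please correct me if I am wrong):

import ray

# Sketch (untested assumption): start Ray with an explicit cap on the plasma
# object store so it cannot grow into the heap the trials need.
# The 1 GiB value is an arbitrary guess, not a recommendation.
ray.init(object_store_memory=1 * 1024 ** 3)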
I also increased the memory of each worker to 6 GiB, but then the workers just ran out of memory a little later.
Does anybody have an explanation and a solution for this problem?