
I spun up a Ray cluster on AWS to run a grid search for my tf.keras model. For almost all jobs (around 500) the workers throw a RayOutOfMemoryError after some number of iterations (between 6 and 53), and I can neither figure out where the issue is nor detect any pattern. Each job takes less than 1 GiB of memory when I run it locally. If the job itself did not fit into memory, I would expect the workers to raise the error right in the first iteration.

The cluster consists of:
  • Head node: m4.large (2 vCPU, 8 GiB memory)
  • Workers: 50 x m4.xlarge (4 vCPU, 16 GiB memory)

I set the resources for each worker to {'cpu': 2, 'memory': 4 * 1024 ** 3}.
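Roughly, this is how the trials are launched (a simplified sketch; the actual `KerasTrainable` body and search space are more involved than shown here, and the `lr` grid is only illustrative):

```python
import ray
from ray import tune

# Connect to the already running cluster started by the autoscaler
# (older Ray versions use redis_address instead of address).
ray.init(address="auto")

class KerasTrainable(tune.Trainable):
    def _setup(self, config):
        # Build and compile the tf.keras model from the sampled hyperparameters.
        self.model = None  # placeholder in this sketch

    def _train(self):
        # Run one training epoch and report metrics back to Tune.
        return {"mean_loss": 0.0}  # placeholder metrics

tune.run(
    KerasTrainable,
    config={"lr": tune.grid_search([1e-2, 1e-3, 1e-4])},  # illustrative grid
    resources_per_trial={"cpu": 2, "memory": 4 * 1024 ** 3},
)
```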

I get the following error output:

ray.exceptions.RayTaskError: ray_worker (pid=2170, host=ip-xxx-xxx-xxx-xxx)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/ray/memory_monitor.py", line 141, in raise_if_low_memory
    round(self.heap_limit / (1024**3), 4)))
ray.memory_monitor.RayOutOfMemoryError: Heap memory usage for ray_KerasTrainable_2170 is 4.2515 / 4.0039 GiB limit

Things I have observed: when I log into a worker node that raised the error, I cannot see anything suspicious in top/htop. Jobs scheduled after the error was raised also manage to run a couple of iterations before the error occurs. So the memory pressure seems to be temporary.

I understand from the documentation that part of each worker's heap memory is used by the shared object store. Could the shared object store for the 50 workers grow to the point where it eats up all of the workers' memory after a couple of iterations?
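If that were the case, capping the object store size explicitly would be one way to test it. A minimal local sketch (the 2 GiB value is an arbitrary guess, not something from my cluster config):

```python
import ray

# Start a local Ray instance with an explicitly capped plasma object store,
# so the store cannot silently grow into the heap budget of the trials.
ray.init(object_store_memory=2 * 1024 ** 3)  # 2 GiB, arbitrary for illustration
```

On the cluster, the equivalent would presumably be passing `--object-store-memory` to the `ray start` commands in the autoscaler config.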

I also increased the memory of each worker to 6 GiB, but then the workers just run out of memory a little later.

Does anybody have an explanation and a solution for this problem?

Denis
  • Can you provide a script for reproduction? What happens when you set memory to 8*1024**3? – richliaw Sep 07 '19 at 14:19
  • As I said, increasing the memory just delays the effect. I tried it with 6 * 1024 ** 3 and could then run a couple more iterations before the error occurred. I will try to produce a minimal script that reproduces the error this week. – Denis Sep 09 '19 at 06:16
  • Just now the following log message from `ray` caught my attention: `Starting the Plasma object store with 5.15 GB memory using /tmp`. Does that mean that everything stored in `/tmp` will be loaded into memory? This would explain why my workers run out of memory, since I normally store my experiment results in `/tmp` because I upload them to S3 automatically from that location anyway. – Denis Sep 09 '19 at 12:38
  • I finally figured out the reason for the constant memory growth. `ray` is not responsible for it; it seems to me that `tensorflow`/`tf.keras` has a memory leak when compiled in a certain way. I have reported the issue here: https://github.com/tensorflow/tensorflow/issues/32394. After uninstalling the optimized tensorflow build and reinstalling it normally with `pip`, the jobs run without error. – Denis Sep 11 '19 at 09:28

0 Answers