I am new to Ray. I am using it to run concurrent data transformations on multiple datasets in Python.
My head node runs on a VM with the following specs:
32 cores, 64 GB RAM, 256 GB storage
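Here is roughly what my driver code looks like (a simplified sketch; transform_dataset, the dropna step, and the dataset paths are placeholders, not my exact code):

import ray
import pandas as pd

ray.init(address="auto")  # connect to the already-running cluster

@ray.remote
def transform_dataset(path):
    # load one dataset, apply a transformation, and write the result back out
    df = pd.read_csv(path)
    df = df.dropna()  # stand-in for the real transformation
    df.to_csv(path + ".out", index=False)
    return path

dataset_paths = ["data/part-000.csv", "data/part-001.csv"]  # placeholder paths
futures = [transform_dataset.remote(p) for p in dataset_paths]
ray.get(futures)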
I am getting the following error:
Traceback (most recent call last):
ray.get(futures)
File "/home/azureuser/env/ray/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
File "/home/azureuser/env/ray/lib/python3.8/site-packages/ray/_private/worker.py", line 2289, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayOutOfMemoryError): ray::core() (pid=7201, ip=172.21.0.18)
ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node 32-64gb-256 is used (60.16 / 62.8 GB).
In addition, up to 0.0 GiB of shared memory is currently being used by the Ray object store.
---
--- Tip: Use the `ray memory` command to list active objects in the cluster.
--- To disable OOM exceptions, set RAY_DISABLE_MEMORY_MONITOR=1.
From another VM, I have added an extra node (which I can see when running ray status
on the head node) with the following specs:
16 cores, 112 GB RAM, 800 GB storage
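For reference, this is roughly how I started the cluster (the head node IP is a placeholder and I used the default port):

# on the head node VM
ray start --head --port=6379

# on the second VM, joining the existing cluster
ray start --address='<head-node-ip>:6379'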
I still get the same out-of-memory error against the head node's 64 GB. My intention is for Ray to use the cores and memory of both the head node and the extra node.
Since I can see the extra node in the list of available nodes, why does Ray seemingly not use its resources? Am I misunderstanding the way Ray works?
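If it helps, this is how I can check what resources the driver actually sees (using ray.cluster_resources() and ray.nodes()):

import ray

ray.init(address="auto")  # attach to the running cluster

# Totals across all nodes that have joined the cluster
print(ray.cluster_resources())

# Resources that are currently free
print(ray.available_resources())

# One entry per node, with its address and resources
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Resources"])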