
I am trying out the promising multiprocessing package ray. I have a problem I seem unable to solve. My program runs fine the first time, but on a second run this exception is raised on the `ray.put()` line:

ObjectStoreFullError: Failed to put object ffffffffffffffffffffffffffffffffffffffff010000000c000000 in object store because it is full. Object size is 2151680255 bytes.
The local object store is full of objects that are still in scope and cannot be evicted. Tip: Use the `ray memory` command to list active objects in the cluster.

What do I want to do: In my actual code (which I'm still planning to write) I need to process many big_data_objects sequentially. I want to hold one big_data_object in memory at a time and run several heavy (independent) computations on it in parallel. When these are done, I have to replace the big_data_object in the object store with a new one and start the computations (in parallel) again.

Using my test script I simulate this by starting the script again without calling ray.shutdown(). If I shut down ray using ray.shutdown(), the object store is cleared, but reinitializing then takes a long time and I cannot process multiple big_data_objects sequentially as I want to.

What sources of information have I studied: I studied the Ray Design Patterns document, in particular the section 'Antipattern: Closure capture of large / unserializable object' and what the proper pattern(s) should look like. I also studied the getting started guide, which led to the following test script.

A minimal example to reproduce the problem: I created a test script to test this:

#%% Imports
import ray
import time
import psutil
import numpy as np


#%% Testing ray
# Start Ray
num_cpus = psutil.cpu_count(logical=False)
if not ray.is_initialized():
    ray.init(num_cpus=num_cpus, include_dashboard=False)

# Define function to do work in parallel
@ray.remote
def my_function(x):  # Later I will have multiple (different) my_functions to extract different feature from my big_data_object
    time.sleep(1)
    data_item = ray.get(big_data_object_ref)
    return data_item[0,0]+x

# Define large data
big_data_object = np.random.rand(16400,16400)  # Define an object of approx 2 GB. Works on my machine (16 GB RAM)
# big_data_object = np.random.rand(11600,11600)  # Define an object of approx 1 GB.
# big_data_object = np.random.rand(8100,8100)  # Define an object of approx 500 MB.
# big_data_object = np.random.rand(5000,5000)  # Define an object of approx 190 MB.
big_data_object_ref = ray.put(big_data_object)

# Start 4 tasks in parallel.
result_refs = []
# for item in data:
for item in range(4):
    result_refs.append(my_function.remote(item))
    
# Wait for the tasks to complete and retrieve the results.
# With at least 4 cores, this will take 1 second.
results = ray.get(result_refs)
print("Results: {}".format(results))


#%% Clean-up object store data - Still there is a (huge) memory leak in the object store.
for index in range(4):
    del result_refs[0]

del big_data_object_ref

Where do I think it's going wrong: I think I delete all the references to the object store at the end of the script, so the objects should be cleared from the object store (as described here). Apparently something is wrong, because the big_data_object remains in the object store. The results are deleted from the object store just fine, however.

Some debug information: I inspected the object store using the `ray memory` command; this is what I get:

(c:\python\cenv38rl) PS C:\WINDOWS\system32> ray memory
---------------------------------------------------------------------------------------------------------------------
 Object ID                                                Reference Type       Object Size   Reference Creation Site
=====================================================================================================================
; worker pid=20952
ffffffffffffffffffffffffffffffffffffffff010000000b000000  LOCAL_REFERENCE       2151680261   c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; worker pid=29368
ffffffffffffffffffffffffffffffffffffffff010000000b000000  LOCAL_REFERENCE       2151680261   c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; worker pid=17388
ffffffffffffffffffffffffffffffffffffffff010000000b000000  LOCAL_REFERENCE       2151680261   c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; worker pid=24208
ffffffffffffffffffffffffffffffffffffffff010000000b000000  LOCAL_REFERENCE       2151680261   c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; worker pid=27684
ffffffffffffffffffffffffffffffffffffffff010000000b000000  LOCAL_REFERENCE       2151680261   c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; worker pid=6860
ffffffffffffffffffffffffffffffffffffffff010000000b000000  LOCAL_REFERENCE       2151680261   c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; driver pid=28684
ffffffffffffffffffffffffffffffffffffffff010000000b000000  LOCAL_REFERENCE       2151680261   c:\python\cenv38rl\lib\site-packages\ray\worker.py:put_object:277 | c:\python\cenv38rl\lib\site-packages\ray\worker.py:put:1489 | c:\python\cenv38rl\lib\site-packages\ray\_private\client_mode_hook.py:wrapper:47 | C:\Users\Stefan\Documents\Python examples\Multiprocess_Ray3_SO.py:<module>:42
---------------------------------------------------------------------------------------------------------------------

--- Aggregate object store stats across all nodes ---
Plasma memory usage 2052 MiB, 1 objects, 77.41% full

Some of the things I have tried: If I replace my_function with:

@ray.remote
def my_function(x):  # Later I will have multiple different my_functions to extract separate feature from my big_data_objects
    time.sleep(1)
    # data_item = ray.get(big_data_object_ref)
    # return data_item[0,0]+x
    return 5

then the script successfully clears the object store, but my_function can no longer use the big_data_object, which I need it to.

My question is: How do I fix my code so that the big_data_object is removed from the object store at the end of my script, without shutting down ray?

Note: I installed ray using pip install ray, which gave me version ray==1.2.0, which I am using now. I use ray on Windows and develop in Spyder v4.2.5 in a conda (actually miniconda) environment, in case that is relevant.

EDIT: I have also tested on an Ubuntu machine with 8 GB RAM. For this I used a big_data_object of 1 GB. I can confirm the issue also occurs on this machine.

The ray memory output:

(SO_ray) stefan@stefan-HP-ZBook-15:~/Documents/Ray_test_scripts$ ray memory 
---------------------------------------------------------------------------------------------------------------------
 Object ID                                                Reference Type       Object Size   Reference Creation Site
=====================================================================================================================
; worker pid=18593
ffffffffffffffffffffffffffffffffffffffff0100000001000000  LOCAL_REFERENCE       1076480259   /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/function_manager.py:fetch_and_register_remote_function:180 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_process_key:140 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_run:87 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/threading.py:run:870
; worker pid=18591
ffffffffffffffffffffffffffffffffffffffff0100000001000000  LOCAL_REFERENCE       1076480259   /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/function_manager.py:fetch_and_register_remote_function:180 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_process_key:140 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_run:87 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/threading.py:run:870
; worker pid=18590
ffffffffffffffffffffffffffffffffffffffff0100000001000000  LOCAL_REFERENCE       1076480259   /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/function_manager.py:fetch_and_register_remote_function:180 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_process_key:140 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_run:87 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/threading.py:run:870
; driver pid=17712
ffffffffffffffffffffffffffffffffffffffff0100000001000000  LOCAL_REFERENCE       1076480259   (put object)  | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/_private/client_mode_hook.py:wrapper:47 | /home/stefan/Documents/Ray_test_scripts/Multiprocess_Ray3_SO.py:<module>:43 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/spyder_kernels/customize/spydercustomize.py:exec_code:453
; worker pid=18592
ffffffffffffffffffffffffffffffffffffffff0100000001000000  LOCAL_REFERENCE       1076480259   /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/function_manager.py:fetch_and_register_remote_function:180 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_process_key:140 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_run:87 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/threading.py:run:870
---------------------------------------------------------------------------------------------------------------------

--- Aggregate object store stats across all nodes ---
Plasma memory usage 1026 MiB, 1 objects, 99.69% full

I have to run the program in Spyder so that after execution I can inspect the object store's memory using `ray memory`. If I run the program in PyCharm, for example, ray is automatically terminated when the script completes, so I cannot check whether my script clears the object store as intended.

Stefan
    Hmm this is interesting. Would you mind doing 2 things? 1. Is it possible for you to try with Mac or Linux machine? It is just to figure out if the issue is due to some subtle Windows bugs 2. Also, do you mind reposting it to our Github homepage? I'd love to try reproducing it and see the problem here; https://github.com/ray-project/ray/issues – Sang Mar 31 '21 at 18:35
  • @Sang, I posted the same on your GitHub page: https://github.com/ray-project/ray/issues/15058 – Stefan Apr 01 '21 at 09:02

1 Answer


The problem is that your remote function captures big_data_object_ref, and the reference from there is never removed. Note that when you do this type of thing:

# Define function to do work in parallel
@ray.remote
def my_function(x):  # Later I will have multiple (different) my_functions to extract different feature from my big_data_object
    time.sleep(1)
    data_item = ray.get(big_data_object_ref)
    return data_item[0,0]+x

# Define large data
big_data_object = np.random.rand(16400,16400)
big_data_object_ref = ray.put(big_data_object)

big_data_object_ref is serialized into the remote function definition. Thus there is a permanent reference until you remove this serialized function definition (which lives in the ray internals).
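The same lifetime rule can be illustrated in plain Python, independent of Ray: a closure keeps whatever it captures alive until the function itself is deleted. A minimal sketch using `weakref` (the `BigData` class is just a stand-in for a large object):

```python
import weakref

class BigData:
    """Stand-in for a large object."""
    pass

def make_task(data):
    # Analogue of the antipattern: the inner function captures `data`
    # in a closure cell, so the object lives as long as the function does.
    def task():
        return data
    return task

obj = BigData()
probe = weakref.ref(obj)   # lets us observe when the object is freed

task = make_task(obj)
del obj                    # our own name is gone...
alive_while_captured = probe() is not None   # ...but the closure still holds it

del task                   # dropping the function releases the captured object
freed_after_del = probe() is None
```

Here `alive_while_captured` is True and `freed_after_del` is True (in CPython, where reference counting frees the object immediately). In Ray the captured reference sits inside the serialized function definition on every worker, which is why `del big_data_object_ref` in the driver is not enough.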

Instead use this type of pattern:

#%% Imports
import ray
import time
import psutil
import numpy as np


#%% Testing ray
# Start Ray
num_cpus = psutil.cpu_count(logical=False)
if not ray.is_initialized():
    ray.init(num_cpus=num_cpus, include_dashboard=False)

# Define function to do work in parallel
@ray.remote
def my_function(big_data_object, x):
    time.sleep(1)
    return big_data_object[0,0]+x

# Define large data
#big_data_object = np.random.rand(16400,16400)  # Define an object of approx 2 GB. Works on my machine (16 GB RAM)
# big_data_object = np.random.rand(11600,11600)  # Define an object of approx 1 GB.
big_data_object = np.random.rand(8100,8100)  # Define an object of approx 500 MB.
# big_data_object = np.random.rand(5000,5000)  # Define an object of approx 190 MB.
big_data_object_ref = ray.put(big_data_object)
print("ref in a driver ", big_data_object_ref)

# Start 4 tasks in parallel.
result_refs = []
# for item in data:
for item in range(4):
    result_refs.append(my_function.remote(big_data_object_ref, item))
    
# Wait for the tasks to complete and retrieve the results.
# With at least 4 cores, this will take 1 second.
results = ray.get(result_refs)
print("Results: {}".format(results))
print(result_refs)

#%% Clean-up object store data
#for index in range(4):
#    del result_refs[0]
del result_refs

del big_data_object_ref
time.sleep(1000)  # Keep the driver alive so you can inspect the object store with 'ray memory'.

The difference is that we now pass big_data_object_ref as an argument to the remote function instead of capturing it in the remote function.

Note: When an object reference is passed to a remote function as a top-level argument, it is automatically dereferenced, so there is no need to call ray.get() inside the remote function. If you'd like to explicitly call ray.get() inside a remote function, pass the object reference inside a list or dictionary as an argument. In that case, you get something like:

# Remote function
@ray.remote
def my_function(big_data_object_ref_list, x):
    time.sleep(1)
    big_data_object = ray.get(big_data_object_ref_list[0])
    return big_data_object[0,0]+x

# Calling the remote function
my_function.remote([big_data_object_ref], item)

Note 2: You use Spyder, which uses an IPython console. There are some known issues right now between ray and an IPython console. Just make sure you delete the references inside your script, not via commands entered directly into the IPython console (otherwise the references will be removed but the items will not be removed from the object store). To inspect the object store using the `ray memory` command while your script is running, you can add some code at the end of your script like:

#%% Testing ray
# ... my ray testing code

#%% Clean-up object store data
print("Wait 10 sec BEFORE deletion")
time.sleep(10)  # Now quickly use the 'ray memory' command to inspect the contents of the object store.

del result_refs
del big_data_object_ref

print("Wait 10 sec AFTER deletion")
time.sleep(10)  # Now again use the 'ray memory' command to inspect the contents of the object store.
Sang
  • Sure @MarcoBonelli! Didn't know about this. Thanks for letting me know that :). I modified my answer. – Sang Apr 01 '21 at 22:55
  • I modified the answer (which right now still has to be approved) to include all the relevant details we talked about on the GitHub. These details may help other users to learn Ray faster. – Stefan Apr 02 '21 at 09:01