I am trying out the promising multiprocessing package ray
. I have a problem I seem not be able to solve. My program runs fine the first time, but at a second run this Exception is raised on the ray.put()
line:
ObjectStoreFullError: Failed to put object ffffffffffffffffffffffffffffffffffffffff010000000c000000 in object store because it is full. Object size is 2151680255 bytes.
The local object store is full of objects that are still in scope and cannot be evicted. Tip: Use the `ray memory` command to list active objects in the cluster.
What do I want to do:
In my actual code (I'm planning to write) I need to process many big_data_objects
sequentially. I want to hold one big_data_object
in memory at a time and do several heavy (independent) computations on the big data object. I want to execute these computations in parallel. When these are done, I have to replace these big_data_object
in the object store by new ones and start the computations (in parallel) again.
Using my test script I simulate this by starting the script again without ray.shutdown()
. If I shutdown ray
using ray.shutdown()
the object store is cleared but then reinitializing takes a long time and I cannot process multiple big_data_object
sequentially as I want to.
What sources of information have I studied: I studied this document Ray Design Patterns and studied the section 'Antipattern: Closure capture of large / unserializable object' and how to the proper pattern(s) should look like. I also studied the getting started guide which lead to the following test script.
A minimum example to reproduce the problem: I created a test script to test this. It is this:
#%% Imports
import ray
import time
import psutil
import numpy as np
#%% Testing ray
# Start Ray
num_cpus = psutil.cpu_count(logical=False)
if not ray.is_initialized():
ray.init(num_cpus=num_cpus, include_dashboard=False)
# Define function to do work in parallel
@ray.remote
def my_function(x): # Later I will have multiple (different) my_functions to extract different feature from my big_data_object
time.sleep(1)
data_item = ray.get(big_data_object_ref)
return data_item[0,0]+x
# Define large data
big_data_object = np.random.rand(16400,16400) # Define an object of approx 2 GB. Works on my machine (16 GB RAM)
# big_data_object = np.random.rand(11600,11600) # Define an object of approx 1 GB.
# big_data_object = np.random.rand(8100,8100) # Define an object of approx 500 MB.
# big_data_object = np.random.rand(5000,5000) # Define an object of approx 190 MB.
big_data_object_ref = ray.put(big_data_object)
# Start 4 tasks in parallel.
result_refs = []
# for item in data:
for item in range(4):
result_refs.append(my_function.remote(item))
# Wait for the tasks to complete and retrieve the results.
# With at least 4 cores, this will take 1 second.
results = ray.get(result_refs)
print("Results: {}".format(results))
#%% Clean-up object store data - Still their is a (huge) memory leak in the object store.
for index in range(4):
del result_refs[0]
del big_data_object_ref
Where do I think it's going wrong:
I think I delete all the references to the object store in the end of the script. As a result the objects should be cleared from the object store (as described here). Apparently, something is wrong because the big_data_object
remains in the object store. The results
are deleted from the object store just fine, however.
Some debug information:
I inspected the object store using ray memory
command, this is what I get:
(c:\python\cenv38rl) PS C:\WINDOWS\system32> ray memory
---------------------------------------------------------------------------------------------------------------------
Object ID Reference Type Object Size Reference Creation Site
=====================================================================================================================
; worker pid=20952
ffffffffffffffffffffffffffffffffffffffff010000000b000000 LOCAL_REFERENCE 2151680261 c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; worker pid=29368
ffffffffffffffffffffffffffffffffffffffff010000000b000000 LOCAL_REFERENCE 2151680261 c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; worker pid=17388
ffffffffffffffffffffffffffffffffffffffff010000000b000000 LOCAL_REFERENCE 2151680261 c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; worker pid=24208
ffffffffffffffffffffffffffffffffffffffff010000000b000000 LOCAL_REFERENCE 2151680261 c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; worker pid=27684
ffffffffffffffffffffffffffffffffffffffff010000000b000000 LOCAL_REFERENCE 2151680261 c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; worker pid=6860
ffffffffffffffffffffffffffffffffffffffff010000000b000000 LOCAL_REFERENCE 2151680261 c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; driver pid=28684
ffffffffffffffffffffffffffffffffffffffff010000000b000000 LOCAL_REFERENCE 2151680261 c:\python\cenv38rl\lib\site-packages\ray\worker.py:put_object:277 | c:\python\cenv38rl\lib\site-packages\ray\worker.py:put:1489 | c:\python\cenv38rl\lib\site-packages\ray\_private\client_mode_hook.py:wrapper:47 | C:\Users\Stefan\Documents\Python examples\Multiprocess_Ray3_SO.py:<module>:42
---------------------------------------------------------------------------------------------------------------------
--- Aggregate object store stats across all nodes ---
Plasma memory usage 2052 MiB, 1 objects, 77.41% full
Some of the things I have tried:
If, I replace my_function
for:
@ray.remote
def my_function(x): # Later I will have multiple different my_functions to extract separate feature from my big_data_objects
time.sleep(1)
# data_item = ray.get(big_data_object_ref)
# return data_item[0,0]+x
return 5
and then the script successfully clears the object store but my_function
cannot use the big_data_object
which I need it to.
My question is: How to fix my code so that the big_data_object
is removed from the object store at the end of my script without shutting down ray?
Note: I installed ray using pip install ray
which gave me version ray==1.2.0
which I am using now. I use ray on Windows
and I develop in Spyder v4.2.5
in a conda
(actually miniconda) environment in case it is relevant.
EDIT:
I have tested also on a Ubuntu machine with 8GB RAM. For this I used the big_data_object
of 1GB.
I can confirm the issue also occurs on this machine.
The ray memory
output:
(SO_ray) stefan@stefan-HP-ZBook-15:~/Documents/Ray_test_scripts$ ray memory
---------------------------------------------------------------------------------------------------------------------
Object ID Reference Type Object Size Reference Creation Site
=====================================================================================================================
; worker pid=18593
ffffffffffffffffffffffffffffffffffffffff0100000001000000 LOCAL_REFERENCE 1076480259 /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/function_manager.py:fetch_and_register_remote_function:180 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_process_key:140 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_run:87 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/threading.py:run:870
; worker pid=18591
ffffffffffffffffffffffffffffffffffffffff0100000001000000 LOCAL_REFERENCE 1076480259 /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/function_manager.py:fetch_and_register_remote_function:180 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_process_key:140 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_run:87 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/threading.py:run:870
; worker pid=18590
ffffffffffffffffffffffffffffffffffffffff0100000001000000 LOCAL_REFERENCE 1076480259 /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/function_manager.py:fetch_and_register_remote_function:180 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_process_key:140 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_run:87 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/threading.py:run:870
; driver pid=17712
ffffffffffffffffffffffffffffffffffffffff0100000001000000 LOCAL_REFERENCE 1076480259 (put object) | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/_private/client_mode_hook.py:wrapper:47 | /home/stefan/Documents/Ray_test_scripts/Multiprocess_Ray3_SO.py:<module>:43 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/spyder_kernels/customize/spydercustomize.py:exec_code:453
; worker pid=18592
ffffffffffffffffffffffffffffffffffffffff0100000001000000 LOCAL_REFERENCE 1076480259 /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/function_manager.py:fetch_and_register_remote_function:180 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_process_key:140 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_run:87 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/threading.py:run:870
---------------------------------------------------------------------------------------------------------------------
--- Aggregate object store stats across all nodes ---
Plasma memory usage 1026 MiB, 1 objects, 99.69% full
I have to run the program in Spyder
so that after execution of the program I can inspect the object store's memory using ray memory
. If I run the program in PyCharm
for example, ray
is automatically terminated as the script is completed so I cannot check if my script clears the object store as intended.