RAY workers being killed because of OOM pressure

Question

I am using modin in combination with ray to read a huge csv file (56GB with 1,5 billion rows). I sorted the data beforehand using linux sort.

The following code results in multiple workers being killed due to out of memory pressure and I doubt that the computation is efficient / will ever run through.

I am using Ray on a local machine with 48 cores and 126GB of RAM.

How would I tackle this issue efficiently? Unfortunately I cannot access the web-interface to check on things since it is hosted on a ubuntu server version with no access through the firewall.

Code:

import modin.pandas as pd
import ray
ray.init()

df = pd.read_csv("./file", index_col=0, header=None, names=["1", "2"])

df.groupby('1').sum()

RayContext:

RayContext(dashboard_url='127.0.0.1:8265', python_version='3.8.10', ray_version='2.3.0', ray_commit='cf7a56b4b0b648c324722df7c99c168e92ff0b45', address_info={'node_ip_address': 'XXXXXXX', 'raylet_ip_address': 'XXXXXXX', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2023-04-18_10-24-54_203554_3133/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2023-04-18_10-24-54_203554_3133/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/tmp/ray/session_2023-04-18_10-24-54_203554_3133', 'metrics_export_port': XXXXXX, 'gcs_address': 'XXXX', 'address': 'XXXXXXX', 'dashboard_agent_listen_port': XXXXXX, 'node_id': 'XXXXXXXXXXXXXXXXXXXXXXXXXX'})

score 1 · Answer 1 · answered Apr 18 '23 at 13:04

I'm one of the core developers of Modin.

In the current implementation of read_csv function in Modin, I can expect up to about 2x memory overhead when reading in some cases. In your case, the peak memory consumption is already close to the maximum amount of your available memory.

What I would like to suggest to you:

It's better not to initialize the engine (Ray) yourself, Modin will do it itself the first time you run read_csv function. Have you had problems with this? I believe that in some cases, this can help, as Modin tries to find the best initialization parameters for it work.
Only as a workaround, you can try to read the file with pandas, and then build the Modin dataframe (pd.DataFrame(pandas.read_csv(...))).
If you have the opportunity, then I suggested increasing the available memory up to 256GB.

From our side, we will try to look for solutions to reduce memory consumption: https://github.com/modin-project/modin/issues/6018

Thank you for using Modin!

RAY workers being killed because of OOM pressure

1 Answers1