I am using Modin in combination with Ray to read a huge CSV file (56 GB, 1.5 billion rows). I sorted the data beforehand using Linux sort.
The following code results in multiple workers being killed due to out-of-memory pressure, and I doubt that the computation is efficient or will ever run through.
I am using Ray on a single local machine with 48 cores and 126 GB of RAM.
How would I tackle this issue efficiently? Unfortunately, I cannot access the web interface (Ray dashboard) to check on things, since everything runs on a headless Ubuntu Server installation and the dashboard port is not reachable through the firewall.
Code:
import modin.pandas as pd
import ray
ray.init()
# first column becomes the index (named "1"); the remaining data column is "2"
df = pd.read_csv("./file", index_col=0, header=None, names=["1", "2"])
# sum column "2" per value of the index "1"
df.groupby('1').sum()
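As a point of comparison, the kind of chunked plain-pandas fallback I have in mind looks roughly like this (the chunksize below is just a guess on my part, not a tested value); I would prefer a Modin/Ray solution if the groupby can be made to fit in memory:

import pandas as pd  # plain pandas, not modin

totals = None
# Stream the file in pieces and accumulate the partial group sums.
for chunk in pd.read_csv("./file", index_col=0, header=None,
                         names=["1", "2"], chunksize=10_000_000):
    partial = chunk.groupby("1").sum()
    totals = partial if totals is None else totals.add(partial, fill_value=0)

print(totals)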
RayContext:
RayContext(dashboard_url='127.0.0.1:8265', python_version='3.8.10', ray_version='2.3.0', ray_commit='cf7a56b4b0b648c324722df7c99c168e92ff0b45', address_info={'node_ip_address': 'XXXXXXX', 'raylet_ip_address': 'XXXXXXX', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2023-04-18_10-24-54_203554_3133/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2023-04-18_10-24-54_203554_3133/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/tmp/ray/session_2023-04-18_10-24-54_203554_3133', 'metrics_export_port': XXXXXX, 'gcs_address': 'XXXX', 'address': 'XXXXXXX', 'dashboard_agent_listen_port': XXXXXX, 'node_id': 'XXXXXXXXXXXXXXXXXXXXXXXXXX'})
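I have also been wondering whether limiting Ray's memory footprint at init time would help, e.g. something along these lines (the values are guesses, I do not know what is sensible for this workload):

import ray

# Rough idea only: cap the worker count and the plasma object store size
# so less of the 126 GB is claimed up front (numbers below are guesses).
ray.init(
    num_cpus=24,                       # fewer concurrent workers
    object_store_memory=40 * 1024**3,  # ~40 GB object store
)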