
I would like to process some textual data with “sentence-transformers” (i.e. generate embeddings for the text) on multiple GPUs (2 T4s, 15 GB per GPU) and 16 vCPUs (60 GB RAM) on GCP, from a Jupyter notebook.

The data size is not large, but the worker nodes kept restarting due to a memory leak, even though the memory trim threshold (MALLOC_TRIM_THRESHOLD_) was set from the shell.

My code:

# run export MALLOC_TRIM_THRESHOLD_=65536 from shell before starting dask cluster

!pip install sentence-transformers
import os
import glob
import numpy as np
import gc

import cudf
import dask_cudf
import cupy
import rmm


from dask.distributed import Client, wait, get_worker, get_client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1", n_workers=2, threads_per_worker=4, memory_limit="15GB",\
                           device_memory_limit="24GB", rmm_pool_size="4GB", rmm_maximum_pool_size="15GB") 

client = Client(cluster)

print(client.run(os.getenv, "MALLOC_TRIM_THRESHOLD_")) # 65536

initial_pool_size = 4*10**9
maximum_pool_size = 15*10**9
rmm.reinitialize(pool_allocator=True, managed_memory=True, initial_pool_size=initial_pool_size, 
                 maximum_pool_size=maximum_pool_size, devices=[0,1], logging=True, log_file_name='./tmp/logs/test_sbert_distributed.log')



import dask.dataframe as dd
import pandas as pd
from dask.multiprocessing import get
import random

df = pd.DataFrame({'col_1': ["This is sentence " + str(x) for x in random.sample(range(10**7), 10**7)], 
                             'col_2': ["That is another sentence " + str(x) for x in random.sample(range(10**7), 10**7)]})

cudf_df = cudf.DataFrame.from_pandas(df)
dask_df = dask_cudf.from_cudf(cudf_df, npartitions=8)


from sentence_transformers import SentenceTransformer
import numpy as np

sbert_model = SentenceTransformer('all-MiniLM-L6-v2')


def test_f_str(df, args):
    col1, col2, chunks = args

    for col in [col1, col2]:
        emb = sbert_model.encode(sentences=df[col].to_arrow().to_pylist(), batch_size=1250, show_progress_bar=True)
        semb = np.array([str(x) for x in emb])
        df[col+'_emb'] = semb
    return df


chunks = dask_df.map_partitions(lambda x: len(x)).compute().to_numpy()
print(chunks, type(chunks))

[1250000 1250000 1250000 1250000 1250000 1250000 1250000 1250000] <class 'numpy.ndarray'>

dask_df.npartitions, dask_df.persist()
(8, <dask_cudf.DataFrame | 8 tasks | 8 npartitions>)

new_dask_df = dask_df.map_partitions(test_f_str, 
                                 args=('col_1', 'col_2', chunks),\
                                 meta={'col_1':'object',\
                                       'col_2':'object',\
                                       'col_1_emb':'object',\
                                       'col_2_emb':'object'}) 

new_dask_df.dtypes

col_1        object
col_2        object
col_1_emb    object
col_2_emb    object
dtype: object


new_dask_df.compute() # error: WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory 
#  may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. 
#  -- Unmanaged memory: 9.96 GiB -- Worker memory limit: 13.97 GiB

I have tried many of the suggested solutions, but none of them helped:

    https://www.coiled.io/blog/tackling-unmanaged-memory-with-dask
    https://stackoverflow.com/questions/71203077/why-does-dask-distributed-auto-memory-trimming-not-work
    https://github.com/dask/distributed/issues/5971
    https://stackoverflow.com/questions/72180961/dask-memory-leak-workaround
    https://stackoverflow.com/questions/58275476/dask-distributed-workers-always-leak-memory-when-running-many-tasks
    https://distributed.dask.org/en/stable/worker-memory.html

Could anybody point out what I missed here?

============== UPDATE ===================

I am using the dashboards from https://developer.nvidia.com/blog/gpu-dashboards-in-jupyter-lab/, but the “Workers Memory” (bytes stored per worker) plots didn’t show any “unmanaged” or “leaked” memory.

However, the “GPU memory” plots turned orange and showed memory spilling, per https://distributed.dask.org/en/stable/worker-memory.html#using-the-dashboard-to-monitor-memory-usage.

Please let me know how to confirm whether the leak is in CPU (host) memory or GPU memory.

mtnt
  • It has 15GB per GPU. – mtnt Jun 26 '23 at 23:56
  • I have updated OP by adding the missing two lines of the code "cudf_df = cudf.DataFrame.from_pandas(df) dask_df = dask_cudf.from_cudf(cudf_df, npartitions=8)" – mtnt Jun 27 '23 at 16:14

1 Answer

Your error suggests high memory use for system RAM, not GPU. Although there is a lot of code (too much to follow easily), you don't show how dask_df is created.
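If you want to double-check where the memory is actually going, a rough sketch like the following (assuming psutil is installed; client is the client from your snippet) reports each worker's host RSS next to the memory in use on its GPU:

def host_and_gpu_mem():
    import psutil
    import cupy
    # Host RAM held by this worker process: the number the "unmanaged memory" warning refers to.
    rss_gib = psutil.Process().memory_info().rss / 2**30
    # Memory in use on the GPU this worker is bound to.
    free_b, total_b = cupy.cuda.runtime.memGetInfo()
    return {"host_rss_GiB": round(rss_gib, 2),
            "gpu_used_GiB": round((total_b - free_b) / 2**30, 2)}

print(client.run(host_and_gpu_mem))  # one entry per worker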

I will note that the following line is problematic:

emb_array = dask.array.from_array(semb, chunks=chunks)

This is happening in the context of map_partitions, so the input here should be pandas/cudf, and the output pandas/cudf or numpy/cupy. You should not be calling the dask API from within a function meant to be run as a task by dask.
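Roughly, the partition function should take a cudf DataFrame and return one, with nothing dask-specific inside. A sketch along the lines of your code (the function name and batch size are mine; loading the model inside the function is just one way to make sure it exists on the worker, and caching it on the worker, e.g. via get_worker(), would avoid reloading it for every partition):

from sentence_transformers import SentenceTransformer

def encode_partition(part):
    # Created on the worker rather than shipped from the client.
    model = SentenceTransformer('all-MiniLM-L6-v2')
    for col in ['col_1', 'col_2']:
        emb = model.encode(part[col].to_arrow().to_pylist(),
                           batch_size=256, show_progress_bar=False)
        # Plain python/cudf output; no dask calls in here.
        part[col + '_emb'] = [str(x) for x in emb]
    return part

new_dask_df = dask_df.map_partitions(
    encode_partition,
    meta={'col_1': 'object', 'col_2': 'object',
          'col_1_emb': 'object', 'col_2_emb': 'object'})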

Further, the whole dataframe is made in the client (using python string objects and lists!) instead of chunk-wise in worker tasks, which is a definite anti-pattern. Instead, build it with IO functions such as dask_cudf.read_parquet, or with dask.dataframe.from_map, as sketched below. You also call .persist(), which seems like a bad idea if you have memory troubles: with it in place, all memory is allocated up-front, but without it you would load each chunk, process it, and then release its memory (assuming you follow the recommendation above about building the dataframe chunk-wise).
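For example, generating the data chunk-wise with from_map (or reading it with dask_cudf.read_parquet if it already lives on disk or in a bucket) means each partition is built by a task on a worker instead of on the client. A sketch, with the partition count and sizes copied from your example:

import cudf
import dask.dataframe as dd

def make_part(i, n=1_250_000):
    # Each call runs as a task and builds one partition directly on a worker.
    start = i * n
    return cudf.DataFrame({
        'col_1': ["This is sentence " + str(x) for x in range(start, start + n)],
        'col_2': ["That is another sentence " + str(x) for x in range(start, start + n)],
    })

dask_df = dd.from_map(
    make_part, range(8),
    meta=cudf.DataFrame({'col_1': cudf.Series([], dtype='str'),
                         'col_2': cudf.Series([], dtype='str')}))

# Or, for data that already exists as files (path is a placeholder):
# dask_df = dask_cudf.read_parquet("path/to/sentences/*.parquet")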

Finally, you .compute() the whole thing, producing a bunch of objects all over the place. This is surely not the final point in the processing, so why do a compute? It copies every piece back to the client and concatenates them into a single dataframe: very memory-wasteful unless you have already done a lot of aggregation.
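If the point of the compute was only to confirm that the pipeline runs, trigger it with something that does not pull every partition back to the client, for example (the output path is a placeholder):

# Writes each partition from its worker; nothing is concatenated on the client.
new_dask_df.to_parquet("./embeddings_out")

# Or force full execution with a small result:
print(len(new_dask_df))      # runs every partition, returns a single integer
print(new_dask_df.head())    # cheap, but only touches the first partition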

mdurant
  • Thanks for the help. I have updated OP by adding the missing two lines of the code "cudf_df = cudf.DataFrame.from_pandas(df) dask_df = dask_cudf.from_cudf(cudf_df, npartitions=8)" – mtnt Jun 27 '23 at 16:13
  • Please let me know how to create a data frame "chunk-wise in worker tasks"? If I don't use "chunks", I get an error. – mtnt Jun 27 '23 at 16:24
  • It seems that "persist()" may help allocate each partition to each worker (GPU)? Please let me know how to make the allocation efficient without memory leakage? Thanks – mtnt Jun 27 '23 at 16:25
  • The VM has 60 GB RAM and 16 vCPUs; do you mean the OOM error happened in CPU/RAM, not in GPU memory? Please let me know how to differentiate them. – mtnt Jun 27 '23 at 16:48
  • I see no reason to think the warning was related to the GPU. You should watch the processes on the dashboard to see what memory is being used when. Added some other details. – mdurant Jun 27 '23 at 17:44
  • I have updated OP about how I used the dask dashboard to monitor memory usage. – mtnt Jun 27 '23 at 17:56
  • "compute()" is used to confirm that the pattern can work well without memory issues. Please let me know if there are better ways to do this ? thanks – mtnt Jun 27 '23 at 19:53
  • Perhaps `df.head()` or `len(df)` ? – mdurant Jun 27 '23 at 19:58
  • if I used "df.head()", only one GPU was busy and all other GPUs were idle. If I used "df.compute()", all GPUs were very busy (>95% utilization) but it finally caused memory leakage. – mtnt Jun 27 '23 at 20:41
  • and my other suggestion? – mdurant Jun 28 '23 at 01:05
  • "len(df)" works but still got "memory leakage", which enforced worker nodes to restart. It seems that the workers restarted in a cycle whenever a memory threshold was reached. – mtnt Jun 28 '23 at 22:33
  • Then please look into the rest of the answer, concerning how the dataframe was made and use of dask.array within your function. – mdurant Jun 29 '23 at 01:19
  • I have replaced "dask.array" with "numpy.array", but still got the memory leakage error. – mtnt Jun 29 '23 at 15:57