I have an embarrassingly parallel workload where I read groups of small parquet files, concatenate each group into a bigger parquet file, and write it back to disk. I am running this on a distributed cluster (with a distributed filesystem) of ~300 workers, each with 20 GB of RAM. Each individual piece of work should only consume 2-3 GB of RAM, but the workers keep crashing with memory errors (I get a distributed.scheduler.KilledWorker exception). I can see the following in the workers' output logs:
Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory. Process memory: 18.20 GB
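(For context, that message comes from the worker memory manager, which spills data to disk and eventually has the nanny kill the worker once memory crosses certain fractions of the 20 GB limit. I believe the thresholds in play are just the defaults; the sketch below only shows the relevant config keys in case it helps.)

import dask

# Worker memory thresholds, as fractions of the per-worker memory limit.
# These are what I understand to be the defaults, not values I have tuned.
dask.config.set({
    'distributed.worker.memory.target': 0.60,     # start spilling managed data to disk
    'distributed.worker.memory.spill': 0.70,      # spill more aggressively
    'distributed.worker.memory.pause': 0.80,      # stop accepting new tasks
    'distributed.worker.memory.terminate': 0.95,  # nanny kills the worker (-> KilledWorker)
})

Here is the code: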
import numpy as np
import pandas as pd
import dask

with open('ts_files_list.txt', 'r') as f:
    all_files = f.readlines()
# There are about 500K files
all_files = [f.strip() for f in all_files]
# grouping them into groups of 50.
# The concatenated df should be about 1GB in memory
npart = 10000
file_pieces = np.array_split(all_files, npart)
def read_and_combine(filenames, group_name):
    # read one group of ~50 small files and write them back out as a single parquet file
    dfs = [pd.read_parquet(fn) for fn in filenames]
    grouped_df = pd.concat(dfs)
    grouped_df.to_parquet(group_name, engine='pyarrow')
group_names = [f'group{i}' for i in range(npart)]
delayed_func = dask.delayed(read_and_combine)
# the following line shouldn't have resulted in a memory error, but it does
dask.compute(*map(delayed_func, file_pieces, group_names))
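To sanity-check the 2-3 GB per-task estimate, one group can be read locally, outside of dask (just an illustrative check, not part of the job):

sample = file_pieces[0]
sample_df = pd.concat(pd.read_parquet(fn) for fn in sample)
print(sample_df.memory_usage(deep=True).sum() / 1e9, 'GB in memory')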
Am I missing something obvious here? Thanks!
Dask version: 2021.01.0, pyarrow version: 2.0.0, distributed version: 2021.01.0