
I have an embarrassingly parallel workload where I read groups of parquet files, concatenate each group into a bigger parquet file, and write it back to disk. I am running this on a distributed cluster (with a distributed filesystem) with roughly 300 workers, each with 20 GB of RAM. Each individual piece of work should only consume 2-3 GB of RAM, yet somehow the workers are crashing with memory errors (I get a distributed.scheduler.KilledWorker exception). I can see the following in a worker's output log:

Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory. Process memory: 18.20 GB
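As far as I understand, this warning is related to the worker memory thresholds in the distributed configuration. The sketch below just shows those settings with their default fractions (I have not changed them), applied to the 20 GB per-worker limit:

import dask

# Default distributed.worker.memory thresholds, as fractions of the worker memory limit.
# With a 20 GB limit the worker pauses new tasks around 16 GB, and the nanny
# terminates it near 19 GB.
dask.config.set({
    'distributed.worker.memory.target': 0.60,     # start spilling managed data to disk
    'distributed.worker.memory.spill': 0.70,      # spill based on process memory
    'distributed.worker.memory.pause': 0.80,      # stop accepting new tasks
    'distributed.worker.memory.terminate': 0.95,  # nanny restarts the worker
})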

with open('ts_files_list.txt', 'r') as f:
    all_files = f.readlines()

# There are about 500K files
all_files = [f.strip() for f in all_files]

# grouping them into groups of 50. 
# The concatenated df should be about 1GB in memory
npart = 10000
file_pieces = np.array_split(all_files, npart)

def read_and_combine(filenames, group_name):
    dfs = [pd.read_parquet(f) for f in filenames]
    grouped_df = pd.concat(dfs)
    grouped_df.to_parquet(f, engine='pyarrow')

group_names = [f'group{i} for i in range(npart)]
delayed_func = dask.delayed(read_and_combine)

# the following line shouldn't have resulted in a memory error, but it does
dask.compute(map(delayed_func, file_pieces, group_names)) 

Am I missing something obvious here? Thanks!

Dask version: 2021.01.0, pyarrow version: 2.0.0, distributed version: 2021.01.0

rajendra
  • Are the files on the local filesystem? Please indicate your versions of dask, distributed and pyarrow. – mdurant Mar 08 '21 at 20:33
  • @mdurant The files are on a distributed filesystem of a supercomputer. Dask version: 2021.01.0, pyarrow version: 2.0.0, distributed version: 2021.01.0 – rajendra Mar 08 '21 at 20:48
  • I would try downgrading pyarrow or attempting with fastparquet before doing anything else. – mdurant Mar 11 '21 at 14:51
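For reference, trying fastparquet as suggested in the last comment is just an engine change in the read/write calls; a minimal sketch (the file names here are hypothetical, and fastparquet must be installed on the workers):

import pandas as pd

# same read/write as in the question, but with the fastparquet engine instead of pyarrow
df = pd.read_parquet('some_piece.parquet', engine='fastparquet')
df.to_parquet('group0', engine='fastparquet')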

1 Answer


There are a couple of errors in the snippet (the f-string in group_names is missing its closing quote, and to_parquet is passed f instead of group_name), but overall the workflow seems reasonable. Here is a corrected version:

import dask
import numpy as np
import pandas as pd

with open('ts_files_list.txt', 'r') as f:
    all_files = f.readlines()

# strip the trailing newlines from the ~500K paths
all_files = [f.strip() for f in all_files]

# split the paths into 10000 groups of ~50 files each
npart = 10000
file_pieces = np.array_split(all_files, npart)

def read_and_combine(filenames, group_name):
    # read every file in the group and concatenate into a single dataframe
    grouped_df = pd.concat(pd.read_parquet(f) for f in filenames)
    # write the combined dataframe under the group's name (not the loop variable f)
    grouped_df.to_parquet(group_name, engine='pyarrow')
    del grouped_df  # this is optional (in principle dask should clean up these objects)

group_names = [f'group{i}' for i in range(npart)]
delayed_func = dask.delayed(read_and_combine)

dask.compute(map(delayed_func, file_pieces, group_names))

One more thing to keep in mind is that parquet files are compressed by default, so once loaded they can occupy much more memory than their size on disk. I'm not sure whether this applies to your workflow, but it is worth checking when you run into memory problems.
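For example, a quick way to compare a file's size on disk with its size once loaded into pandas (a minimal sketch; the file name is hypothetical):

import os
import pandas as pd

path = 'some_piece.parquet'  # hypothetical file name
size_on_disk = os.path.getsize(path)                                   # compressed size
size_in_memory = pd.read_parquet(path).memory_usage(deep=True).sum()   # decompressed size
print(f'on disk: {size_on_disk / 1e6:.1f} MB, in memory: {size_in_memory / 1e6:.1f} MB')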

SultanOrazbayev