I am using Python 3 with dask to read a list of parquet files, do some processing, and then merge it all into one combined parquet file for later use.
The process uses so much memory that it seems to read all of the parquet files into memory before writing anything to the new parquet file.
I am using the following code:
import dask.bag as bag
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

def t(path):
    # read one parquet file lazily and tag every row with its source file
    ddf = dd.read_parquet(path)
    ddf["file"] = path
    return ddf

b = bag.from_sequence(parquet_files)
with ProgressBar():
    # for each file: build the dataframe, then append it to the combined parquet file
    data = b.map(lambda x: t(x)).\
        map(lambda y: dd.to_parquet(y, output_parquet_file, partition_on=["file"], append=True, engine="fastparquet")).\
        compute(num_workers=1)
The memory explodes every time, with one worker and even more so with several. The files are big (about 1 GB each). I also tried reading the data from CSV files and breaking them into 25 MB blocks, and got the same issue.
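For reference, the CSV variant was roughly of this shape (csv_files stands in for the list of input paths, and the per-file "file" column is left out here):

import dask.dataframe as dd

# read the CSVs in ~25 MB blocks instead of one partition per file
ddf = dd.read_csv(csv_files, blocksize=25_000_000)
ddf.to_parquet(output_parquet_file, engine="fastparquet")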
What am I missing here? Why does it try to load everything into memory when an iterative, file-by-file process seems like the right approach? How can I use dask operations to do this without blowing past the 128 GB of memory on that machine?
PS: I tried the pyarrow engine as well, but the problem is that append is not yet implemented for it in dask.
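To make the "iterative process" I have in mind concrete, this is a sketch of the one-file-at-a-time loop I expected the dask version to behave like (plain pandas/fastparquet, same variable names as above):

import pandas as pd
from fastparquet import write

# strictly sequential: read one file, tag it, append its rows to the output dataset
for i, path in enumerate(parquet_files):
    df = pd.read_parquet(path, engine="fastparquet")
    df["file"] = path
    # the first file creates the dataset, later files append new row groups
    write(output_parquet_file, df, file_scheme="hive", append=(i > 0))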
Edit: I tried the suggested solution. I am now running this code:
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

with ProgressBar():
    # read every parquet file lazily and tag each one with its source path
    dfs = [dd.read_parquet(pfile) for pfile in parquet_files]
    for i, path in enumerate(parquet_files):
        dfs[i]["file"] = path
    # concatenate everything and write a single combined parquet file
    df = dd.concat(dfs)
    df.to_parquet(output_parquet_file)
and still the memory explodes (on a system with more than 200 GB of memory).
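In case it helps to reproduce the problem, this second attempt can be condensed as below, pinned to the single-threaded scheduler (a sketch only, assuming a dask version that has dask.config):

import dask
import dask.dataframe as dd

with dask.config.set(scheduler="single-threaded"):
    # same idea as above: tag each file's rows, concatenate, write once
    dfs = [dd.read_parquet(p).assign(file=p) for p in parquet_files]
    dd.concat(dfs).to_parquet(output_parquet_file, engine="fastparquet")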