
I am using Python 3 with Dask to read a list of parquet files, do some processing, and then write everything into a single combined parquet file for later use.

The process uses so much memory that it seems to be reading all the parquet files into memory before writing them to the new parquet file.

I am using the following code:

import dask.bag as bag
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

def t(path):
    ddf = dd.read_parquet(path)
    ddf["file"] = path
    return ddf

b = bag.from_sequence(parquet_files)
with ProgressBar():
    data = b.map(lambda x: t(x)).\
        map(lambda y: dd.to_parquet(y, output_parquet_file, partition_on=["file"], append=True, engine="fastparquet")).\
        compute(num_workers=1)

The memory explodes every time when using one worker, and especially when using more. The files are big (about 1 GB each). I also tried reading the data from CSV files and breaking them into 25 MB blocks, and got the same issue.

What am I missing here? Why does it try to load everything into memory when an iterative process seems like the right thing to do? How can I use Dask operations to do this without blowing past the 128 GB of memory I have on that machine?

PS: I tried using the pyarrow engine, but the problem is that append is not yet implemented for it in Dask.

Edit: I tried the suggested solution. I am now using this code:

import dask.dataframe as dd
from dask.diagnostics import ProgressBar

with ProgressBar():
    dfs = [dd.read_parquet(pfile) for pfile in parquet_files]
    for i, path in enumerate(parquet_files):
        dfs[i]["file"] = path
    df = dd.concat(dfs)
    df.to_parquet(output_parquet_file)

and still, memory explodes (on a system with more than 200 GB of memory).

thebeancounter

2 Answers


It is odd to use dask collection methods within the map on another collection. You could use bag.map like this and call the fastparquet functions directly, or, perhaps better (depending on what processing you need to do), use the dataframe API for everything:

dfs = [dd.read_parquet(pfile, ...) for pfile in parquet_files]
df = dd.concat(dfs)
df.to_parquet(...)

Note that, although you are trying to append to a single file (I think), the parquet format doesn't really benefit from that and you would do as well to let Dask write a file per partition.
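A minimal sketch of that dataframe-API approach, assuming the fastparquet engine and keeping the "file" column from the question; parquet_files is the question's list of input paths and "output_parquet_dir" is a placeholder output directory, not a name from the original post:

import dask.dataframe as dd
from dask.diagnostics import ProgressBar

# Lazily open each source file and tag its rows with the originating path;
# no data is read into memory at this point.
dfs = [
    dd.read_parquet(pfile, engine="fastparquet").assign(file=pfile)
    for pfile in parquet_files
]

# Concatenate the lazy frames; the result has roughly one partition per file.
df = dd.concat(dfs)

# Let Dask write one part file per partition into a directory,
# rather than appending everything into a single file.
with ProgressBar():
    df.to_parquet("output_parquet_dir", engine="fastparquet")

Because nothing is appended to an existing file, each partition can be processed and written independently, which is what keeps memory use bounded.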

mdurant

Dask supports reading multiple parquet files as partitions. Just call read_parquet on the glob directly.

import dask.dataframe as dd

# Each input file becomes one or more partitions; nothing is loaded yet.
ddf = dd.read_parquet("parquets/*.parquet")
# Example processing, applied lazily to each partition.
ddf = ddf.map_partitions(lambda df: df * 2)
# Writing triggers the computation; partitions are processed independently.
ddf.to_parquet("result.parquet")
张云辉