
I am using Python 3 with Dask to read a list of parquet files, do some processing, and then write everything into a single combined parquet file for later use.

The process uses so much memory that it seems to be reading all the parquet files into memory before writing them to the new parquet file.

I am using the following code:

import dask.bag as bag
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

def t(path):
    ddf = dd.read_parquet(path)
    ddf["file"] = path
    return ddf

b = bag.from_sequence(parquet_files)
with ProgressBar():
    data = b.map(lambda x: t(x)).\
        map(lambda y: dd.to_parquet(y, output_parquet_file, partition_on=["file"], append=True, engine="fastparquet")).\
        compute(num_workers=1)

The memory explodes every time when using one worker, and especially when using more. The files are big (about 1 GB each). I also tried reading the data from CSV files and breaking them into 25 MB blocks, and got the same issue.

What am I missing here? Why does it try to load everything into memory when an iterative process seems like the right thing to do? How can I use Dask operations to do this without blowing past the 128 GB of memory I have on that machine?

PS: I tried using the pyarrow engine, but the problem is that append is not yet implemented for it in Dask.

Edit: I tried the suggested solution. I am now using this code:

import dask.dataframe as dd
from dask.diagnostics import ProgressBar

with ProgressBar():
    dfs = [dd.read_parquet(pfile) for pfile in parquet_files]
    for i, path in enumerate(parquet_files):
        dfs[i]["file"] = path
    df = dd.concat(dfs)
    df.to_parquet(output_parquet_file)

and still, memory explodes (on a system with more than 200 GB of memory).

thebeancounter

2 Answers


It is odd to use dask collection methods within the map on another collection. You could use bag.map like this and call the fastparquet functions directly, or, perhaps better (depending on what processing you need to do), use the dataframe API for everything:

dfs = [dd.read_parquet(pfile, ...) for pfile in parquet_files]
df = dd.concat(dfs)
df.to_parquet(...)

Note that, although you are trying to append to a single file (I think), the parquet format doesn't really benefit from that and you would do as well to let Dask write a file per partition.
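A minimal sketch of that dataframe-API approach, assuming the fastparquet engine and keeping the "file" column from the question; parquet_files is the question's list of input paths and "output_parquet_dir" is a placeholder output directory, not a name from the original post:

import dask.dataframe as dd
from dask.diagnostics import ProgressBar

# Lazily open each source file and tag its rows with the originating path;
# no data is read into memory at this point.
dfs = [
    dd.read_parquet(pfile, engine="fastparquet").assign(file=pfile)
    for pfile in parquet_files
]

# Concatenate the lazy frames; the result has roughly one partition per file.
df = dd.concat(dfs)

# Let Dask write one part file per partition into a directory,
# rather than appending everything into a single file.
with ProgressBar():
    df.to_parquet("output_parquet_dir", engine="fastparquet")

Because nothing is appended to an existing file, each partition can be processed and written independently, which is what keeps memory use bounded.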

mdurant

Dask supports reading multiple parquet files as partitions. Just call read_parquet on the glob directly.

import dask.dataframe as dd

# Each input file becomes one or more partitions; nothing is loaded yet.
ddf = dd.read_parquet("parquets/*.parquet")
# Example processing, applied lazily to each partition.
ddf = ddf.map_partitions(lambda df: df * 2)
# Writing triggers the computation; partitions are processed independently.
ddf.to_parquet("result.parquet")
张云辉