I am new to Dask and exported a pandas DataFrame to Parquet with row groups:

x.to_parquet(path + 'ohlcv_TRX-PERP_978627_rowgrouped.prq', row_group_size=1000)

Then I tried to load it with Dask, which seems to work correctly(?):

x = dd.read_parquet(path + 'ohlcv_TRX-PERP_978627_rowgrouped.prq')
x

# Note: The dataframe has almost 2000 columns, I clipped them for here
Dask DataFrame Structure:
                        open     h
npartitions=978                   
2019-07-21 23:55:00  float64  floa
2019-07-22 16:35:00      ...      
                      ...      ...
2021-05-30 17:06:00      ...      
2021-05-31 03:32:00      ...      
Dask Name: read-parquet, 978 tasks

So far, no issues. But when I call x.max().compute() on it, Dask seems to load the entire dataset into RAM (at least RAM usage ramps up like crazy) and then crashes. Inspecting just max() without calling compute():

x = x.max()
x

Dask Series Structure:
npartitions=1
ACCBL_10    float64
volume          ...
dtype: float64
Dask Name: dataframe-max-agg, 1957 tasks

According to the Dask tutorial (https://tutorial.dask.org/04_dataframe.html#Computations-with-dask.dataframe), my understanding is that this should work just fine(?)

It also goes out of memory when I try to call max() only on one column:

x.open.max().compute()

Am I doing something wrong or is that how it's supposed to work and I have to do something differently?

I now also tried the distributed scheduler with a memory limit on the Client, but Dask again eats 24GB of RAM and just prints a warning that the worker is far exceeding the set memory limit:

from dask.distributed import Client
import dask.dataframe as dd

if __name__ == '__main__':

    client = Client(processes=False, memory_limit='5GB')

    x = dd.read_parquet(path + 'ohlcv_TRX-PERP_978627_rowgrouped.prq')
    print(x)
    s = x.max().compute()
    print(s)


distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 24.13 GB -- Worker memory limit: 5.00 GB
  • Do you know if you are using fastparquet or pyarrow as the parquet engine? You may want to try the other. To get max of only one column, providing `columns=` to `read_parquet` will make sure you don't load unnecessary data. How many cores do you have? Try running with fewer workers than cores. – mdurant Jun 18 '21 at 20:47
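
A minimal sketch of the column-selection idea from the comment above (the engine choice and the column list here are assumptions, adjust them to your setup):

import dask.dataframe as dd

# Read only the column needed for the max; engine='pyarrow' is just one option,
# the comment suggests trying whichever engine you are not currently using.
x = dd.read_parquet(
    path + 'ohlcv_TRX-PERP_978627_rowgrouped.prq',
    engine='pyarrow',
    columns=['open'],
)
print(x['open'].max().compute())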

1 Answer


If possible, I would save the parquet into multiple files (the right size depends on your hardware, but around 100-200 MB per partition is a good target on a laptop). If that is not an option, then try the following:

x.open.max(split_every=2).compute()

This asks Dask to compute the max value for each partition and then combine the results two partitions at a time, which reduces the memory footprint at the expense of running more tasks. You can play around with the split_every number to see if a higher value is tolerable on your hardware, but hopefully 2 will work.
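
If you can go the multiple-files route, a minimal sketch of rewriting the single file into several partitions (the target partition size and the output directory name are placeholders, not anything from the original post):

import dask.dataframe as dd

x = dd.read_parquet(path + 'ohlcv_TRX-PERP_978627_rowgrouped.prq')
# This still has to read the data once, but afterwards every computation
# works on reasonably sized partitions.
x = x.repartition(partition_size='150MB')
x.to_parquet(path + 'ohlcv_TRX-PERP_978627_repartitioned/')  # one file per partition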

Also, if you intend to work with a single file, you might get better performance with vaex, see this comparison.

SultanOrazbayev
  • Thanks, I tried this, but the behaviour seems similar - Dask is just hogging 20GB of RAM. Note that I can load the DataFrame in pandas and do the calculation - it will exceed RAM and swap just barely, but it will finish the computation. So I am even more confused about what is going on with Dask. :/ – no-trick-pony Jun 18 '21 at 16:02
  • Not sure, I used `row_group_size` only without `dask`... if you can load the data into pandas, then you can also create a dask dataframe with `dd.from_pandas` and then persist it with `ddf.to_parquet()`. – SultanOrazbayev Jun 18 '21 at 16:06
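
A minimal sketch of that last suggestion, assuming the data still fits into pandas as described (the npartitions value and the output path are placeholders):

import pandas as pd
import dask.dataframe as dd

df = pd.read_parquet(path + 'ohlcv_TRX-PERP_978627_rowgrouped.prq')  # load once with pandas
ddf = dd.from_pandas(df, npartitions=100)                            # placeholder partition count
ddf.to_parquet(path + 'ohlcv_TRX-PERP_978627_partitioned/')          # persist as multiple files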