I am new to Dask and exported a pandas DataFrame to Parquet with row groups:
x.to_parquet(path + 'ohlcv_TRX-PERP_978627_rowgrouped.prq', row_group_size=1000)
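To make sure the row groups actually ended up in the file, a metadata check like the following should show roughly 1000 rows per group (a sketch, assuming pyarrow is the Parquet engine pandas used here):

import pyarrow.parquet as pq

# Sketch: inspect the Parquet metadata to confirm the row group layout.
pf = pq.ParquetFile(path + 'ohlcv_TRX-PERP_978627_rowgrouped.prq')
print(pf.metadata.num_row_groups)         # expect roughly 979 groups for 978627 rows
print(pf.metadata.row_group(0).num_rows)  # expect 1000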
Then I tried to load it with Dask, which seems to work correctly(?):
x = dd.read_parquet(path + 'ohlcv_TRX-PERP_978627_rowgrouped.prq')
x
# Note: the DataFrame has almost 2000 columns; I clipped the output here
Dask DataFrame Structure:
open h
npartitions=978
2019-07-21 23:55:00 float64 floa
2019-07-22 16:35:00 ...
... ...
2021-05-30 17:06:00 ...
2021-05-31 03:32:00 ...
Dask Name: read-parquet, 978 tasks
So far, no issues. But when I call x.max().compute() on it, Dask seems to load the entire dataset into RAM (at least RAM usage ramps up like crazy) and then crashes. Looking only at max():
x = x.max()
x
Dask Series Structure:
npartitions=1
ACCBL_10 float64
volume ...
dtype: float64
Dask Name: dataframe-max-agg, 1957 tasks
According to the Dask tutorial (https://tutorial.dask.org/04_dataframe.html#Computations-with-dask.dataframe), my understanding is that this should work just fine(?)
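To spell out my expectation: each partition's max should be computed independently, and only the 978 small per-partition results should need to be combined at the end. Written out manually, I imagine it looking roughly like this (just a sketch of what I think the graph does, not how Dask actually implements it):

# Sketch: one small max-Series per partition, then a cheap combine of the results.
df = dd.read_parquet(path + 'ohlcv_TRX-PERP_978627_rowgrouped.prq')
per_partition = df.map_partitions(lambda part: part.max())     # per-partition maxima
result = per_partition.compute().groupby(level=0).max()        # combine per column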
It also goes out of memory when I try to call max() on only one column:
x.open.max().compute()
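The only variant I have not tried yet is pushing the column selection into read_parquet itself via its columns argument, so that only that column should be read from the file (sketch, untested):

# Sketch: read just the 'open' column from the Parquet file, then reduce it.
only_open = dd.read_parquet(path + 'ohlcv_TRX-PERP_978627_rowgrouped.prq',
                            columns=['open'])
print(only_open.max().compute())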
Am I doing something wrong, or is this how it is supposed to work and I need to do something differently?
I have now also tried the distributed scheduler and limited the Client to 5 GB, but again Dask eats 24 GB of RAM and just prints a warning that the worker is far exceeding the configured memory limit:
from dask.distributed import Client
import dask.dataframe as dd

if __name__ == '__main__':
    client = Client(processes=False, memory_limit='5GB')
    x = dd.read_parquet(path + 'ohlcv_TRX-PERP_978627_rowgrouped.prq')
    print(x)
    s = x.max().compute()
    print(s)
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 24.13 GB -- Worker memory limit: 5.00 GB
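For completeness, the multi-process variant I was going to try next looks roughly like this (only a sketch; the worker count and per-worker limit are arbitrary, and I have not run it yet):

from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

if __name__ == '__main__':
    # Sketch: separate worker processes, each with its own memory limit.
    cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit='5GB')
    client = Client(cluster)
    x = dd.read_parquet(path + 'ohlcv_TRX-PERP_978627_rowgrouped.prq')
    print(x.max().compute())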