
I am loading multiple parquet files containing timeseries data together. But the loaded dask dataframe has unknown partitions because of which I can't apply various time series operations on it.

df = dd.read_parquet('/path/to/*.parquet', index='Timestamps')

For instance, df_resampled = df.resample('1T').mean().compute() gives the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-8e6f7f4340fd> in <module>
      1 df = dd.read_parquet('/path/to/*.parquet', index='Timestamps')
----> 2 df_resampled = df.resample('1T').mean().compute()

~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in resample(self, rule, closed, label)
   2627         from .tseries.resample import Resampler
   2628 
-> 2629         return Resampler(self, rule, closed=closed, label=label)
   2630 
   2631     @derived_from(pd.DataFrame)

~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/tseries/resample.py in __init__(self, obj, rule, **kwargs)
    118                 "for more information."
    119             )
--> 120             raise ValueError(msg)
    121         self.obj = obj
    122         self._rule = pd.tseries.frequencies.to_offset(rule)

ValueError: Can only resample dataframes with known divisions
See https://docs.dask.org/en/latest/dataframe-design.html#partitions
for more information.
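For reference, here is the plain-pandas version of what I am trying to do, on a small made-up frame (the index name and values are just illustrative; '1T' is the same rule as '1min'):

```python
import pandas as pd

# Hypothetical sample frame standing in for my timeseries data
idx = pd.date_range("2021-01-01", periods=6, freq="20s", name="Timestamps")
pdf = pd.DataFrame({"value": [0, 1, 2, 3, 4, 5]}, index=idx)

# Downsample to 1-minute bins and average within each bin
res = pdf.resample("1min").mean()
# Two bins: 00:00 holds values 0,1,2 and 01:00 holds values 3,4,5
```

This works in pandas because the index is a proper DatetimeIndex; in dask the same call additionally needs known divisions.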

I went to the link: https://docs.dask.org/en/latest/dataframe-design.html#partitions and it says,

In these cases (when divisions are unknown), any operation that requires a cleanly partitioned DataFrame with known divisions will have to perform a sort. This can generally be achieved by calling df.set_index(...).

I then tried the following, but with no success.

df = dd.read_parquet('/path/to/*.parquet')
df = df.set_index('Timestamps')

This step throws the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-468e9af0c4d6> in <module>
      1 df = dd.read_parquet(os.path.join(OUTPUT_DATA_DIR, '20*.gzip'))
----> 2 df.set_index('Timestamps')
      3 # df_resampled = df.resample('1T').mean().compute()

~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in set_index(***failed resolving arguments***)
   3915                 npartitions=npartitions,
   3916                 divisions=divisions,
-> 3917                 **kwargs,
   3918             )
   3919 

~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/shuffle.py in set_index(df, index, npartitions, shuffle, compute, drop, upsample, divisions, partition_size, **kwargs)
    483     if divisions is None:
    484         sizes = df.map_partitions(sizeof) if repartition else []
--> 485         divisions = index2._repartition_quantiles(npartitions, upsample=upsample)
    486         mins = index2.map_partitions(M.min)
    487         maxes = index2.map_partitions(M.max)

~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in __getattr__(self, key)
   3755             return self[key]
   3756         else:
-> 3757             raise AttributeError("'DataFrame' object has no attribute %r" % key)
   3758 
   3759     def __dir__(self):

AttributeError: 'DataFrame' object has no attribute '_repartition_quantiles'

Can anybody suggest the right way to load multiple timeseries files as a dask dataframe so that pandas-style timeseries operations can be applied to it?

Milan Jain