I am trying to read a CSV file with dask and then resample it on its timestamp index.
The CSV file looks like this:
Time,data
2015-01-01,0
2015-01-02,1
2015-01-03,2
2015-01-04,3
...
Method 1: Use dask to load the data directly and then set up the index:
import pandas as pd
import dask.dataframe as dd
data_sample = dd.read_csv('test_data.csv')
meta=pd.Series([], name='time',dtype=pd.Timestamp)
data_sample['Time'] = data_sample['Time'].map_partitions(pd.to_datetime, meta=meta)
data_sample2 = data_sample.set_index(data_sample['Time'])
data_sample2.index.head()
This returns:
DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
'2015-01-05'],
dtype='datetime64[ns]', name='Time', freq=None)
However, when I then try to run:
data_sample2.resample('1M').mean()
I get the following error:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
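For comparison, the same message can be reproduced in plain pandas when the index holds strings instead of datetimes. The sketch below is only an illustration under assumptions: the 90 daily rows are a hypothetical stand-in for test_data.csv, 'MS' (month start) is used in place of '1M', and it is only a guess that something similar happens to the index in Method 1.

```python
import pandas as pd

# Hypothetical stand-in for test_data.csv: 90 daily rows from 2015-01-01.
times = pd.date_range("2015-01-01", periods=90, freq="D")
frame = pd.DataFrame({"data": range(90)})

# Case 1: the index holds date strings, so it is a generic Index.
string_index = pd.Index(times.strftime("%Y-%m-%d"), name="Time")
as_strings = frame.set_index(string_index)
try:
    as_strings.resample("MS").mean()
    error_message = None
except TypeError as exc:
    # Expected to mention DatetimeIndex, TimedeltaIndex or PeriodIndex.
    error_message = str(exc)

# Case 2: the same values as real datetimes, so resample is accepted.
as_datetimes = frame.set_index(pd.Index(times, name="Time"))
monthly = as_datetimes.resample("MS").mean()

print(error_message)
print(monthly)
```

If the guess is right, the values shown by head() can look like datetimes while the index type that resample checks is still a generic Index.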
Method 2: Use pandas to load the data and then convert it to a dask dataframe, which seems to work:
pd_data = pd.read_csv('test_data.csv')
pd_data['Time'] = pd.to_datetime(pd_data['Time'])
pd_data.set_index(pd_data['Time'],inplace=True)
pd_data.index
data_sample_from_pd = dd.from_pandas(pd_data, npartitions=1)
data_sample_from_pd.index.head()
And the dtype seems to be the same:
DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
'2015-01-05'],
dtype='datetime64[ns]', name='Time', freq=None)
And the resample works fine:
data_sample_from_pd.resample('1M').mean().head()
data
2015-01-31 15.0
2015-02-28 44.5
2015-03-31 74.0
2015-04-30 94.5
Any idea why these two methods give different results when calling resample? Any suggestions on how to get Method 1 to work? Thank you!