
I am trying to read a CSV file with dask and then resample it based on its timestamp index.

The csv file has content like:

Time,data
2015-01-01,0
2015-01-02,1
2015-01-03,2
2015-01-04,3
...

Method 1: Use dask to load the data directly and then set up the index:

import pandas as pd
import dask.dataframe as dd
data_sample = dd.read_csv('test_data.csv')
meta=pd.Series([], name='time',dtype=pd.Timestamp)
data_sample['Time'] = data_sample['Time'].map_partitions(pd.to_datetime, meta=meta)
data_sample2 = data_sample.set_index(data_sample['Time'])
data_sample2.index.head()

And I got:

DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
           '2015-01-05'],
          dtype='datetime64[ns]', name='Time', freq=None)

However, when I try to run:

data_sample2.resample('1M').mean()

I get the following error:

TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
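
As a side note, the same message can be reproduced in plain pandas whenever the index has object dtype, even if every value in it is a Timestamp. The sketch below uses a made-up five-row frame and a daily rule (chosen only to keep the example small), so it illustrates where the message comes from rather than what dask does internally:

```python
import pandas as pd

# Made-up frame for illustration: the index holds Timestamps,
# but its dtype is object, so it is a plain Index.
stamps = pd.date_range("2015-01-01", periods=5, freq="D")
df = pd.DataFrame({"data": range(5)}, index=pd.Index(list(stamps), dtype=object))

try:
    df.resample("D").mean()
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Converting the index to a real DatetimeIndex lets resample run.
df.index = pd.DatetimeIndex(df.index)
daily_mean = df.resample("D").mean()
print(type(daily_mean.index).__name__)
```

So the displayed values looking like datetimes is not enough; what matters is the index type that resample sees.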

Method 2:

If I load the data with pandas and then convert it to a dask.dataframe, it seems to work:

pd_data = pd.read_csv('test_data.csv')
pd_data['Time'] = pd.to_datetime(pd_data['Time'])
pd_data.set_index(pd_data['Time'],inplace=True)
pd_data.index
data_sample_from_pd = dd.from_pandas(pd_data, npartitions=1)
data_sample_from_pd.index.head()

And the dtype seems to be the same:

DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
           '2015-01-05'],
          dtype='datetime64[ns]', name='Time', freq=None)

And the resample works fine:

data_sample_from_pd.resample('1M').mean().head()

            data
2015-01-31  15.0
2015-02-28  44.5
2015-03-31  74.0
2015-04-30  94.5

Any idea why these two methods give different results when running resample? Any suggestions on how to get method 1 to work? Thank you!

DigitalPig