The problem
I have a pandas DataFrame with time series data for five years starting from 2006, to which I add a PeriodIndex that is automatically converted from Periods made with pd.period_range(), as seen in the code block below.
There, I want to resample() the first four years, and I've used the time series offset aliases mentioned in the docs. When I use freq=1W it works, but with e.g. a frequency of 2W (or likewise for 3 weeks) I get an error that says
IncompatibleFrequency: Input has different freq=2W-SUN from PeriodIndex(freq=W-SUN)
which is mentioned in the Periods part of the time series docs and it says:
Adding and subtracting integers from periods shifts the period by its own frequency. Arithmetic is not allowed between Period with different freq (span).
Honestly, I'm not sure how this relates to my issue.
The general form of the error is that if my freq=XY, it gives Input has different freq=XY from PeriodIndex(freq=Y), unless X is 1.
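To make the XY versus Y pattern concrete, here is a minimal sketch of how pandas parses these aliases. It only inspects the offsets with to_offset and does not reproduce the resampling itself; the variable names are chosen for illustration.

```python
from pandas.tseries.frequencies import to_offset

# '1W' and '2W' resolve to the same base alias (weeks anchored on Sunday)
# and differ only in the multiple n.
one_week = to_offset('1W')
two_weeks = to_offset('2W')

print(one_week.freqstr, one_week.n)    # W-SUN 1
print(two_weeks.freqstr, two_weeks.n)  # 2W-SUN 2
```

This at least shows where the W-SUN and 2W-SUN strings in the error message come from: the comparison appears to involve the multiple as well as the base alias.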
The data
The original dataset is from a csv-file with multiple columns, but in the example I only have a single column A with the same number of rows.
import numpy as np
import pandas as pd
# dummy DataFrame with 87648 rows
df = pd.DataFrame(dict(A=np.random.randint(1, 101, size=87648)))
# Add periods column, set as index
df['time'] = pd.period_range(start='2006-01-01 00:30', freq='30min', end='2011-01-01')
df = df.set_index('time')
Now, if I type df.index in e.g. IPython, I get the following output:
PeriodIndex(['2006-01-01 00:30', '2006-01-01 01:00', '2006-01-01 01:30',
'2006-01-01 02:00', '2006-01-01 02:30', '2006-01-01 03:00',
'2006-01-01 03:30', '2006-01-01 04:00', '2006-01-01 04:30',
'2006-01-01 05:00',
...
'2010-12-31 19:30', '2010-12-31 20:00', '2010-12-31 20:30',
'2010-12-31 21:00', '2010-12-31 21:30', '2010-12-31 22:00',
'2010-12-31 22:30', '2010-12-31 23:00', '2010-12-31 23:30',
'2011-01-01 00:00'],
dtype='period[30T]', name='time', length=87648, freq='30T')
This matches my expectations and the data in the csv file it is loaded from:
- There are 87648 rows.
- The first timestamp is 2006-01-01 00:30.
- The last timestamp is 2011-01-01 00:00.
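A quick sanity check of the row count, assuming the half-hourly range above: 2006 to 2010 is four 365-day years plus the leap year 2008, at 48 half-hours per day. The names idx and expected_rows are only for illustration.

```python
import pandas as pd

idx = pd.period_range(start='2006-01-01 00:30', freq='30min', end='2011-01-01')

# 4 regular years + 1 leap year (2008), 48 half-hour periods per day
expected_rows = (4 * 365 + 366) * 48

print(len(idx), expected_rows)
```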
The attempt(s)
# This works
df['A'].loc['2006':'2009'].resample('1W').mean().plot()
# This gives error mentioned above
df['A'].loc['2006':'2009'].resample('2W').mean().plot()
Further:
- I have the same problem if I try to use freq=6M, but it works if I do freq=1M. (Input has different freq=6M from PeriodIndex(freq=M))
- It also fails with 7D, which according to my expectations should be the same as 1W.
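For comparison, a minimal sketch of resampling the same kind of data after converting the PeriodIndex to timestamps. This assumes that switching to a DatetimeIndex is acceptable for the use case; whether the resulting bins match what period-based resampling would give is not verified here. The data is a smaller dummy series covering a single year.

```python
import numpy as np
import pandas as pd

# One year of dummy half-hourly data (smaller than the real dataset)
idx = pd.period_range(start='2006-01-01 00:30', periods=17520, freq='30min')
s = pd.Series(np.random.randint(1, 101, size=len(idx)), index=idx, name='A')

# Assumption: converting the PeriodIndex to a DatetimeIndex is acceptable
s_ts = s.to_timestamp()

# Multiples such as '2W' are accepted when resampling a DatetimeIndex
two_weekly = s_ts.resample('2W').mean()
print(two_weekly.head())
```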
Additional thoughts
There are obviously situations where certain periods won't work, but for half-hourly data over several years, I'd expect it to be possible to resample to any lower frequency, such as an arbitrary number of hours, days, weeks or months.
According to this answer, the following is a better approach:
df['A'].resample('D').interpolate()[::7]
but that gives me an InvalidIndexError: Reindexing only valid with uniquely valued Index objects. (I assume that there are duplicate index values at the hours when daylight saving time switches from summer to winter time.)
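The duplicate-index assumption can be checked directly. A minimal sketch, using a small made-up index (the timestamps are hypothetical and not taken from the csv file):

```python
import pandas as pd

# Hypothetical timestamps with one repeated value, standing in for the csv index
idx = pd.DatetimeIndex(['2006-10-29 01:30', '2006-10-29 02:00',
                        '2006-10-29 02:00', '2006-10-29 02:30'])

# False if any index value occurs more than once
print(idx.is_unique)

# Show every entry that takes part in a duplicate
dupes = idx[idx.duplicated(keep=False)]
print(dupes)
```

Running the same two checks on the real index would confirm or rule out duplicates as the cause of the InvalidIndexError.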
Also, I'm under the impression that pandas aims to do such "heavy lifting" for us, and I assume that a deeper understanding would enable users to utilize it without such workarounds.
There are several posts on SO about resampling, but my searches for "IncompatibleFrequency" and "Input has different freq" turned up no other posts on this error.
The question
I would like to understand why the error is raised, and how to resolve the issue of resampling to arbitrary periods - or at least to understand the limitations.