I have a csv file that looks like this when I load it:
# generate example data
users = ['A', 'B', 'C', 'D']
#dates = pd.date_range("2020-02-01 00:00:00", "2020-04-04 20:00:00", freq="H")
dates = pd.date_range("2020-02-01 00:00:00", "2020-02-04 20:00:00", freq="H")
idx = pd.MultiIndex.from_product([users, dates])
idx.names = ["user", "datehour"]
y = pd.Series(np.random.choice(a=[0, 1], size=len(idx)), index=idx).rename('y')
# write to csv and reload (turns out this matters)
y.to_csv('reprod_example.csv')
y = pd.read_csv('reprod_example.csv', parse_dates=['datehour'])
y = y.set_index(['user', 'datehour']).y
>>> y.head()
user datehour
A 2020-02-01 00:00:00 0
2020-02-01 01:00:00 0
2020-02-01 02:00:00 1
2020-02-01 03:00:00 0
2020-02-01 04:00:00 0
Name: y, dtype: int64
I have the following function to create a lagged feature of an index level:
def shift_index(a, dt_idx_name, lag_freq, lag):
# get datetime index of relevant level
ac = a.copy()
dti = ac.index.get_level_values(dt_idx_name)
# shift it
dti_shifted = dti.shift(lag, freq=lag_freq)
# put it back where you found it
ac.index.set_levels(dti_shifted, level=dt_idx_name, inplace=True)
return ac
But when I run:
y_lag = shift_index(y, 'datehour', 'H', 1)
, I get the following error:
ValueError: Level values must be unique...
(I can actually suppress this error by adding verify_integrity=False
in .index.set_levels...
in the function, but that (predictably) causes problems down the line)
Here's the weird part. If you run the example above but without saving/reloading from csv, it works. The reason seems to be, I think, that y.index.get_level_value('datehour')
shows a freq='H'
attribute right after it's created, but freq=None
once its reloaded from csv.
That makes sense, csv obviously doesn't save that metadata. But I've found it surprisingly difficult to set the freq attribute for a MultiIndexed series. For example this did nothing.
df.index.freq = pd.tseries.frequencies.to_offset("H")
. And this answer also didn't work for my MultiIndex.
So I think I could solve this if I were able to set the freq
attribute of the DateTime component of my MultiIndex. But my ultimate goal is to be create a version of my y
data with shifted DateTime MultiIndex component, such as with my shift_index
function above. Since I receive my data via csv, "just don't save to csv and reload" is not an option.