2

I have a csv file that looks like this when I load it:

# generate example data
users = ['A', 'B', 'C', 'D']
#dates = pd.date_range("2020-02-01 00:00:00", "2020-04-04 20:00:00", freq="H")
dates = pd.date_range("2020-02-01 00:00:00", "2020-02-04 20:00:00", freq="H")
idx = pd.MultiIndex.from_product([users, dates])
idx.names = ["user", "datehour"]
y = pd.Series(np.random.choice(a=[0, 1], size=len(idx)), index=idx).rename('y')

# write to csv and reload (turns out this matters)
y.to_csv('reprod_example.csv')
y = pd.read_csv('reprod_example.csv', parse_dates=['datehour'])
y = y.set_index(['user', 'datehour']).y

>>> y.head()
user  datehour           
A     2020-02-01 00:00:00    0
      2020-02-01 01:00:00    0
      2020-02-01 02:00:00    1
      2020-02-01 03:00:00    0
      2020-02-01 04:00:00    0
Name: y, dtype: int64

I have the following function to create a lagged feature of an index level:

def shift_index(a, dt_idx_name, lag_freq, lag):

    # get datetime index of relevant level
    ac = a.copy()
    dti = ac.index.get_level_values(dt_idx_name)

    # shift it
    dti_shifted = dti.shift(lag, freq=lag_freq)

    # put it back where you found it
    ac.index.set_levels(dti_shifted, level=dt_idx_name, inplace=True)

    return ac

But when I run: y_lag = shift_index(y, 'datehour', 'H', 1), I get the following error:

ValueError: Level values must be unique...

(I can actually suppress this error by adding verify_integrity=False in .index.set_levels... in the function, but that (predictably) causes problems down the line)

Here's the weird part. If you run the example above but without saving/reloading from csv, it works. The reason seems to be, I think, that y.index.get_level_value('datehour') shows a freq='H' attribute right after it's created, but freq=None once its reloaded from csv.

That makes sense, csv obviously doesn't save that metadata. But I've found it surprisingly difficult to set the freq attribute for a MultiIndexed series. For example this did nothing. df.index.freq = pd.tseries.frequencies.to_offset("H"). And this answer also didn't work for my MultiIndex.

So I think I could solve this if I were able to set the freq attribute of the DateTime component of my MultiIndex. But my ultimate goal is to be create a version of my y data with shifted DateTime MultiIndex component, such as with my shift_index function above. Since I receive my data via csv, "just don't save to csv and reload" is not an option.

Max Power
  • 8,265
  • 13
  • 50
  • 91
  • 1
    Great question. I have a similar issue, that the `freq` is set to `None` after an opertation - in my case, when I reindex with `df.index = pd.MultiIndex.from_arrays([qhour.index, qhour.index.year])`. I'm interested to see if there is a more... straightforward... way of doing things (no offence @mcskinner ;)) – ElRudi Apr 20 '20 at 15:05

1 Answers1

2

After much fidgeting, I was able to set an hourly frequency using asfreq('H') on grouped data, such that each group has unique values for the datehour index.

y = pd.read_csv('reprod_example.csv', parse_dates=['datehour'])
y = y.groupby('user').apply(lambda df: df.set_index('datehour').asfreq('H')).y

Peeking at an index value shows the correct frequency.

y.index[0]                                                                                                                                                                                                                          
# ('A', Timestamp('2020-02-01 00:00:00', freq='H'))

All this is doing is setting the index in two parts. The user goes first so that the nested datehour index can be unique within it. Once the datehour index is unique, then asfreq can be used without difficulty.

If you try asfreq on a non-unique index, it will not work.

y_load.set_index('datehour').asfreq('H')
# ---------------------------------------------------------------------------
# ValueError                                Traceback (most recent call last)
# <ipython-input-433-3ba51b619417> in <module>
# ----> 1 y_load.set_index('datehour').asfreq('H')
# ...
# ValueError: cannot reindex from a duplicate axis
mcskinner
  • 2,620
  • 1
  • 11
  • 21
  • Wow, setting an index inside a groupby/apply/lambda to set the freq, wild. After much fiddling indeed I am sure. I am super impressed that you figured that out, and I can confirm it absolutely works (+1). Hope you don't mind I'm gonna wait a day or two and see if anyone comes up with a simpler / more idiomatic way, which I gotta think must exist? – Max Power Apr 18 '20 at 23:51
  • 1
    Totally fine. Glad this helped! And yeah, pretty wild. Pandas is great when it does what you want out of the box, but can get pretty awkward if it doesn't. – mcskinner Apr 18 '20 at 23:52
  • 1
    posted a question for the pandas devs here, we'll see if they have a simpler way. https://github.com/pandas-dev/pandas/issues/33647 – Max Power Apr 19 '20 at 03:54