1

I have a df with a DateTimeIndex of 30 minute intervals over a long period (> 1 year), so >17520 rows. For reasons related to daylight savings, two of the index values are repeated in the index and two values are missing. So the duplicated values are:

In[1]: df[df.index.duplicated('first')] 
Out[2]: 
                            a           b        c 
timestamp                                                                   
2012-10-07 01:00:00           NaN        NaN      NaN     
2012-10-07 01:30:00           NaN        NaN      NaN     
2013-10-06 01:00:00           NaN        NaN      NaN      
2013-10-06 01:30:00           NaN        NaN      NaN     

I want to change these to the missing values, 1 hour later:

In[3]: df[df.index.duplicated('first')].shift(1,freq="H")
Out[4]: 
                           a            b        c
timestamp                                                                   
2012-10-07 02:00:00           NaN        NaN      NaN     
2012-10-07 02:30:00           NaN        NaN      NaN     
2013-10-06 02:00:00           NaN        NaN      NaN        
2013-10-06 02:30:00           NaN        NaN      NaN 

But this doesn't change the index:

df[df.index.duplicated('first')] = df[df.index.duplicated('first')].shift(1,freq="H")

What would?

doctorer
  • 1,672
  • 5
  • 27
  • 50

1 Answers1

0

I think you need map duplicated index with rename by dict:

print (df)
                     a   b   c
timestamp                     
2013-10-06 01:00:00  1 NaN NaN
2013-10-06 01:30:00  2 NaN NaN
2013-10-06 01:00:00  3 NaN NaN
2013-10-06 01:30:00  4 NaN NaN
2012-10-08 01:30:00  5 NaN NaN
2013-10-10 01:00:00  6 NaN NaN


df1 = df[df.index.duplicated('first')]
d = dict(zip(df1.index, df1.shift(1,freq="H").index))
print (d)
{Timestamp('2013-10-06 01:00:00'): Timestamp('2013-10-06 02:00:00'), 
 Timestamp('2013-10-06 01:30:00'): Timestamp('2013-10-06 02:30:00')}

df = df.rename(index=d)
print (df)
                     a   b   c
timestamp                     
2013-10-06 02:00:00  1 NaN NaN
2013-10-06 02:30:00  2 NaN NaN
2013-10-06 02:00:00  3 NaN NaN
2013-10-06 02:30:00  4 NaN NaN
2012-10-08 01:30:00  5 NaN NaN
2013-10-10 01:00:00  6 NaN NaN

Similar solution:

idx = df.index[df.index.duplicated('first')]
d = dict(zip(idx, idx.to_series().shift(freq="H").index))
print (d)
{Timestamp('2013-10-06 01:00:00'): Timestamp('2013-10-06 02:00:00'), 
 Timestamp('2013-10-06 01:30:00'): Timestamp('2013-10-06 02:30:00')}

df = df.rename(index=d)
print (df)
                     a   b   c
timestamp                     
2013-10-06 02:00:00  1 NaN NaN
2013-10-06 02:30:00  2 NaN NaN
2013-10-06 02:00:00  3 NaN NaN
2013-10-06 02:30:00  4 NaN NaN
2012-10-08 01:30:00  5 NaN NaN
2013-10-10 01:00:00  6 NaN NaN
2013-10-06 02:30:00   8 NaN NaN
2012-10-08 01:30:00   9 NaN NaN
2013-10-10 01:00:00  10 NaN NaN

idx = df.index[df.index.duplicated('first')]
s = idx.to_series().shift(freq="H")
#swap index with values in Series
d = pd.Series(s.index.values, index = s.values).to_dict()
print (d)
{Timestamp('2013-10-06 01:00:00'): Timestamp('2013-10-06 02:00:00'), 
 Timestamp('2013-10-06 01:30:00'): Timestamp('2013-10-06 02:30:00')}

df = df.rename(index=d)
print (df)
                     a   b   c
timestamp                     
2013-10-06 02:00:00  1 NaN NaN
2013-10-06 02:30:00  2 NaN NaN
2013-10-06 02:00:00  3 NaN NaN
2013-10-06 02:30:00  4 NaN NaN
2012-10-08 01:30:00  5 NaN NaN
2013-10-10 01:00:00  6 NaN NaN

EDIT1:

You need add timedeltas created by cumcount with to_timedelta to original index.

delta = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='H')
print (delta)
timestamp
2013-10-06 01:00:00   00:00:00
2013-10-06 01:30:00   00:00:00
2013-10-06 01:00:00   01:00:00
2013-10-06 01:30:00   01:00:00
2012-10-08 01:30:00   00:00:00
2013-10-10 01:00:00   00:00:00
dtype: timedelta64[ns]

df.index = df.index + delta
print (df)
                     a   b   c
2013-10-06 01:00:00  1 NaN NaN
2013-10-06 01:30:00  2 NaN NaN
2013-10-06 02:00:00  3 NaN NaN
2013-10-06 02:30:00  4 NaN NaN
2012-10-08 01:30:00  5 NaN NaN
2013-10-10 01:00:00  6 NaN NaN
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • No. The first suggestion gives df1 containing ONLY the changed timestamps (not the rest of the year). The second suggestion shifts every timestamp in the df, not just the duplicated ones. – doctorer Apr 07 '17 at 06:17
  • Thanks. That's almost there, but not quite. This has renamed both instances of the duplicated values, so I now have duplicates of `'2012-10-07 02:00:00'` etc. I want to rename only the _second_ instance of each timestamp – doctorer Apr 07 '17 at 06:35
  • Can you explain why? – jezrael Apr 07 '17 at 06:36
  • So there are two occurrences of `'2012-10-07 01:00:00'` in the df, the first one is correct, the second should be changed to `'2012-10-07 02:00:00'` If I change them both, I have 2 occurences of the 2am timestamp, and none of the 1am. – doctorer Apr 09 '17 at 06:50