1

Given for instance a DataFrame with 1h Period, I would like to set 0 & 1 values in a new column whenever a new 5h Period starts and finishes respectively.

Let's consider this input data for instance:

import pandas as pd
from random import seed, randint
from collections import OrderedDict

p1h = pd.period_range(start='2020-02-01 00:00', end='2020-03-04 00:00', freq='1h', name='p1h')

seed(1)
values = [randint(0,10) for p in p1h]
df = pd.DataFrame({'Values' : values}, index=p1h)

Result

df.head(10)

                  Values
p1h                     
2020-02-01 00:00       2
2020-02-01 01:00       9
2020-02-01 02:00       1
2020-02-01 03:00       4
2020-02-01 04:00       1
2020-02-01 05:00       7
2020-02-01 06:00       7
2020-02-01 07:00       7
2020-02-01 08:00      10
2020-02-01 09:00       6

Is there any way to set-up a new column so as to get following result? (1st & last row for each period is initialized with a 0 and a 1 respectively)

df['period5h'] = df.resample('5h').???

df.head(10)

                  Values   period5h
p1h                     
2020-02-01 00:00       2          0   <- 1st row of 5h period
2020-02-01 01:00       9
2020-02-01 02:00       1
2020-02-01 03:00       4
2020-02-01 04:00       1          1   <- last row of 5h period
2020-02-01 05:00       7          0   <- 1st row of 5h period
2020-02-01 06:00       7
2020-02-01 07:00       7
2020-02-01 08:00      10
2020-02-01 09:00       6          1   <- last row of 5h period

Please, can this be done in some way with some functions in pandas?

The ultimate goal is then to fill empty values by linear interpolation between 0 and 1 so as to get a progress in % of current row with respect to the 5h period.

Another track / question

Another approach could be to initialize a 2nd DataFrame with 5h PeriodIndex, initialize values of a new column to 1, and then upsample the PeriodIndex back to 1H to merge both DataFrames.

A shift(-1) would initialize the last row of the period.

I would repeat the process without the shift for the value 0.

Then, how can I create this new DataFrame so that I can merge it to the 1st? I tried some merge commands, but I have an error indicating me that both index don't have the same frequency.

Thanks for your help! Bests

pierre_j
  • 895
  • 2
  • 11
  • 26
  • if your data is really indexed with a 1h period, you can do `df['period5h'] = np.arange(len(df))%5/4` that would create the interpolation as well – Ben.T Apr 03 '20 at 17:49

3 Answers3

1

Not most pythonic approach but it works.

import pandas as pd
from random import seed, randint
from collections import OrderedDict
import time
p1h = pd.period_range(start='2020-02-01 00:00', end='2040-03-04 00:00', freq='1h', name='p1h')

seed(1)
values = [randint(0,10) for p in p1h]
df = pd.DataFrame({'Values' : values}, index=p1h)

t1 = time.time()
for i in range(len(df['Values'])):
  if (i+1)% 5 == 1:
    df['Values'].iloc[i] = 0
  elif (i+1) % 5 == 0:
    df['Values'].iloc[i] = 1
t2 = time.time()
df.head(20)

print(t2-t1)


time: 8.770591259002686

Approach 2:

import pandas as pd
from random import seed, randint
from collections import OrderedDict
import time
p1h = pd.period_range(start='2020-02-01 00:00', end='2040-03-04 00:00', freq='1h', name='p1h')

seed(1)
values = [randint(0,10) for p in p1h]
df = pd.DataFrame({'Values' : values}, index=p1h)

t1 = time.time()

df['Values'].iloc[range(0,len(df['Values']),5)] = 0
df['Values'].iloc[range(4,len(df['Values']),5)] = 1
t2 = time.time()
df.head(20)

print(t2-t1)

time: 0.009400367736816406

Zabir Al Nazi
  • 10,298
  • 4
  • 33
  • 60
  • Hello furcifer. Thanks for your proposal, but I am looking for a loop-less solution. Execution time is critical. – pierre_j Apr 03 '20 at 19:27
  • @furcufer, thanks a lot for your contribution. I am sorry, I really want to work with period. these can be virtually *any*: 5 hours, 1 day, 1 week, 1 month... But thanks nonetheless, I do appreciate your help! – pierre_j Apr 03 '20 at 20:52
  • that's exactly what it does!!! You can just calculate the length in the dataframe from the period and can use range. If period is the thing then it's definitely does what it says in very short time. – Zabir Al Nazi Apr 04 '20 at 06:13
1

Use the indices attribute of the resampled object to find the first and last indices of the groups. This will work even if the data doesn't have a regular frequency, or has a frequency that doesn't perfectly divide the resampling frequency. Groups will only have a single measurement get set to 1 as opposed to 0. Then we set the values accordingly

i1 = [] # Last `.iloc` index labels
i0 = [] # First `.iloc` index labels
for k,v in df.resample('5H').indices.items():
    i0.append(v[0])
    i1.append(v[-1])

df.loc[df.index[i0], 'period_5H'] = 0
df.loc[df.index[i1], 'period_5H'] = 1

                  Values  period_5H
p1h                                
2020-02-01 00:00       2        0.0
2020-02-01 01:00       9        NaN
2020-02-01 02:00       1        NaN
2020-02-01 03:00       4        NaN
2020-02-01 04:00       1        1.0
2020-02-01 05:00       7        0.0
2020-02-01 06:00       7        NaN
2020-02-01 07:00       7        NaN
2020-02-01 08:00      10        NaN
2020-02-01 09:00       6        1.0
2020-02-01 10:00       3        0.0
...
ALollz
  • 57,915
  • 7
  • 66
  • 89
  • Hi aLollz, I am looking for a loop-less solution. Actually I feel I could initialize a new DataFrame with '5H' PeriodIndex derived from my initial dataframe, I initialize a new column to value 1, then I perform a merge between two DataFrames. I can then do a shift-1 of the new column. Then repeat for value 0 without the shift. But I don't know how to abotain this 2nd DataFrame that I can merge. Do you have any idea? – pierre_j Apr 03 '20 at 19:22
  • @pierre_j it really depends whether you have regularly frequencies that evenly divide your resampling. If you don't then there's no simple way because `resample` will output a regularly sampled DataFrame and then you're stuck trying to guess what rows merged back to those. I guess you could use an `asof` merge, but it becomes exceedingly complicated compared to literally picking the first and last indices within each group. – ALollz Apr 03 '20 at 19:45
  • @ ALollz. yes the PeriodIndex is obtained with a period_range() function, so it is evenly spaced. – pierre_j Apr 03 '20 at 19:47
1

Ok, I finally setup to use following approach which is rather fast (no loop)

 super_pi = pd.period_range(start='2020-01-01 00:00', end='2020-06-01 00:00', freq='5h', name='p5h')
 super_df = pd.DataFrame({'End' : 1, 'Start' : 0}, index=super_pi).resample('1h').first()
 # We know last row is a 1 (end of period)
 super_df['End'] = super_df['End'].shift(-1, fill_value=1)
 super_df['Period'] = super_df[['End','Start']].sum(axis=1, min_count=1)

Result

 supder_df.head(10)

                   End  Start  Period
 p5h                                 
 2020-01-01 00:00  NaN    0.0     0.0
 2020-01-01 01:00  NaN    NaN     NaN
 2020-01-01 02:00  NaN    NaN     NaN
 2020-01-01 03:00  NaN    NaN     NaN
 2020-01-01 04:00  1.0    NaN     1.0
 2020-01-01 05:00  NaN    0.0     0.0
 2020-01-01 06:00  NaN    NaN     NaN
 2020-01-01 07:00  NaN    NaN     NaN
 2020-01-01 08:00  NaN    NaN     NaN

Bests,

pierre_j
  • 895
  • 2
  • 11
  • 26