
Given a DataFrame with a timestamp column (ts), I'd like to resample these rows by the hour (downsample). Values that were previously indexed by ts should now be divided into ratios based on the number of minutes that fall within each hour. [note: divide data in ratios for NaN columns while doing resampling]

                     ts event  duration
0   2020-09-09 21:01:00     a        12
1   2020-09-10 00:10:00     a        22
2   2020-09-10 01:31:00     a       130
3   2020-09-10 01:50:00     b        60
4   2020-09-10 01:51:00     b        50
5   2020-09-10 01:59:00     b        26
6   2020-09-10 02:01:00     c        72
7   2020-09-10 02:51:00     b        51
8   2020-09-10 03:01:00     b        63
9   2020-09-10 04:01:00     c        79
10  2020-09-10 06:05:00     c       179



import pandas as pd

def create_dataframe():
    df = pd.DataFrame([{'duration': 12,  'event': 'a', 'ts': '2020-09-09 21:01:00'},
                       {'duration': 22,  'event': 'a', 'ts': '2020-09-10 00:10:00'},
                       {'duration': 130, 'event': 'a', 'ts': '2020-09-10 01:31:00'},
                       {'duration': 60,  'event': 'b', 'ts': '2020-09-10 01:50:00'},
                       {'duration': 50,  'event': 'b', 'ts': '2020-09-10 01:51:00'},
                       {'duration': 26,  'event': 'b', 'ts': '2020-09-10 01:59:00'},
                       {'duration': 72,  'event': 'c', 'ts': '2020-09-10 02:01:00'},
                       {'duration': 51,  'event': 'b', 'ts': '2020-09-10 02:51:00'},
                       {'duration': 63,  'event': 'b', 'ts': '2020-09-10 03:01:00'},
                       {'duration': 79,  'event': 'c', 'ts': '2020-09-10 04:01:00'},
                       {'duration': 179, 'event': 'c', 'ts': '2020-09-10 06:05:00'}])
    df.ts = pd.to_datetime(df.ts)
    return df

I want to estimate a "produced" value per hour, based on the ratio of time spent in each hour. Think of it as lines of code completed: how many lines were actually produced per hour?

For example, at 2020-09-10 00:10:00 we have 22. The period from 21:01 to 00:10 spans 189 minutes, so the 22 is split in proportion to the minutes that fall within each hour:

 59 min of 21:00 hours -> 7   => =ROUND(22/189*59,0)
 60 min of 22:00 hours -> 7   => =ROUND(22/189*60,0)
 60 min of 23:00 hours -> 7   => =ROUND(22/189*60,0)
 10 min of 00:00 hours -> 1   => =ROUND(22/189*10,0)
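
For reference, this split can be reproduced with a few lines of plain Python (the variable names below are only illustrative):

import pandas as pd

# Split 22 over the minutes of each hour between 21:01 and 00:10 (189 minutes).
start = pd.Timestamp('2020-09-09 21:01:00')
end = pd.Timestamp('2020-09-10 00:10:00')
value = 22
total_minutes = (end - start).total_seconds() / 60   # 189

hour = start.floor('H')
while hour < end:
    nxt = hour + pd.Timedelta(hours=1)
    minutes = (min(nxt, end) - max(hour, start)).total_seconds() / 60
    print(hour, round(value / total_minutes * minutes))   # prints 7, 7, 7, 1
    hour = nxt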

The result should be something like:

                     ts event  duration
0   2020-09-09 20:00:00     a       NaN
1   2020-09-09 21:00:00     a         7
2   2020-09-09 22:00:00     a         7
3   2020-09-09 23:00:00     a         7
4   2020-09-10 00:00:00     a         1
5   2020-09-10 01:00:00     b        ..
6   2020-09-10 02:00:00     c        ..

Problem with this approach:

It appears to me that this approach has a serious issue with rounding. If you look at row 1 below (the 4 recorded at 2020-09-10 10:00:00, covering the period since 07:00:00), it needs to be divided between 3 hours. Considering a base duration value of 1 (base unit), per-hour rounding does not sum back to 4, whereas the expected output further down does.

def create_dataframe2():
    df = pd.DataFrame([{'duration':4, 'event':'c', 'c':'event3.5', 'ts':'2020-09-10 07:00:00'},
                        {'duration':4, 'event':'c', 'c':'event3.5', 'ts':'2020-09-10 10:00:00'}])
    df.ts = pd.to_datetime(df.ts)
    return df

Source

   duration event         c                  ts
0         4     c  event3.5 2020-09-10 07:00:00
1         4     c  event3.5 2020-09-10 10:00:00

Expected Output

    ts_hourly           mins    duration
0   2020-09-10 07:00:00 60.0    2
1   2020-09-10 08:00:00 60.0    1
2   2020-09-10 09:00:00 60.0    1
3   2020-09-10 10:00:00 0.0     0
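
For reference, one way to keep the rounded parts summing back to the original value is a largest-remainder style allocation. A minimal sketch (the helper name split_preserving_total is made up for illustration):

# Minimal sketch of a total-preserving split (largest-remainder style).
def split_preserving_total(value, minutes_per_hour):
    total = sum(minutes_per_hour)
    raw = [value * m / total for m in minutes_per_hour]
    parts = [int(x) for x in raw]
    # hand the leftover units to the buckets with the largest fractional parts
    leftover = value - sum(parts)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - parts[i], reverse=True)
    for i in order[:leftover]:
        parts[i] += 1
    return parts

print(split_preserving_total(4, [60, 60, 60, 0]))   # [2, 1, 1, 0]
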
s_mj
  • Your printout of the source DataFrame is inconsistent with the code to create this DataFrame. E.g. in row 3 your code contains *00:31:00* whereas the printout - *01:31:00*. Another discrepancy is that the code contains *a* and *b* columns, but the printout - *produced* and *event*. Note also that *c* column in the code is not needed. – Valdi_Bo Dec 08 '20 at 09:55

2 Answers


The first step is to add a "previous ts" column to the source DataFrame:

df['tsPrev'] = df.ts.shift()

Then set ts column as the index:

df.set_index('ts', inplace=True)

The third step is to create an auxiliary index, composed of the original index and "full hours":

ind = df.event.resample('H').asfreq().index.union(df.index)
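
For the sample data, the first entries of ind should look roughly like this (every original timestamp plus each full hour in between; compare with the index of prodDet further below):

DatetimeIndex(['2020-09-09 21:00:00', '2020-09-09 21:01:00',
               '2020-09-09 22:00:00', '2020-09-09 23:00:00',
               '2020-09-10 00:00:00', '2020-09-10 00:10:00', ...])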

Then create an auxiliary DataFrame, reindexed with the index just created, and back-fill the event column:

df2 = df.reindex(ind)
df2.event = df2.event.bfill()

Define a function to be applied to each group of rows from df2:

def parts(grp):
    lstRow = grp.iloc[-1]       # Last row from group
    if pd.isna(lstRow.tsPrev):  # First group
        return pd.Series([lstRow.duration], index=[grp.index[0]], dtype=int)
    # Other groups: interpolate the duration linearly over time between tsPrev
    # (starting from 0) and the group's timestamps, then take the increments
    # between consecutive index points
    return -pd.Series([0], index=[lstRow.tsPrev]).append(grp.duration)\
        .interpolate(method='index').round().diff(-1)[:-1].astype(int)
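
Note that Series.append was removed in pandas 2.0; on newer pandas versions the same expression can be written with pd.concat, e.g. (a sketch of the equivalent line):

    # pandas >= 2.0 variant of the "other groups" return
    return -pd.concat([pd.Series([0], index=[lstRow.tsPrev]), grp.duration])\
        .interpolate(method='index').round().diff(-1)[:-1].astype(int)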

Then generate the source data for "produced" column in 2 steps:

  1. Generate detailed data:

    import numpy as np

    prodDet = df2.groupby(np.isfinite(df2.duration.values[::-1]).cumsum()[::-1],
                          sort=False).apply(parts).reset_index(level=0, drop=True)
    

    The source is df2, grouped in such a way that each group ends with a row holding a non-null value in the duration column. Each group is then processed with the parts function (a toy illustration of the grouping key follows after these two steps).

    The result is:

    2020-09-09 21:00:00    12
    2020-09-09 21:01:00     7
    2020-09-09 22:00:00     7
    2020-09-09 23:00:00     7
    2020-09-10 00:00:00     1
    2020-09-10 00:10:00    80
    2020-09-10 01:00:00    50
    2020-09-10 01:31:00    60
    2020-09-10 01:50:00    50
    2020-09-10 01:51:00    26
    2020-09-10 01:59:00    36
    2020-09-10 02:00:00    36
    2020-09-10 02:01:00    51
    2020-09-10 02:51:00    57
    2020-09-10 03:00:00     6
    2020-09-10 03:01:00    78
    2020-09-10 04:00:00     1
    2020-09-10 04:01:00    85
    2020-09-10 05:00:00    87
    2020-09-10 06:00:00     7
    dtype: int32
    
  2. Generate aggregated data, for the time being also as a Series:

    prod = prodDet.resample('H').sum().rename('produced')
    

    This time prodDet is resampled (broken down by hours) and the result is the sum of values.

    The result is:

    2020-09-09 21:00:00     19
    2020-09-09 22:00:00      7
    2020-09-09 23:00:00      7
    2020-09-10 00:00:00     81
    2020-09-10 01:00:00    222
    2020-09-10 02:00:00    144
    2020-09-10 03:00:00     84
    2020-09-10 04:00:00     86
    2020-09-10 05:00:00     87
    2020-09-10 06:00:00      7
    Freq: H, Name: produced, dtype: int32
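
To see how the grouping key in step 1 works, here is a toy illustration (the array values below are made up, not taken from the question's data):

import numpy as np

# Reversed cumulative count of non-null durations, reversed back:
# each group ends exactly at a row with a non-null duration.
dur = np.array([np.nan, np.nan, 5.0, np.nan, 3.0])
key = np.isfinite(dur[::-1]).cumsum()[::-1]
print(key)   # [2 2 2 1 1]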
    

Let's describe the content of prodDet:

  1. There is no row for 2020-09-09 20:00:00, because no source row is from this hour (your data start from 21:01:00).

  2. The row 21:00:00 12 comes from the first source row (you forgot about it when writing the expected result).

  3. The rows for 21:01:00, 22:00:00, 23:00:00 and 00:00:00 come from "partitioning" the row 00:10:00 a 22, just as in your expected result.

  4. The rows with 80 and 50 come from the row containing 130, divided between the index points 00:10:00 and 01:00:00.

  5. And so on.

Now we start to assemble the final result.

  1. Join prod (converted to a DataFrame) with event column:

    result = prod.to_frame().join(df2.event)
    
  2. Add a tsMin column - the minimal ts within each hour (as you asked in one of the comments):

    result['tsMin'] = df.duration.resample('H').apply(lambda grp: grp.index.min())
    
  3. Change the index into a regular column and set its name to ts (like in the source DataFrame):

    result = result.reset_index().rename(columns={'index': 'ts'})
    

The final result is:

                   ts  produced event               tsMin
0 2020-09-09 21:00:00        19     a 2020-09-09 21:01:00
1 2020-09-09 22:00:00         7     a                 NaT
2 2020-09-09 23:00:00         7     a                 NaT
3 2020-09-10 00:00:00        81     a 2020-09-10 00:10:00
4 2020-09-10 01:00:00       222     a 2020-09-10 01:31:00
5 2020-09-10 02:00:00       144     c 2020-09-10 02:01:00
6 2020-09-10 03:00:00        84     b 2020-09-10 03:01:00
7 2020-09-10 04:00:00        86     c 2020-09-10 04:01:00
8 2020-09-10 05:00:00        87     c                 NaT
9 2020-09-10 06:00:00         7     c 2020-09-10 06:05:00

E.g. the value of 81 for 00:00:00 is a sum of 1 and 80 (the first part resulting from row with 130), see prodDet above.

Some values in tsMin column are empty, for hours in which there is no source row.

If you want to totally drop the result from the first row (with duration == 12), change return pd.Series([lstRow.duration]... to return pd.Series([0]... (the fourth line of the parts function), as shown below.
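
That change would look like this (only the first argument of the Series changes):

    # inside parts(): return zero instead of the first row's duration
    return pd.Series([0], index=[grp.index[0]], dtype=int)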

To sum up, my solution is more pandasonic and significantly shorter than yours (17 lines in my solution vs. about 70 in yours, excluding comments).

Valdi_Bo
  • Why don't the values at 2020-09-10 00:00:00 sum to 81? They should certainly match. `3 2020-09-10 00:00:00 10.0 1.0 0 2020-09-10 00:00:00 50.0 80.0 1 2020-09-10 01:00:00 31.0 50.0 0 2020-09-10 01:00:00 19.0 60.0` – s_mj Dec 08 '20 at 09:33
  • The reason is the discrepancy between your code sample to create *df* and the printout (above it). I pointed it out in a comment to your post. Your code contains row with *'a':130, 'b':'a', ..., 'ts':'2020-09-10 00:31:00'* (and I used just this code). So this source row (with *130*) pertains to activity between *00:10:00* and *00:31:00* and in my intermediate result it is presented as *2020-09-10 00:10:00 130*. Start with the *df* created by your **code sample** and then check details. – Valdi_Bo Dec 08 '20 at 10:12
  • Thanks! @Valdi_Bo. I have changed the DF definition. can you revalidate your answer and update the column name post which I will accept. I am glad for your help in solving this. Also, if you can add mins in all the hours, might help other users! – s_mj Dec 08 '20 at 12:26
  • I corrected my solution, adding also *tsMin* column (mins in all the hours). – Valdi_Bo Dec 08 '20 at 14:50
  • Thanks, I was looking for getting duration in mins in each hour as against what you gave. can you add that as well? – s_mj Dec 08 '20 at 18:05
  • I realized that column named *duration* is actually not any duration. Example: Look at rows with times *00:10:00* and *01:31:00*. The time distance between them is *81* mins, but *duration* in row with time *01:31:00* is *130*. So this duration expresses something other than the time distance, and this DataFrame contains no data about how long did it take to execute the task pertaining to each row. – Valdi_Bo Dec 08 '20 at 18:49
  • This is sample data, and fields here have no meaning :) I got your code working., I request you to add the `mins` field which has a value of `60` for the first row having a `duration` value as `19` i.e. `(12 + 7)` – s_mj Dec 08 '20 at 19:24
  • It is still not clear how to compute values for *mins* column. – Valdi_Bo Dec 09 '20 at 07:57
  • I updated the question with a problem in this approach. I found a use case where a rounding error changes the interpretation. I agree the logic is good, but fixing that other scenario could be helpful! A solution that I have already added here: https://stackoverflow.com/questions/65217053/distribute-value-in-buckets – s_mj Dec 09 '20 at 14:56

I was not able to find a solution in pandas, so I created a solution with plain Python.

Basically, I iterate over all the values after sorting, and send two datetimes, start_time and end_time, to a function which does the processing.

from datetime import timedelta

import numpy as np
import pandas as pd


def get_ratio_per_hour(start_time: pd.Timestamp, end_time: pd.Timestamp, data_: int):

    # get total hours between the start and end, use this for looping
    totalhrs = lambda x: [1 for _ in range(int(x // 3600))
                          ] + [(x % 3600 / 3600
                                or 0.1  # added for the loop fix afterwards
                                )]
    
    # check if Start and End are not in same hour
    if start_time.hour != end_time.hour:
        seconds = (end_time - start_time).total_seconds()
        if seconds < 3600:
            parts_ = [1] + totalhrs(seconds)
        else:
            parts_ = totalhrs(seconds)
    else:                       
        # parts_ define the loop iterations
        parts_ = totalhrs((end_time - start_time).total_seconds())
    
    sum_of_hrs = sum(parts_)
    
    # for Constructing DF
    new_hours = []
    mins = []

    # Clone data
    start_time_ = start_time
    end_time_ = end_time

    for e in range(len(parts_)):
        # print(parts_[e])
        if sum_of_hrs != 0:

            if sum_of_hrs > 1:

                if end_time_.hour != start_time_.hour:

                    # Floor > based on the startTime +1 hour
                    floor_time = (start_time_ + timedelta(hours=1)).floor('H')

                    new_hours.append(start_time_.floor('H'))
                    mins.append((floor_time - start_time_).total_seconds() // 60)

                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time

                else:

                    # Hour is same.
                    floor_time = (start_time_ + timedelta(hours=1)).floor('H')
                    new_hours.append(start_time_.floor('H'))
                    mins.append((floor_time - start_time_).total_seconds() // 60)

                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time

            else:

                if end_time_.hour != start_time_.hour:

                    # Get round off hour
                    floor_time = (end_time_ + timedelta(hours=1)).floor('H')

                    new_hours.append(end_time_.floor('H'))
                    mins.append(60 - ((floor_time - end_time_).total_seconds() // 60))

                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time

                else:

                    # Hour is same.
                    floor_time = (end_time_ + timedelta(hours=1)).floor('H')

                    new_hours.append(end_time_.floor('H'))
                    mins.append((end_time_ - start_time_).total_seconds() // 60)

                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time

    # Get DataFrame Build

    df_out = pd.DataFrame()

    df_out['hours'] = pd.Series(new_hours)
    df_out['mins'] = pd.Series(mins)

    df_out['ratios'] = round(data_ / sum(mins) * df_out['mins'])

    return df_out
    

Now, let's run the code for each row:

time_val=[]
split_f_val=[]

split_field = 'duration'
time_field = 'ts'
# creating DataFrames for intermediate results!
df_final = pd.DataFrame()
df2 = pd.DataFrame()

for ix, row in df.iterrows():

    time_val.append(row[str(time_field)])
    split_f_val.append(int(row[str(split_field)]))
    
    # Skipping First Element for Processing. Therefore, having minimum two data values
    if ix !=0:
        
        # getting Last Two Values
        new_time_list=time_val[-2:]
        new_data_list=split_f_val[-2:]
        
        # get times to compare
        start_time=new_time_list[: -1][0]
        end_time=new_time_list[1:][0]
        
        # get latest Data to divide
        data_ = new_data_list[1:][0]
        
        # print(start_time)
        # print(end_time)
        df2 = get_ratio_per_hour(start_time,end_time, data_ )   
        df_final = pd.concat([df_final
                              , df2], ignore_index=True)
        
    else:
        
        # Create Empty DataFrame for First Value.
        df_final = pd.DataFrame([[np.nan,np.nan,np.nan] ], 
                     columns=['hours', 'mins', 'ratios'])
              
               
        df_final = pd.concat([df_final
                              , df2], ignore_index=True)
                     
    result = df_final.groupby(['hours'])['ratios'].sum()

Intermediate DataFrame:

                  hours  mins  ratios
0                   NaN   NaN     NaN
0   2020-09-09 21:00:00  59.0     7.0
1   2020-09-09 22:00:00  60.0     7.0
2   2020-09-09 23:00:00  60.0     7.0
3   2020-09-10 00:00:00  10.0     1.0
0   2020-09-10 00:00:00  50.0    80.0
1   2020-09-10 01:00:00  31.0    50.0
0   2020-09-10 01:00:00  19.0    60.0
0   2020-09-10 01:00:00   1.0    50.0
0   2020-09-10 01:00:00   8.0    26.0
0   2020-09-10 01:00:00   1.0    36.0
1   2020-09-10 02:00:00   1.0    36.0
0   2020-09-10 02:00:00  50.0    51.0
0   2020-09-10 02:00:00   9.0    57.0
1   2020-09-10 03:00:00   1.0     6.0
0   2020-09-10 03:00:00  59.0    78.0
1   2020-09-10 04:00:00   1.0     1.0
0   2020-09-10 04:00:00  59.0    85.0
1   2020-09-10 05:00:00  60.0    87.0
2   2020-09-10 06:00:00   5.0     7.0

Final Output:

            hours   ratios
2020-09-09 21:00:00 7.0
2020-09-09 22:00:00 7.0
2020-09-09 23:00:00 7.0
2020-09-10 00:00:00 81.0
2020-09-10 01:00:00 222.0
2020-09-10 02:00:00 144.0
2020-09-10 03:00:00 84.0
2020-09-10 04:00:00 86.0
2020-09-10 05:00:00 87.0
2020-09-10 06:00:00 7.0
s_mj