Create a column B that stores the sum of a group in a column A but only at a specific value

Question

I have a dataframe (below) where I read in time series data of when lights were turned and left on and then turned off. I created a series of calculated columns in my dataframe to identify each cluster (continuous "On" clusters). What I could like to do is sum the total length of time of that cluster but only store that value at the earliest time (first "On") of each particular cluster. *** The desired outcome would look like column "Length"

I have all the necessary components to create the cluster but cannot figure out the code to sum the timesteps of that particular cluster and store the value at the earliest time.

ANY help would be greatly appreciated!

The data is also here (I WAS UNABLE REPLACE THE 0s IN COLUMNS "CLUSTER" and "LENGTH" with NaNs but They should be. The IMAGE is correct."

import pandas as pd
import numpy as np

data = {'Series':  ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A' ],
        'Time': [0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5],
        'TimeStep': [0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5],
        'Light': ["Off", "Off", "On", "On", "On", "Off", "Off", "Off", "On", "On", "On", "On", "On", "Off", "Off"],
        'Cluster': [0, 0, 1, 1, 1, 0, 0, 0, 1,1, 1, 1, 1, 0, 0],
        'ClusterLength': [0, 0, 1.5, 0, 0, 0, 0, 0, 2.5, 0, 0, 0, 0, 0, 0],              }

df = pd.DataFrame (data, columns = ['Series','Time','TimeStep','Light','Cluster','ClusterLength'])
df

Please **always** try to include your data as text, not images. — Quang Hoang, Jun 20 '20 at 03:50
@QuangHoang i added the datatable but I couldn't figure out how to have NaN in the column so I replaced with with 0s. Ideally this would be NaNs though. — NightLearner, Jun 20 '20 at 04:08

Valdi_Bo · Accepted Answer · 2020-06-20T11:49:08.063

My solution computes both Cluster and ClusterLength columns, just as in your picture, i.e. with empty strings as "absent" values.

In order not to overwrite your original columns, I named them Clust and ClustLen.

Start from defining a function generating the cluster length for the current group of rows (grouped by the cluster number):

def getClustLen(grp):
    rv = pd.Series('', index=grp.index)
    if type(grp.iloc[0].Clust) == int:
        rv.iloc[0] = grp.TimeStep.sum()
    return rv

Then, to generate Clust column, run:

df['Clust'] = (df.Light.ne(df.Light.shift()) & df.Light.eq('On')).cumsum()
df.Clust = df.Clust.where(df.Light == 'On', '')

And the final step is to generate ClustLen column:

df['ClustLen'] = df.groupby('Clust').apply(getClustLen)\
    .reset_index(level=0, drop=True)

The result is:

   Series  Time  TimeStep Light  Cluster  ClusterLength Clust ClustLen
0       A   0.5       0.5   Off        0            0.0               
1       A   1.0       0.5   Off        0            0.0               
2       A   1.5       0.5    On        1            1.5     1      1.5
3       A   2.0       0.5    On        1            0.0     1         
4       A   2.5       0.5    On        1            0.0     1         
5       A   3.0       0.5   Off        0            0.0               
6       A   3.5       0.5   Off        0            0.0               
7       A   4.0       0.5   Off        0            0.0               
8       A   4.5       0.5    On        2            2.5     2      2.5
9       A   5.0       0.5    On        2            0.0     2         
10      A   5.5       0.5    On        2            0.0     2         
11      A   6.0       0.5    On        2            0.0     2         
12      A   6.5       0.5    On        2            0.0     2         
13      A   7.0       0.5   Off        0            0.0               
14      A   7.5       0.5   Off        0            0.0

In the final version, to save new values under original names, change Clust to Cluster and ClustLen to ClusterLength.

Create a column B that stores the sum of a group in a column A but only at a specific value

1 Answers1