0

I have the following dataframe, df, and I would like to add the 'distance' column to it, such that:

date active distance
01/09/2022 1 0
02/09/2022 0 1
05/09/2022 0 2
06/09/2022 0 3
07/09/2022 0 4
08/09/2022 1 0
09/09/2022 0 1

Here, the distance is how far away each row is from the previous value of '1' in the active column, with the distance being the number of business days. I have tried using the following:

df['distance'] = np.where(
    df['active'] == 1, 0, df['distance'].shift(1,fill_value=0).astype(int) + 1
)

But it seems that Python does not like me referencing a column as I am defining it. I tried to also define a function to run this but unsure how to do so using .shift() as this command seems necessary in order to use to take the previous value and add to it.

Other variations of the above code do not seem to work since Python really wants to concatenate the shift and the 1 instead of just summing them together.

cottontail
  • 10,268
  • 18
  • 50
  • 51

4 Answers4

1

Create groups by compare 1 with Series.cumsum and cumulative count them by GroupBy.cumcount:

df['distance'] = df.groupby(df['active'].eq(1).cumsum()).cumcount()
print (df)
         date  active  distance
0  01/09/2022       1         0
1  02/09/2022       0         1
2  05/09/2022       0         2
3  06/09/2022       0         3
4  07/09/2022       0         4
5  08/09/2022       1         0
6  09/09/2022       0         1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

your column can be entirely defined from the "active" column. your formula is the same as:

count_up = pd.Series(np.arange(len(df)), index=df.index)
distance = count_up - count_up.where(df.active).ffill()
Michael Delgado
  • 13,789
  • 3
  • 29
  • 54
0

There are sure myriads of approaches all getting the same result. Here are six of them:

# ======================================================================
# ----------------------------------------------------------------------
# Provided in another answers (and fixed if necessary)
# Using merely pandas own methods:
df['distance'] = df.groupby(df['active'].eq(1).cumsum()).cumcount()
#     nice pure pandas and short one - in my eyes the best choice
print(df)
# -------------------------------
cnt = pd.Series(np.arange(df.shape[0]), index=df.index)
distance = (cnt-cnt.where(df.active.astype(bool)).ffill()).astype(int)
df['distance'] = distance
#     a much longer pure pandas one
print(df)
# -------------------------------
g = (df['active']==1).cumsum()
df.assign(distance=g.groupby(g).transform(lambda x: range(len(x))))
#     using in addition a function as replacement for .cumcount()
print(df)
# ======================================================================
# ----------------------------------------------------------------------
# Using a loop over values in column 'active':
d=[];c=-1
for i in df['active']:
    c+=1 
    if i: c = 0
    d.append(c)
df["distance"] = d
print(df)
# ----------------------------------------------------------------------
# Using a function  
c = -1
def f(i):
    global c    
    if i: c=0 
    else: c+=1; 
    return c
# -------------------------------
# with a list comprehension:
df['distance'] = [ f(i) for i in df['active'] ]
print(df)
# -------------------------------
# or pandas apply() function: 
df['distance'] = df['active'].apply(f)
print(df)

Below one of them including full code with data:

import pandas as pd
import numpy  as np
df_print = """\
date     active
01/09/2022   1  
02/09/2022   0  
05/09/2022   0  
06/09/2022   0  
07/09/2022   0  
08/09/2022   1  
09/09/2022   0"""
open('df_print', 'w').write(df_print)
df = pd.read_table('df_print', sep=r'\s\s\s*' ) # index_col = 0)
print(df)

distance =  []
counter   = -1
for index, row in df.iterrows():
    if row['active']:
        counter = 0
        distance.append(counter)
        continue
    counter +=1
    distance.append(counter)
df["distance"] = distance
print(df)

gives:

         date  active
0  01/09/2022       1
1  02/09/2022       0
2  05/09/2022       0
3  06/09/2022       0
4  07/09/2022       0
5  08/09/2022       1
6  09/09/2022       0
         date  active  distance
0  01/09/2022       1         0
1  02/09/2022       0         1
2  05/09/2022       0         2
3  06/09/2022       0         3
4  07/09/2022       0         4
5  08/09/2022       1         0
6  09/09/2022       0         1
Claudio
  • 7,474
  • 3
  • 18
  • 48
0

Use cumsum to mark the active groups.

g = (df['active']==1).cumsum()
df.assign(distance=g.groupby(g).transform(lambda x: range(len(x))))
print(df)

Result

          date  active  distance
0  01/09/2022        1         0
1  02/09/2022        0         1
2  05/09/2022        0         2
3  06/09/2022        0         3
4  07/09/2022        0         4
5  08/09/2022        1         0
6  09/09/2022        0         1
jch
  • 3,600
  • 1
  • 15
  • 17