Find the number of rows a value is away from a specific row value in a Python DataFrame

Question

I have the following dataframe, df, and I would like to add the 'distance' column to it, such that:

date	active	distance
01/09/2022	1	0
02/09/2022	0	1
05/09/2022	0	2
06/09/2022	0	3
07/09/2022	0	4
08/09/2022	1	0
09/09/2022	0	1

Here, the distance is how far away each row is from the previous value of '1' in the active column, with the distance being the number of business days. I have tried using the following:

df['distance'] = np.where(
    df['active'] == 1, 0, df['distance'].shift(1,fill_value=0).astype(int) + 1
)

But it seems that Python does not like me referencing a column as I am defining it. I tried to also define a function to run this but unsure how to do so using .shift() as this command seems necessary in order to use to take the previous value and add to it.

Other variations of the above code do not seem to work since Python really wants to concatenate the shift and the 1 instead of just summing them together.

score 1 · Answer 1 · answered Sep 09 '22 at 05:20

Create groups by compare 1 with Series.cumsum and cumulative count them by GroupBy.cumcount:

df['distance'] = df.groupby(df['active'].eq(1).cumsum()).cumcount()
print (df)
         date  active  distance
0  01/09/2022       1         0
1  02/09/2022       0         1
2  05/09/2022       0         2
3  06/09/2022       0         3
4  07/09/2022       0         4
5  08/09/2022       1         0
6  09/09/2022       0         1

score 0 · Answer 2 · answered Sep 09 '22 at 02:40

0

your column can be entirely defined from the "active" column. your formula is the same as:

count_up = pd.Series(np.arange(len(df)), index=df.index)
distance = count_up - count_up.where(df.active).ffill()

answered Sep 09 '22 at 02:40

Michael Delgado

13,789
3
29
54

Claudio · Answer 3 · 2022-09-09T08:08:38.293

There are sure myriads of approaches all getting the same result. Here are six of them:

# ======================================================================
# ----------------------------------------------------------------------
# Provided in another answers (and fixed if necessary)
# Using merely pandas own methods:
df['distance'] = df.groupby(df['active'].eq(1).cumsum()).cumcount()
#     nice pure pandas and short one - in my eyes the best choice
print(df)
# -------------------------------
cnt = pd.Series(np.arange(df.shape[0]), index=df.index)
distance = (cnt-cnt.where(df.active.astype(bool)).ffill()).astype(int)
df['distance'] = distance
#     a much longer pure pandas one
print(df)
# -------------------------------
g = (df['active']==1).cumsum()
df.assign(distance=g.groupby(g).transform(lambda x: range(len(x))))
#     using in addition a function as replacement for .cumcount()
print(df)
# ======================================================================
# ----------------------------------------------------------------------
# Using a loop over values in column 'active':
d=[];c=-1
for i in df['active']:
    c+=1 
    if i: c = 0
    d.append(c)
df["distance"] = d
print(df)
# ----------------------------------------------------------------------
# Using a function  
c = -1
def f(i):
    global c    
    if i: c=0 
    else: c+=1; 
    return c
# -------------------------------
# with a list comprehension:
df['distance'] = [ f(i) for i in df['active'] ]
print(df)
# -------------------------------
# or pandas apply() function: 
df['distance'] = df['active'].apply(f)
print(df)

Below one of them including full code with data:

import pandas as pd
import numpy  as np
df_print = """\
date     active
01/09/2022   1  
02/09/2022   0  
05/09/2022   0  
06/09/2022   0  
07/09/2022   0  
08/09/2022   1  
09/09/2022   0"""
open('df_print', 'w').write(df_print)
df = pd.read_table('df_print', sep=r'\s\s\s*' ) # index_col = 0)
print(df)

distance =  []
counter   = -1
for index, row in df.iterrows():
    if row['active']:
        counter = 0
        distance.append(counter)
        continue
    counter +=1
    distance.append(counter)
df["distance"] = distance
print(df)

gives:

         date  active
0  01/09/2022       1
1  02/09/2022       0
2  05/09/2022       0
3  06/09/2022       0
4  07/09/2022       0
5  08/09/2022       1
6  09/09/2022       0
         date  active  distance
0  01/09/2022       1         0
1  02/09/2022       0         1
2  05/09/2022       0         2
3  06/09/2022       0         3
4  07/09/2022       0         4
5  08/09/2022       1         0
6  09/09/2022       0         1

score 0 · Answer 4 · answered Sep 09 '22 at 05:19

Use cumsum to mark the active groups.

g = (df['active']==1).cumsum()
df.assign(distance=g.groupby(g).transform(lambda x: range(len(x))))
print(df)

Result

          date  active  distance
0  01/09/2022        1         0
1  02/09/2022        0         1
2  05/09/2022        0         2
3  06/09/2022        0         3
4  07/09/2022        0         4
5  08/09/2022        1         0
6  09/09/2022        0         1

Find the number of rows a value is away from a specific row value in a Python DataFrame

4 Answers4