5

My dataframe looks like this;

df = pd.DataFrame({'Col1':[0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0]
                   ,'Col2':[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]})

If col1 contains the value 1 in column 2 I want to forward fill with 1 n number of times. For example, if n = 4 then I would need the result to look like this.

df = pd.DataFrame({'Col1':[0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0]
                   ,'Col2':[0,1,1,1,1,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1,1]})

I think I could do this using a for loop with a counter that resets every time a condition occurs but is there a faster way to produce the same result?

Thanks!

redbaron1981
  • 407
  • 3
  • 9

4 Answers4

4

Approach #1 : A NumPy based one with 1D convolution -

N = 4 # window size
K = np.ones(N,dtype=bool)
df['Col2'] = (np.convolve(df.Col1,K)[:-N+1]>0).view('i1')

A more compact one-liner -

df['Col2'] = (np.convolve(df.Col1,[1]*N)[:-N+1]>0).view('i1')

Approach #2 : Here's one with SciPy's binary_dilation -

from scipy.ndimage.morphology import binary_dilation

N = 4 # window size
K = np.ones(N,dtype=bool)
df['Col2'] = binary_dilation(df.Col1,K,origin=-(N//2)).view('i1')

Approach #3 : Squeeze out the best off NumPy with it's strided-view-based tool -

from skimage.util.shape import view_as_windows

N = 4 # window size
mask = df.Col1.values==1
w = view_as_windows(mask,N)
idx = len(df)-(N-mask[-N:].argmax())
if mask[-N:].any():
    mask[idx:idx+N-1] = 1
w[mask[:-N+1]] = 1
df['Col2'] = mask.view('i1')

Benchmarking

Setup with given sample scaled up by 10,000x -

In [67]: df = pd.DataFrame({'Col1':[0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0]
    ...:                    ,'Col2':[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]})
    ...: 
    ...: df = pd.concat([df]*10000)
    ...: df.index = range(len(df.index))

Timings

# @jezrael's soln
In [68]: %%timeit
    ...: n = 3
    ...: df['Col2_1'] = df['Col1'].where(df['Col1'].eq(1)).ffill(limit=n).fillna(df['Col1']).astype(int)
5.15 ms ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# App-1 from this post
In [72]: %%timeit
    ...: N = 4 # window size
    ...: K = np.ones(N,dtype=bool)
    ...: df['Col2_2'] = (np.convolve(df.Col1,K)[:-N+1]>0).view('i1')
1.41 ms ± 20.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# App-2 from this post
In [70]: %%timeit
    ...: N = 4 # window size
    ...: K = np.ones(N,dtype=bool)
    ...: df['Col2_3'] = binary_dilation(df.Col1,K,origin=-(N//2)).view('i1')
2.92 ms ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# App-3 from this post
In [35]: %%timeit
    ...: N = 4 # window size
    ...: mask = df.Col1.values==1
    ...: w = view_as_windows(mask,N)
    ...: idx = len(df)-(N-mask[-N:].argmax())
    ...: if mask[-N:].any():
    ...:     mask[idx:idx+N-1] = 1
    ...: w[mask[:-N+1]] = 1
    ...: df['Col2_4'] = mask.view('i1')
1.22 ms ± 3.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# @yatu's soln
In [71]: %%timeit
    ...: n = 4
    ...: ix = (np.flatnonzero(df.Col1 == 1) + np.arange(n)[:,None]).ravel('F')
    ...: df.loc[ix, 'Col2_5'] = 1
7.55 ms ± 32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Divakar
  • 218,885
  • 19
  • 262
  • 358
3

For general solution replace non 1 values to missing values by Series.where and forward filling 1 values with limit parameter, last replace missing values by originals:

n = 3
df['Col2'] = df['Col1'].where(df['Col1'].eq(1)).ffill(limit=n).fillna(df['Col1']).astype(int)

print (df)
    Col1  Col2
0      0     0
1      1     1
2      0     1
3      0     1
4      0     1
5      0     0
6      0     0
7      0     0
8      1     1
9      0     1
10     0     1
11     0     1
12     0     0
13     0     0
14     0     0
15     0     0
16     0     0
17     1     1
18     0     1
19     0     1
20     0     1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
3

Here's a NumPy based approach using np.flatnonzero to obtain the indices where Col1 is 1 and taking the broadcast sum with an arange up to n:

n = 4
ix = (np.flatnonzero(df.Col1 == 1) + np.arange(n)[:,None]).ravel('F')
df.loc[ix, 'Col2'] = 1

print(df)

     Col1  Col2
0      0     0
1      1     1
2      0     1
3      0     1
4      0     1
5      0     0
6      0     0
7      0     0
8      1     1
9      0     1
10     0     1
11     0     1
12     0     0
13     0     0
14     0     0
15     0     0
16     0     0
17     1     1
18     0     1
19     0     1
20     0     1
yatu
  • 86,083
  • 12
  • 84
  • 139
2

Something with reindex

N=4
s=df.loc[df.Col1==1,'Col1']
idx=s.index
s=s.reindex(idx.repeat(N))
s.index=(idx.values+np.arange(N)[:,None]).ravel('F')

df.Col2.update(s)
df
    Col1  Col2
0      0     0
1      1     1
2      0     1
3      0     1
4      0     1
5      0     0
6      0     0
7      0     0
8      1     1
9      0     1
10     0     1
11     0     1
12     0     0
13     0     0
14     0     0
15     0     0
16     0     0
17     1     1
18     0     1
19     0     1
20     0     1
BENY
  • 317,841
  • 20
  • 164
  • 234