
Suppose I have a dataframe like this:

        ID      0   1   2   3   4   5   6   7   8   ... 81  82  83  84  85  86  87  88  89  90  total  day_90
-------------------------------------------------------------------------------------------------------------
0       A       2   21  0   18  3   0   0   0   2   ... 0   0   0   0   0   0   0   0   0   0    156   47
1       B       0   20  12  2   0   8   14  23  0   ... 0   0   0   0   0   0   0   0   0   0    231   35
2       C       0   38  19  3   1   3   3   7   1   ... 0   0   0   0   0   0   0   0   0   0     78   16
3       D       3   0   0   1   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0      5   3

where the last column [day_90] holds the index of the column ([0] - [90]) at which the cumulative sum reaches 90% of [total] for that row. To clarify, take the first row as an example: by column 47, ID A has accumulated 90% of the 156 events it will reach over the 90 days.
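(For context, assuming [day_90] is derived exactly as described, it can be computed with a row-wise cumulative sum. A minimal sketch on made-up data; the tiny frame below is an assumption, not the real dataset:)

```python
import pandas as pd

# Hypothetical mini-frame with day columns 0..3 (made-up data)
df = pd.DataFrame({0: [2, 0], 1: [21, 20], 2: [0, 12], 3: [18, 2]},
                  index=['A', 'B'])

cum = df.cumsum(axis=1)            # running total per row
total = df.sum(axis=1)             # this would be the [total] column
# first column where the running total reaches 90% of the row total
day_90 = cum.ge(0.9 * total, axis=0).idxmax(axis=1)
```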

What I need is: for each row, count the length of the first sequence of 0s that is longer than 7 (or any arbitrary predefined number). So, for example: for the first row, I want to know how long the first sequence of zeros after column 47 is, but only if the sequence exceeds 7 zeros in a row. If there are 6 zeros and then a non-zero, I don't want to count it.

Finally, I want to store this result in a new column after [day_90]. So if ID A has a sequence of 10 zeros right after column 47, I want to add a new column [0_sequence] that holds the value of 10 for that ID.

I really have no idea where to start. Any help is appreciated =)

jmtb28
RafaJM

2 Answers


Your problem is basically a variant of the island-and-gap problem: a non-zero value starts a new "island" while a 0 extends the current island. And you want to find the first island that is at least a certain size. Before I answer your question, let me walk you through a minified version of the problem.

Let's say you have a Series:

>>> a = pd.Series([0,0,0,13,0,0,4,12,0,0])
0     0
1     0
2     0
3    13
4     0
5     0
6     4
7    12
8     0
9     0

And you want to find the length of the first sequence of 0s that is at least 3 elements long. Let's first assign them into "islands":

# Every time the number is non-zero, a new "island" is created
>>> b = (a != 0).cumsum()
0    0  <-- island 0
1    0
2    0
3    1  <-- island 1
4    1
5    1
6    2  <-- island 2
7    3  <-- island 3
8    3
9    3

For each island, we are only interested in elements that are equal to 0:

>>> c = b[a == 0]
0    0
1    0
2    0
4    1
5    1
8    3
9    3

Now let's determine the size of each island:

>>> d = c.groupby(c).count()
0    3  <-- island 0 is of size 3
1    2  <-- island 1 is of size 2
3    2  <-- island 3 is of size 2
dtype: int64

And filter for islands whose size >= 3:

>>> e = d[d >= 3]
0    3

The answer is the first element of e (island 0, size 3) if e is not empty. Otherwise, there's no island meeting our criteria.
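Putting the steps above together, the whole check fits in a few lines (same toy Series as above):

```python
import pandas as pd

a = pd.Series([0, 0, 0, 13, 0, 0, 4, 12, 0, 0])
n = 3

b = (a != 0).cumsum()         # label the islands
c = b[a == 0]                 # keep only the zeros
d = c.groupby(c).count()      # size of each island
e = d[d >= n]                 # islands of size >= n

first = 0 if len(e) == 0 else e.iloc[0]   # -> 3 for this Series
```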


First Attempt

And applying it to your problem:

def count_sequence_length(row, n):
    """Return of the length of the first sequence of 0
    after the column in `day_90` whose length is >= n
    """
    if row['day_90'] + n > 90:
        return 0
    
    # The columns after `day_90`
    idx = np.arange(row['day_90']+1, 91)

    a = row[idx]
    b = (a != 0).cumsum()
    c = b[a == 0]
    d = c.groupby(c).count()
    e = d[d >= n]
    
    return 0 if len(e) == 0 else e.iloc[0]

df['0_sequence'] = df.apply(count_sequence_length, n=7, axis=1)

Second Attempt

The above version is nice, but slow because it calculates the size of every island. Since you only care about the size of the first island meeting the criteria, a simple for loop works much faster:

def count_sequence_length_2(row, n):
    if row['day_90'] + n > 90:
        return 0
    
    size = 0
    for i in range(row['day_90']+1, 91):
        if row[i] == 0:
            # increase the size of the current island
            size += 1
        elif size >= n:
            # found the island we want. Search no more
            break
        else:
            # create a new island
            size = 0
    return size if size >= n else 0

df['0_sequence'] = df.apply(count_sequence_length_2, n=7, axis=1)
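As a quick sanity check of count_sequence_length_2, here is a made-up single-row frame (the toy data is an assumption, built so the run right after day 47 is exactly 10 zeros long):

```python
import numpy as np
import pandas as pd

def count_sequence_length_2(row, n):
    if row['day_90'] + n > 90:
        return 0
    size = 0
    for i in range(row['day_90'] + 1, 91):
        if row[i] == 0:
            size += 1            # extend the current island
        elif size >= n:
            break                # found a qualifying island
        else:
            size = 0             # start a new island
    return size if size >= n else 0

# One row with day columns 0..90: all ones, except a run of
# 10 zeros at columns 48..57, right after day_90 = 47
values = np.ones((1, 91), dtype=int)
values[0, 48:58] = 0
df = pd.DataFrame(values, columns=range(91))
df['day_90'] = [47]

df['0_sequence'] = df.apply(count_sequence_length_2, n=7, axis=1)
# df['0_sequence'].iloc[0] == 10
```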

This achieves a speed-up of 10-20x when I benchmark it.

Code Different
  • Wonderful explanation and it works exactly as it should! Takes a while to run in a large dataset, but not a problem at all. Thank you very much! – RafaJM Jan 03 '20 at 19:32
  • Thanks for the edit! Much faster. I am now trying to modify this to serve other needs, and I can't seem to make it work. If you could take a look at https://stackoverflow.com/questions/59618058/percentage-of-events-before-and-after-a-sequence-of-zeros-in-pandas-rows, I would really appreciate it. – RafaJM Jan 06 '20 at 19:47

Here is my solution, see the comments in the code:

import numpy as np, pandas as pd
import io

# Test data:
text="""  ID  0   1   2   3  4  5   6   7  8  day_90
        0  A  2  21   0  18  3  0   0   0  2       4
        1  B  0  20  12   2  0  8  14  23  0       5
        2  C  0  38  19   3  1  3   3   7  1       1
        3  D  3   0   0   1  0  0   0   0  0       0"""

df= pd.read_csv( io.StringIO(text),sep=r"\s+",engine="python")
#------------------------

# Convert some column names into integer:
cols= list(range(9))
df.columns= ["ID"]+ cols +["day_90"]

#----------
istart,istop= df.columns.get_loc(0), df.columns.get_loc(8)+1
# The first zero sequence must be longer than this:
lseq= 2

# The function for aggregating: this is the main calculation, 'r' is a row of 'df':
def zz(r):

     s= r.iloc[r.day_90+istart:istop] # take the day columns, starting at the one given by 'day_90'
     #--- Manipulate 's' to make possible using 'groupby' for getting different sequences:
     crit=s.eq(0)
     s= pd.Series(np.where(crit, np.nan, np.arange(len(s))),index=s.index)
     if np.isnan(s.iloc[0]):
       s.iloc[0]= 1
     s= s.ffill()
     s[~crit]= np.nan
     #---
     # get the sequences and their sizes:
     ssiz= s.groupby(s).size()
     return ssiz.iloc[0] if len(ssiz) and ssiz.iloc[0]>lseq else np.nan
#---

df["zseq"]= df.agg(zz,axis=1)

ID  0   1   2   3  4  5   6   7  8  day_90  zseq
0  A  2  21   0  18  3  0   0   0  2       4   3.0
1  B  0  20  12   2  0  8  14  23  0       5   NaN
2  C  0  38  19   3  1  3   3   7  1       1   NaN
3  D  3   0   0   1  0  0   0   0  0       0   NaN
kantal