4

I have a dataframe like the following:

        ID      0   1   2   3   4   5   6   7   8   ... 81  82  83  84  85  86  87  88  89  90  total
-----------------------------------------------------------------------------------------------------
0       A       2   21  0   18  3   0   0   0   2   ... 0   0   0   0   0   0   0   0   0   0    156
1       B       0   20  12  2   0   8   14  23  0   ... 0   0   0   0   0   0   0   0   0   0    231
2       C       0   38  19  3   1   3   3   7   1   ... 0   0   0   0   0   0   0   0   0   0     78
3       D       3   0   0   1   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0      5

and I want to know the % of events (the numbers in the cells) before and after the first sequence of zeros of length n in each row. This problem started as another question found here: Length of first sequence of zeros of given size after certain column in pandas dataframe. I am trying to modify the code from that question to do what I need, but I keep getting errors and can't find the right way. This is what I have tried:

def func(row, n):
    """Returns the number of events before the 
    first sequence of 0s of length n is found
    """

    idx = np.arange(0, 91)

    a = row[idx]
    b = (a != 0).cumsum()       # running count of non-zero events
    c = b[a == 0]               # that count at each zero position (shared within a run of 0s)
    d = c.groupby(c).count()    # length of each run of 0s

    #in case there is no sequence of 0s with length n
    try:
        e = c[c >= d.index[d >= n][0]]   # zero positions from the first run of length >= n onwards
        f = str(e.index[0])              # column label where that run starts
    except IndexError:
        e = [90]
        f = str(e[0])

    idx_sliced = np.arange(0, int(f)+1)
    a = row[idx_sliced]

    if (int(f) + n > 90):
        perc_before = 100
    else:
        perc_before = a.cumsum().tail(1).values[0]/row['total']

    return perc_before

As is, the error I get is:

---> perc_before = a.cumsum().tail(1).values[0]/row['total']
TypeError: ('must be str, not int', 'occurred at index 0')

Finally, I would apply this function to a dataframe and return a new column with the % of events before the first sequence of n 0s in each row, like this:

        ID      0   1   2   3   4   5   6   7   8   ... 81  82  83  84  85  86  87  88  89  90  total  %_before
---------------------------------------------------------------------------------------------------------------
0       A       2   21  0   18  3   0   0   0   2   ... 0   0   0   0   0   0   0   0   0   0    156   43
1       B       0   20  12  2   0   8   14  23  0   ... 0   0   0   0   0   0   0   0   0   0    231   21
2       C       0   38  19  3   1   3   3   7   1   ... 0   0   0   0   0   0   0   0   0   0     78   90
3       D       3   0   0   1   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0      5   100

If you want to try solving this, you can test with this sample input:

a = pd.Series([1,1,13,0,0,0,4,0,0,0,0,0,12,1,1])
b = pd.Series([1,1,13,0,0,0,4,12,1,12,3,0,0,5,1])
c = pd.Series([1,1,13,0,0,0,4,12,2,0,5,0,5,1,1])
d = pd.Series([1,1,13,0,0,0,4,12,1,12,4,50,0,0,1])
e = pd.Series([1,1,13,0,0,0,4,12,0,0,0,54,0,1,1])

df = pd.DataFrame({'0':a, '1':b, '2':c, '3':d, '4':e})
df = df.transpose()
RafaJM
  • Including a dataframe which is truncated (`...`) is not very helpful. – Erfan Jan 06 '20 at 20:20
  • Can you include some sample input? `df.head().to_clipboard(False)` and paste it to your question – Code Different Jan 06 '20 at 20:21
  • I can't do this because I am running this notebook on a managed SageMaker AWS instance, but I added some sample input. – RafaJM Jan 06 '20 at 20:30
  • I might have just found something out. Apparently, the first element of the 'a' series at the end of the loop is the str from the ID column, which is nuts because I never assigned it that column. – RafaJM Jan 06 '20 at 20:59
  • You need to convert `idx` to an array of strings: `idx = np.arange(0, 91).astype(str)`, then use `loc` to extract the columns: `a = row.loc[idx]`. The reason is integer vs. label indexing: `row[0]` gets the first value in `row`, while `row.loc[0]` gets the value with label 0 – Code Different Jan 07 '20 at 18:42
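
A minimal sketch of the fix from the comment above, assuming the event columns of the real frame are labelled with the strings '0' through '90' (the small `row` below is made up for illustration):

import numpy as np
import pandas as pd

# a single row, as DataFrame.apply(..., axis=1) passes it to func;
# the event columns are labelled with strings, not integers
row = pd.Series({'ID': 'A', '0': 2, '1': 21, '2': 0, 'total': 23})

# build string labels and select them with .loc (label based), so the 'ID'
# column is never picked up and the arithmetic stays numeric
idx = np.arange(0, 3).astype(str)   # np.arange(0, 91).astype(str) on the real frame
a = row.loc[idx]
print(a)                            # '0' -> 2, '1' -> 21, '2' -> 0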

4 Answers

1

Give this a try:

def percent_before(row, n, ncols):
    """Return the percentage of activities happen before
    the first sequence of at least `n` consecutive 0s
    """
    start_index, i, size = 0, 0, 0
    for i in range(ncols):
        if row[i] == 0:
            # increase the size of the island
            size += 1
        elif size >= n:
            # found the island we want
            break
        else:
            # start a new island
            # row[start_index] is always non-zero
            start_index = i
            size = 0

    if size < n:
        # didn't find the island we want
        return 1
    else:
        # get the sum of activities that happen
        # before the island
        idx = np.arange(0, start_index + 1).astype(str)
        return row.loc[idx].sum() / row['total']

df['percent_before'] = df.apply(percent_before, n=3, ncols=15, axis=1)

Result:

   0  1   2  3  4  5  6   7  8   9  10  11  12  13  14  total  percent_before
0  1  1  13  0  0  0  4   0  0   0   0   0  12   1   1     33        0.454545
1  1  1  13  0  0  0  4  12  1  12   3   0   0   5   1     53        0.283019
2  1  1  13  0  0  0  4  12  2   0   5   0   5   1   1     45        0.333333
3  1  1  13  0  0  0  4  12  1  12   4  50   0   0   1     99        0.151515
4  1  1  13  0  0  0  4  12  0   0   0  54   0   1   1     87        0.172414

For the full frame, call apply with ncols=91.
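
The result above also shows a total column that the sample frame built in the question does not have; presumably it was added before calling apply, for example (an assumption, not part of the original answer):

# match the string labels that percent_before selects with row.loc[idx]
df.columns = df.columns.astype(str)
# assume the totals are simply the row sums of the event columns
df['total'] = df.sum(axis=1)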

Code Different
  • This outputs the error message: `("None of [Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')] are in the [index]", 'occurred at index 0')`. The column indexes are strings. Is this the reason why? – RafaJM Jan 07 '20 at 17:18
  • 1
    Yes, that's the case. Subscripting a series (`row[i]`) is position based, while `row.loc[i]` (like in the answer) is label based. There's no label `0` the number, but there is a label `'0'` the string. See my edited answer – Code Different Jan 07 '20 at 17:28
1

Another possible solution:

def get_vals(df, n):
    # work on the transposed frame: every original row becomes a column
    df, out = df.T, []
    for col in df.columns:
        # True at the start of each run of consecutive equal values
        diff_to_previous = df[col] != df[col].shift(1)
        # per run: index of its first element ('idxmin') and its length ('size')
        g = df.groupby(diff_to_previous.cumsum())[col].agg(['idxmin', 'size'])

        # values at the start of every run that is at least n long
        vals = df.loc[g.loc[g['size'] >= n, 'idxmin'].values, col]
        if len(vals):
            # share of events before the first such run of zeros
            out.append(df.loc[np.arange(0, vals[vals == 0].index[0]), col].sum() / df[col].sum())
        else:
            # no run of n zeros: everything happens "before"
            out.append(1.0)
    return out

df['percent_before'] = get_vals(df, n=3)
print(df)

Prints:

   0  1   2  3  4  5  6   7  8   9  10  11  12  13  14  percent_before
0  1  1  13  0  0  0  4   0  0   0   0   0  12   1   1        0.454545
1  1  1  13  0  0  0  4  12  1  12   3   0   0   5   1        0.283019
2  1  1  13  0  0  0  4  12  2   0   5   0   5   1   1        0.333333
3  1  1  13  0  0  0  4  12  1  12   4  50   0   0   1        0.151515
4  1  1  13  0  0  0  4  12  0   0   0  54   0   1   1        0.172414
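
The key step here is the shift/cumsum run-numbering idiom. A minimal sketch of what it produces on a single row of the sample input (the names and n = 3 below are only for illustration):

import pandas as pd

# first row of the sample input (series `a` from the question)
s = pd.Series([1, 1, 13, 0, 0, 0, 4, 0, 0, 0, 0, 0, 12, 1, 1])
n = 3

# True wherever a value differs from its predecessor, i.e. at the start of a run
diff_to_previous = s != s.shift(1)

# the cumulative sum turns those flags into a run id shared by every member of a run
run_id = diff_to_previous.cumsum()

# per run: index of its first element ('idxmin' works because all values of a
# run are equal) and its length ('size')
runs = s.groupby(run_id).agg(['idxmin', 'size'])

# runs that are long enough and whose value is zero; the first entry's 'idxmin'
# is where the first qualifying sequence of zeros starts (index 3 here)
zero_runs = runs[(runs['size'] >= n) & (s[runs['idxmin']].values == 0)]
print(zero_runs)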
Andrej Kesely
1

As one of the comments on the previous question was about speed, I guess you can try to vectorize the problem. I used this dataframe to try it (slightly different from your original input):

  ID  0   1   2   3  4  5   6   7  8  total
0  A  2  21   0  18  3  0   0   0  2     46
1  B  0   0  12   2  0  8  14  23  0     59
2  C  0  38  19   3  1  3   3   7  1     75
3  D  3   0   0   1  0  0   0   0  0      4

The idea is to chain commands to build a mask: mark where the data is not equal to 0, take the cumsum along the column axis, and check where the diff over n columns is equal to 0 (meaning the count of non-zero cells did not grow, i.e. n consecutive zeros end there). To find the first such window, use cummax so that all the columns after it (row-wise) are considered True. Mask the original dataframe with the opposite of this mask, sum along the columns and divide by total. For example, with n=2:

n=2
df['%_before'] = df[~(df.ne(0).cumsum(axis=1).diff(n, axis=1)[range(9)]
                        .eq(0).cummax(axis=1))].sum(axis=1)/df.total
print (df)
  ID  0   1   2   3  4  5   6   7  8  total  %_before
0  A  2  21   0  18  3  0   0   0  2     46  0.956522
1  B  0   0  12   2  0  8  14  23  0     59  0.000000
2  C  0  38  19   3  1  3   3   7  1     75  1.000000
3  D  3   0   0   1  0  0   0   0  0      4  0.750000

In your case, you need to replace range(9) with range(91) to get all of your columns.
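
Broken into named intermediate steps, the same chain can be read as follows (a sketch that rebuilds the 9-column example frame shown above; the variable names are only for illustration):

import pandas as pd

# the example frame from above
df = pd.DataFrame({
    'ID': list('ABCD'),
    0: [2, 0, 0, 3], 1: [21, 0, 38, 0], 2: [0, 12, 19, 0],
    3: [18, 2, 3, 1], 4: [3, 0, 1, 0], 5: [0, 8, 3, 0],
    6: [0, 14, 3, 0], 7: [0, 23, 7, 0], 8: [2, 0, 1, 0],
})
df['total'] = df[range(9)].sum(axis=1)

n = 2
# running count of non-zero cells per row
counts = df.ne(0).cumsum(axis=1)
# a 0 in the n-column difference means the count did not grow, i.e. a window
# of n consecutive zeros ends at that column
ends_zero_window = counts.diff(n, axis=1)[range(9)].eq(0)
# cummax makes everything from the first such window onwards True (row-wise)
after_first = ends_zero_window.cummax(axis=1)
# mask those columns out, sum what is left and divide by total
df['%_before'] = df[~after_first].sum(axis=1) / df.total
print(df)   # matches the %_before column shown above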

Ben.T
-1

You can do this using the rolling method.

For your example input, given that the number of zeros is 5, we can use:

df.rolling(window=5, axis=1).apply(lambda x : np.sum(x))

The output would look like:

    0   1   2   3     4     5     6     7     8     9    10    11    12    13    14
0 NaN NaN NaN NaN  15.0  14.0  17.0   4.0   4.0   4.0   4.0   0.0  12.0  13.0  14.0
1 NaN NaN NaN NaN  15.0  14.0  17.0  16.0  17.0  29.0  32.0  28.0  16.0  20.0   9.0
2 NaN NaN NaN NaN  15.0  14.0  17.0  16.0  18.0  18.0  23.0  19.0  12.0  11.0  12.0
3 NaN NaN NaN NaN  15.0  14.0  17.0  16.0  17.0  29.0  33.0  79.0  67.0  66.0  55.0
4 NaN NaN NaN NaN  15.0  14.0  17.0  16.0  16.0  16.0  16.0  66.0  54.0  55.0  56.0

Looking at the output, it's very easy to see that in the first row the value for column 11 is 0, which means that starting at position 7 you have 5 zeros. Since no other row contains a 0 in this output, none of the other rows have 5 contiguous zeros in them.
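
One possible way to turn that observation into the requested percentage, built on the sample frame from the question (a sketch; the helper names are made up, and rows without a qualifying window get 100%):

import numpy as np

n = 5
# same rolling sums as above (on recent pandas the axis keyword is deprecated;
# df.T.rolling(n).sum().T is an equivalent spelling)
roll = df.rolling(window=n, axis=1).sum()

# a 0 at column j means that columns j-n+1 .. j are all zero
is_zero_window = roll.eq(0)
has_window = is_zero_window.any(axis=1)

# column position where the first all-zero window ends (argmax picks the first
# True; the value is meaningless for rows without a window, hence has_window)
end_pos = is_zero_window.to_numpy().argmax(axis=1)
start_pos = end_pos - n + 1

# sum of events strictly before that window, divided by the row total
col_pos = np.arange(df.shape[1])
before = np.where(col_pos < start_pos[:, None], df.to_numpy(), 0).sum(axis=1)
pct_before = np.where(has_window, before / df.to_numpy().sum(axis=1), 1.0)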

Roshan Santhosh