Sum values of column by row and get the column that hits x % of total

Question

Okay, this is a bit difficult to describe. Basically, I have a dataframe like the following, where each column (from 0-90) represent a day, from day-0 to 90 days after day-90, and the numbers on each cell are sums of a certain feature for that ID in that day:

        ID      0   1   2   3   4   5   6   7   8   ... 81  82  83  84  85  86  87  88  89  90  total
------------------------------------------------------------------------------------------------------
0       A       2   21  0   18  3   0   0   0   2   ... 0   0   0   0   0   0   0   0   0   0    156
1       B       0   20  12  2   0   8   14  23  0   ... 0   0   0   0   0   0   0   0   0   0    231
2       C       0   38  19  3   1   3   3   7   1   ... 0   0   0   0   0   0   0   0   0   0     78
3       D       3   0   0   1   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0      5

What I need is the first one is a new column that displays the number of the column at which the sum of previous values hit 90% of the 'total' value. Example: For the first row, I want to know when (which day, somewhere between 0-90) the sum of values for that ID hit 90% of the total, which is 156. I would have a new column that would hold the value for the day that summed up to 90% of the total, something like this:

        ID      0   1   2   3   4   5   6   7   8   ... 81  82  83  84  85  86  87  88  89  90  total  day_90%
--------------------------------------------------------------------------------------------------------------
0       A       2   21  0   18  3   0   0   0   2   ... 0   0   0   0   0   0   0   0   0   0    156   54

convert the data to rows using the pandas `melt` function, instead of having them with columns. Then it becomes a simple matter of sorting and using `cumsum` function — usernamenotfound, Jan 02 '20 at 22:10
I'm actually quite lost for this one, I don't really know where to start. — RafaJM, Jan 02 '20 at 22:10

score 2 · Accepted Answer · answered Jan 02 '20 at 22:10

2

Try this:

df_pct = df.cumsum(axis=1).div(df.sum(axis=1), axis=0)

df['day_90%'] = (df_pct > .9).idxmax(axis=1)

Use idxmax to return the first time True where cumulative percent is greater than .9.

answered Jan 02 '20 at 22:10

Scott Boston

147,308
15
139
187

1

I had to add .loc[:, '0':'90'] since I added the total beforehand, but yes it works perfectly. Thanks a million =) – RafaJM Jan 03 '20 at 14:04
I added another question for a sort of sequence to this one, as it is another thing that is a bit out of my league. If you could take a look, here is the link for it: https://stackoverflow.com/questions/59581340/length-of-first-sequence-of-zeros-of-given-size-after-certain-column-in-pandas-d – RafaJM Jan 03 '20 at 15:40

Sum values of column by row and get the column that hits x % of total

1 Answers1