0

Having a data frame df1:

         DP 1     DP 2    DP 3   DP 4     DP 5    DP 6    DP 7   DP 8    DP 9    DP 10
OP 1    357848  1124788 1735330 2218270 2745596 3319994 3466336 3606286 3833515 3901463
OP 2    352118  1236139 2170033 3353322 3799067 4120063 4647867 4914039 5339085 
OP 3    290507  1292306 2218525 3235179 3985995 4132918 4628910 4909315     
OP 4    310608  1418858 2195047 3757447 4029929 4381982 4588268         
OP 5    443160  1136350 2128333 2897821 3402672 3873311             
OP 6    396132  1333217 2180715 2985752 3691712                 
OP 7    440832  1288463 2419861 3483130                     
OP 8    359480  1421128 2864498                         
OP 9    376686  1363294                             
OP 10   344014                                  

I would like to calculate the sum of every column by restricting the row number.

To calculate sum of first column data, Sum(DP1) where row size should be 10-1

To calculate sum of second column data, Sum(DP2) where row size should be 10-2

To calculate sum of Third column data, Sum(DP3) where row size should be 10-3

and so on..

Output is like this:

    3327371  10251249  15047844  18447791  17963259  15954957  12743113  8520325  3833515

I tried to do with for loop:

>>dataframe_len = len(df1.columns)
>>print(dataframe_len)
   10
>>for i in range(0,10):
     #Here i need to find the sum of each column 
     #sum('col')(row size is 10-i)

it is not about DP1 to DP10(10 column) there are too many columns are there.

Thank You For Your Time :)

  • 1
    Does `10-1` mean `10 to 1` or `10 minus 1`? Ie are you trying the skip the top rows or the bottom rows? Based on your expected output, it looks like skipping the bottom rows; or rather, skipping the last non-NA value. – aneroid Apr 10 '21 at 11:33

2 Answers2

1

Assuming you want it as per your expected output and not according to your description, sum() each column after dropping NA values and then skipping the last value:

df.apply(lambda col: col.dropna()[:-1].sum())

Output:

DP 1      3327371.0
DP 2     10251249.0
DP 3     15047844.0
DP 4     18447791.0
DP 5     17963259.0
DP 6     15954957.0
DP 7     12743113.0
DP 8      8520325.0
DP 9      3833515.0
DP 10           0.0

Side note: Your sums are not rows 10-1, 10-2, 10-3 etc. They are rows 9-1, 8-1, 7-1. ie. You're skipping the last non-NA value of each column, not the top rows.

Ex df['DP 1'].sum() is 3671385 but skipping the last row df['DP 1'][:-1].sum() is 3327371 which matches with your expected output. For DP2: df['DP 2'].sum() is 11614543 and df['DP 2'].dropna()[:-1].sum() is 10251249 (your expected val) but df['DP 2'][2:10].sum() is 9253616.

aneroid
  • 12,983
  • 3
  • 36
  • 66
  • 1
    ah even better, +1. I overcomplicated with `shift(-1).last_valid_index()` – tdy Apr 10 '21 at 11:24
  • 1
    I started with what you had put (but couldn't get it to work since the shift amount per column is different). Then realised we're just skipping the last val. – aneroid Apr 10 '21 at 11:25
0

I think you can make use of the information in column name when using apply()

def sum_row(col):
    t = int(col.name.split(' ')[-1])
    return col.iloc[:-t].sum()

df_ = df.apply(sum_row)
# print(df_)

DP 1      3327371.0
DP 2     10251249.0
DP 3     15047844.0
DP 4     18447791.0
DP 5     17963259.0
DP 6     15954957.0
DP 7     12743113.0
DP 8      8520325.0
DP 9      3833515.0
DP 10           0.0
dtype: float64
Ynjxsjmh
  • 28,441
  • 6
  • 34
  • 52