
How do I down-sample (linearly) one dataframe of counts at some distribution of diameters, logged at the lower bound of each bin (so the first entry is 0 counts between 296.54 and 303.14 nm, the second entry is 1 count between 303.14 and 311.88 nm, etc.),

   296.54  303.14  311.88  320.87  ...  359.49  369.86  380.52  391.49
a       0       1       2       3  ...       7       8       9      10
b      11      12      13      14  ...      18      19      20      21
c      22      23      24      25  ...      29      30      31      32
d      33      34      35      36  ...      40      41      42      43
e      44      45      46      47  ...      51      52      53      54
f      55      56      57      58  ...      62      63      64      65
g      66      67      68      69  ...      73      74      75      76
h      77      78      79      80  ...      84      85      86      87
i      88      89      90      91  ...      95      96      97      98
j      99     100     101     102  ...     106     107     108     109

to a new dataframe by resampling the counts to a coarser set of diameters, like this (the first entry is the counts from 300 up to 325 nm, etc.):

         300                          325               350               375
a       4.34  interp sum btwn 325 and 350  btwn 350 and 375  btwn 375 and 400
b  and so on                                                                 
c                                                                            
d                                                                            
e                                                                            
f                                                                            
g                                                                            
h                                                                            
i                                                                            
j                                                                            

Is there a Pandas function, like interpolate, that downsamples by a linear sum rather than upsampling?

I tried something like this:

    import string
    import numpy as np
    import pandas as pd

    test_array = np.arange(110).reshape(10, 11)
    index_list = list(string.ascii_lowercase)[:10]
    df = pd.DataFrame(test_array, index=index_list)
    df.columns = [296.54, 303.14, 311.88, 320.87, 330.12, 339.63,
                  349.42, 359.49, 369.86, 380.52, 391.49]
    new_columns = [300, 325, 350, 375]

    new_df = df.groupby(new_columns, axis=1).sum()

But that doesn't work; it raises the obvious KeyError, since the new labels don't exist in the dataframe's columns. One proposed solution was to group by index location, but that doesn't interpolate across partially overlapping bins.

many thanks

bernski

1 Answer


groupby is the way to go:

df.groupby(np.arange(df.shape[1])//3, axis=1).sum()

Output:

     0    1    2    3
a    3   12   21   19
b   36   45   54   41
c   69   78   87   63
d  102  111  120   85
e  135  144  153  107
f  168  177  186  129
g  201  210  219  151
h  234  243  252  173
i  267  276  285  195
j  300  309  318  217

Update, per comment:

new_columns = np.array([0, 2, 7, 10])
new_col_idx = np.searchsorted(new_columns, np.arange(df.shape[1]))

df.groupby(new_col_idx, axis=1).sum()

Output:

    0    1    2    3
a   0    3   25   27
b  11   25   80   60
c  22   47  135   93
d  33   69  190  126
e  44   91  245  159
f  55  113  300  192
g  66  135  355  225
h  77  157  410  258
i  88  179  465  291
j  99  201  520  324
Quang Hoang
  • What if the columns are not an even division of the primary ones? Like new_columns=np.array([0,2,7,10]) new_df=test_df.groupby(new_columns, axis=1).sum() – bernski Feb 22 '21 at 20:30
  • I think you missed the point. The new set of diameters is an entirely different array of values. It's not like I'm adding every three columns. Imagine the first set's column values as 40 values between 0 and 20, every 0.5, and the second set being 0,1,3,7,13,18. They don't match up evenly at all. – bernski Feb 22 '21 at 20:55
  • please see edits. You can see that the diameters don't line up nicely at all. I'm hoping I don't have to do a grotesque loop. – bernski Feb 22 '21 at 21:43
  • Did you try my new code? Is it not what you were expecting? – Quang Hoang Feb 22 '21 at 21:49
  • sadly no. it does an index look-up, due to the searchsorted in the array values. But what I really want is for the values to be binned at the new resolution, which would entail some interpolation to the coarser values. So hard to put into words – bernski Feb 22 '21 at 21:53
  • What do you expect in column `325`? My code returns the sum of columns `303.14, 311.88, 320.87`. – Quang Hoang Feb 22 '21 at 21:55
  • Column 325 would have counts from 325 to 350. So that would include part of the counts registered between 320.87 and 330.12, the counts at 330.12, the counts at 339.63 and part of the counts between 349.42 and 359.49. You see what I mean by interpolation? – bernski Feb 22 '21 at 22:00
  • Oh, so you mean. Sounds like `cumsum()`, then `reindex` and `interpolate`. All along `axis='columns'`. – Quang Hoang Feb 22 '21 at 22:05
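The `cumsum` → `reindex` → `interpolate` idea from that last comment can be sketched as follows, assuming the question's example data and an assumed upper edge of 400 nm for the last old bin (which the question leaves unspecified). The cumulative count is piecewise linear in diameter, so interpolating it at the new edges and differencing gives linearly apportioned counts per new bin:

```python
import string

import numpy as np
import pandas as pd

# Rebuild the question's example dataframe.
df = pd.DataFrame(np.arange(110).reshape(10, 11),
                  index=list(string.ascii_lowercase)[:10])
df.columns = [296.54, 303.14, 311.88, 320.87, 330.12, 339.63,
              349.42, 359.49, 369.86, 380.52, 391.49]

# New bin edges; 400.0 is an assumed upper edge for the final bin.
new_edges = [300.0, 325.0, 350.0, 375.0, 400.0]

# Cumulative counts, relabelled by each bin's *upper* edge: the count in
# column i covers [edge_i, edge_{i+1}), so the running total at column i
# is everything below edge_{i+1}.
cum = df.cumsum(axis=1)
cum.columns = list(df.columns[1:]) + [400.0]   # assumed last upper edge
cum.insert(0, df.columns[0], 0.0)              # zero anchor at the lowest edge

# Reindex to the union of old and new edges, interpolate linearly on the
# edge values themselves, then keep only the new edges.
all_edges = np.union1d(cum.columns, new_edges)
cum_interp = (cum.T.reindex(all_edges)
                 .interpolate(method='index')
                 .T[new_edges])

# Differences of cumulative counts = counts inside each new bin,
# labelled by the bin's lower edge.
new_df = cum_interp.diff(axis=1).iloc[:, 1:]
new_df.columns = new_edges[:-1]
print(new_df.round(2))
```

Row `a`'s first entry comes out as 4.34, matching the worked value in the question (3 counts below 320.87, plus (325−320.87)/(330.12−320.87) of the 3 counts in the 320.87–330.12 bin). The transposed `reindex`/`interpolate` is used so the edge values, rather than positions, drive the interpolation.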