2

I have four columns with values: 'A', 'B', 'C' and 'D' and one column containing either of the four letters.

           Letter    A    B  C    D           max
12063289        D    5    9  0   39             0
12063290        D    3    25 0   79             0
12063291        B    0  232  0    0             0
12063292        A  351    0  0    0             0
12063293        D    0    0  0  394             0

Based on the value of the letter column, I'd like to compute the maximum value of the other three columns.

           Letter    A    B  C    D           max
12063289        D    5    9  0   39             9
12063290        D    3    25 0   79             25
12063291        B    0  232  0   16             16
12063292        A  351    0  200  0             200
12063293        D    0    0  0  394             0

Say the above data is stored in a variable df, I have tried to do the following:

import numpy as np
import pandas as pd

columns = {'A':['B','C','D'], 
          'B':['A','C','D'],  
          'C':['A','B','D'],
          'D':['A','B','C']}
for letter in ['A', 'B', 'C', 'D']:
    mask = df.loc[df['Letter']==letter]
    np.max(df[mask][columns[letter]], out=df[mask]['max'])

I basically want the max function to work on only the relevant columns of the masked dataframe and write back to the 'max' column of the original dataframe (df) in exactly the right positions but pandas complains in the last line:

ValueError: Must pass DataFrame with boolean values only

The question is how do I target exactly those cells of those rows to receive the output of the max() function in order to not use unnecessary space (I can do this with the apply function but it takes up enormous amounts of space that I don't have).

tobsecret
  • 2,442
  • 15
  • 26
  • So, `Letter` has the IDs of the max ones from each row? – Divakar Jan 27 '17 at 22:50
  • `Letter` has the IDs of the column that is not relevant for calculating the max in each respective row. In the first row the value if `Letter` is 'D', hence the max value of that column should be calculated from columns 'A', 'B' and 'C' . – tobsecret Jan 28 '17 at 18:11

1 Answers1

1

apply

cols = list('ABCD')
df.apply(lambda x: x.loc[cols].drop(x.Letter).max(), 1)

12063289      9
12063290     25
12063291     16
12063292    200
12063293      0
dtype: int64

set_index + query + groupby

d1 = df.set_index(
    'Letter', append=True
)[list('ABCD')].rename_axis('Col', 1).stack().to_frame('value')
d1.query('Letter != Col').groupby(level=0).value.max()

12063289      9
12063290     25
12063291     16
12063292    200
12063293      0
dtype: int64
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Thanks for your reply, the df.apply solution reduces the memory usage by half in comparison to my naive solution. But it still uses about 8 times the size of the original dataframe, if I use the first of the two solutions you provided, is there a way to improve with np.max's out= parameter? – tobsecret Jan 30 '17 at 15:12