2

I have a DataFrame as follows.

d = {}
d['Model'] = ['M1', 'M2'] * 4 * 3 * 5
d['Support'] = ['S1', 'S1', 'S2', 'S2'] * 2 * 3 * 5
d['Group'] = ['G1', 'G2', 'G2'] * 2 * 4 * 5
d['Case'] = ['C11', 'C21', 'C22', 'C31', 'C32'] * 2 * 4 * 3
val1 = []
val2 = []
random.seed(0)
for i in range (2*4*3*5):
    val1.append(random.randrange(-10, 11))
    val2.append(random.randrange(-10, 11))
d['val1'] = val1
d['val2'] = val2
df = pd.DataFrame(d)

enter image description here

I am looking for the maximum value of the sum of the positive values for each Support, Group and Case and with reference to the Model and the sum of the minimum values of the sum of the negative values.

This is my attempt:

df1 = df.groupby(['Model', 'Support', 'Group', 'Case'])[['val1', 'val2']].sum()

df2 = df1.groupby(['Model', 'Support', 'Group', 'Case']).agg([
    ('max' , lambda x : x[x > 0].sum()),
    ('min' , lambda x : x[x < 0].sum())
])

df3 = df2.groupby(['Support', 'Group', 'Case']).agg([
    ('max' , max),
    ('Model', lambda x: x.idxmax()[0]),
    ('min' , min),
    ('Model', lambda x: x.idxmin()[0]),
])

(See edition of 14 Jan 2021)

enter image description here

The results of the last DataFrame df3 are fine, but this is not the output format I want. I need to filter the last DataFrame df3 to get the results this way:

enter image description here

Edited 14 Jan 2021

Reviewing the @aneroid results I have seen that the df3 values are not as expected.

For Support S1, Group G1, Case C11 this will be the result:

df_M1S1G1_1 = df.loc[df['Model'] == 'M1']
df_M1S1G1_2 = df_M1S1G1_1.loc[df['Support'] == 'S1']
df_M1S1G1_3 = df_M1S1G1_2.loc[df['Group'] == 'G1']

    Model   Support Group   Case    val1    val2
0   M1      S1      G1      C11     2       3
12  M1      S1      G1      C22     -8      0
24  M1      S1      G1      C32     5       0
36  M1      S1      G1      C21     7       -4
48  M1      S1      G1      C31     -6      -9
60  M1      S1      G1      C11     10      0
72  M1      S1      G1      C22     10      -4
84  M1      S1      G1      C32     -8      -3
96  M1      S1      G1      C21     -5      0
108 M1      S1      G1      C31     -9      7

df_M2S1G1_1 = df.loc[df['Model'] == 'M2']
df_M2S1G1_2 = df_M2S1G1_1.loc[df['Support'] == 'S1']
df_M2S1G1_3 = df_M2S1G1_2.loc[df['Group'] == 'G1']

    Model   Support Group   Case    val1    val2
9   M2      S1      G1      C32     -2      7
21  M2      S1      G1      C21     -10     -8
33  M2      S1      G1      C31     -1      7
45  M2      S1      G1      C11     9       -2
57  M2      S1      G1      C22     -8      0
69  M2      S1      G1      C32     10      7
81  M2      S1      G1      C21     -10     7
93  M2      S1      G1      C31     5       8
105 M2      S1      G1      C11     -6      7
117 M2      S1      G1      C22     -8      -10

Therefore:

0  M1 S1 G1 C11 val1 = 2  val2 = 3
60 M1 S1 G1 C11 val1 = 10 val2 = 0

val1_sum_pos = 12
val1_sum_neg = 0

val2_sum_pos = 3
val2_sum_neg = 0

45  M2 S1 G1 C11 val1 = 9  val2 = -2
105 M2 S1 G1 C11 val1 = -6 val2 = 7

val1_sum_pos = 9
val1_sum_neg = -6

val2_sum_pos = 7
val2_sum_neg = -2

And as a result:

                                                      val1                     
     val2
                            max  Model_max  min  Model_min  max  Model_max  min  Model_min
    Support  Group  Case    
    S1       G1     C11     12   M1         -6   M2         7     M2        -2      
M2

These results are in line with the solution proposed by @aneroid.

aneroid
  • 12,983
  • 3
  • 36
  • 66
PedroBiel
  • 489
  • 1
  • 8
  • 21
  • Just to be clear: do you want the Model which give the min and max _per_ `val` or do you want to combine `val1` and `val2` when finding the min/max? Also, you've said _"looking for the maximum value of the sum of the positive values"_ and _"sum of the minimum values of the sum of the negative values"_ but then you've applied `sum()` to the entire groupby without distinguishing between +ve or -ve values. To me, your second step `df2 = df1.groupby(...).agg(...)` should actually be your first step. – aneroid Jan 13 '21 at 13:48
  • `val1`and `val2`are independent of each other. I'll check your last thought. – PedroBiel Jan 13 '21 at 13:56

3 Answers3

2

Here's a different approach from my previous answer. Requires a little more "prep-code" but fewer steps to massage the dataframe afterwards.

As before, this is based on starting at your step2 (skipping step1):

df1 = df.groupby(['Model', 'Support', 'Group', 'Case']).agg([
    ('sum_pos', lambda x: x[x > 0].sum()),
    ('sum_neg', lambda x: x[x < 0].sum())
])

First, a template dataframe for the results:

res = pd.DataFrame(
    columns=pd.MultiIndex.from_product(
        [('val1', 'val2'),
         ('sum_pos_max', 'Model_max', 'sum_neg_min', 'Model_min'),
]))

This function will be used for apply (again), with the difference that we'll first create an empty DataFrame with the column structure needed. (And optionally, can be modified to use with a for group in grouped loop, without using apply, where each record is appended to it.)

def model_minmax_opt(gr):
    # start with a deep copy of the result template
    tf = res.copy(deep=True)
    for val in ['val1', 'val2']:
        max_pos = gr[(val, 'sum_pos')].idxmax()
        min_neg = gr[(val, 'sum_neg')].idxmin()
        tf.loc[0, [(val, 'sum_pos_max'), (val, 'Model_max')]] = gr.loc[max_pos, [(val, 'sum_pos'), ('Model', '')]].values
        tf.loc[0, [(val, 'sum_neg_min'), (val, 'Model_min')]] = gr.loc[min_neg, [(val, 'sum_neg'), ('Model', '')]].values
    return tf

Then create a grouping and apply it:

group = df1.reset_index().groupby(['Support', 'Group', 'Case'])
df2 = group.apply(model_minmax_opt)
df2.reset_index(level=3, drop=True, inplace=True)  # get rid of the added 0-index

Result df2, same as the other one, with better column ordering:

                                                          val1                                        val2
                   sum_pos_max Model_max sum_neg_min Model_min sum_pos_max Model_max sum_neg_min Model_min
Support Group Case                                                                                        
S1      G1    C11           12        M1          -6        M2           7        M2          -2        M2
              C21            7        M1         -20        M2           7        M2          -8        M2
              C22           10        M1         -16        M2           0        M1         -10        M2
              C31            5        M2         -15        M1          15        M2          -9        M1
              C32           10        M2          -8        M1          14        M2          -3        M1
        G2    C11            9        M1         -23        M1          14        M1          -9        M1
              C21           22        M1         -13        M2          21        M1          -9        M1
              C22           24        M2          -6        M1          14        M2         -12        M2
              C31           30        M2         -28        M1          28        M2          -6        M1
              C32           11        M2         -10        M2          11        M2         -16        M1
S2      G1    C11            9        M1          -8        M1           1        M2         -15        M1
              C21            6        M1          -9        M1          15        M2         -13        M1
              C22           11        M2          -3        M1           0        M1          -8        M2
              C31            5        M1          -7        M2           4        M1          -5        M1
              C32           17        M1           0        M1           0        M1         -13        M2
        G2    C11           20        M1         -10        M2          11        M1         -14        M1
              C21            5        M1         -19        M1           6        M2         -18        M2
              C22           11        M1         -23        M2          24        M1         -15        M2
              C31           26        M1         -22        M2          10        M2         -12        M1
              C32           11        M2         -20        M1          13        M2         -11        M1
aneroid
  • 12,983
  • 3
  • 36
  • 66
  • 1
    Checking your answer @aneroid and seeing that there was a difference in the results I noticed there was an error in my results. I have edited the question (14 Jan 2021) and calculated the result for a case "by hand". In this case the results coincide. – PedroBiel Jan 14 '21 at 08:28
  • Btw, for `(S1, G1, C22)` the sum of positives for `val2` is `0` for both M1 and M2. Since `idxmax` selects the first val in case of a tie, this will favour `M1`. _(This will happen for all cases of `sum=0`.)_ – aneroid Jan 14 '21 at 22:45
1

Here's another two solutions, which are more comparable to your original code and final dataframe in your post.

The first step has a slight modification to the order of the groupby. The minmax() function determines whether it's been given a column of positive or negative sums; and applies idxmin/idxmax as required. And whether the return should be the value or the Model, is passed in as a parameter (y). Using agg can generate additional columns under the existing column index, as another level. This gives the same format as the one in your post.

df1 = df.groupby(['Support', 'Group', 'Case', 'Model']).agg([
    ('sum_pos', lambda x: x[x > 0].sum()),
    ('sum_neg', lambda x: x[x < 0].sum())
])

# Model is the last column of the index, so `-1`
def minmax(x, y):
    """x is the series, y='val' or 'Model'"""
    min = x.min()
    if min >= 0:  # or check x.name
        # postive column, return max's; or both vals are 0
        max_pos = x.idxmax()
        return x[max_pos] if y == 'val' else max_pos[-1]
    # negative column, return min's
    return min if y == 'val' else x.idxmin()[-1]

# use `minmax` to generate columns in an aggregate:
df1.groupby(['Support', 'Group', 'Case']).agg([
    ('val', lambda x: minmax(x, 'val')),
    ('Model', lambda x: minmax(x, 'Model')),
])

Result is the same as my previous two answers, with an identical structure to your post (3-level-index columns):

                                          val1                        val2
                         sum_pos       sum_neg       sum_pos       sum_neg
                       val Model     val Model     val Model     val Model
Support Group Case                                                        
S1      G1    C11       12    M1      -6    M2       7    M2      -2    M2
              C21        7    M1     -20    M2       7    M2      -8    M2
              C22       10    M1     -16    M2       0    M1     -10    M2
              C31        5    M2     -15    M1      15    M2      -9    M1
              C32       10    M2      -8    M1      14    M2      -3    M1
        G2    C11        9    M1     -23    M1      14    M1      -9    M1
              C21       22    M1     -13    M2      21    M1      -9    M1
              C22       24    M2      -6    M1      14    M2     -12    M2
              C31       30    M2     -28    M1      28    M2      -6    M1
              C32       11    M2     -10    M2      11    M2     -16    M1
S2      G1    C11        9    M1      -8    M1       1    M2     -15    M1
              C21        6    M1      -9    M1      15    M2     -13    M1
              C22       11    M2      -3    M1       0    M1      -8    M2
              C31        5    M1      -7    M2       4    M1      -5    M1
              C32       17    M1       0    M1       0    M1     -13    M2
        G2    C11       20    M1     -10    M2      11    M1     -14    M1
              C21        5    M1     -19    M1       6    M2     -18    M2
              C22       11    M1     -23    M2      24    M1     -15    M2
              C31       26    M1     -22    M2      10    M2     -12    M1
              C32       11    M2     -20    M1      13    M2     -11    M1

An even shorter solution but generates a tuple of (min/max val, Model) instead of having them in different columns:

df1 = df.groupby(['Support', 'Group', 'Case', 'Model']).agg([
    ('sum_pos', lambda x: x[x > 0].sum()),
    ('sum_neg', lambda x: x[x < 0].sum())
])

df1.groupby(['Support', 'Group', 'Case']).agg({
    ('val1', 'sum_pos'): lambda x: (x.max(), x.idxmax()[-1]),
    ('val1', 'sum_neg'): lambda x: (x.min(), x.idxmin()[-1]),
    ('val2', 'sum_pos'): lambda x: (x.max(), x.idxmax()[-1]),
    ('val2', 'sum_neg'): lambda x: (x.min(), x.idxmin()[-1]),
})

Result sample:

                                   val1                 val2
                     sum_pos    sum_neg   sum_pos    sum_neg
Support Group Case                                          
S1      G1    C11   (12, M1)   (-6, M2)   (7, M2)   (-2, M2)
              C21    (7, M1)  (-20, M2)   (7, M2)   (-8, M2)
              C22   (10, M1)  (-16, M2)   (0, M1)  (-10, M2)
              C31    (5, M2)  (-15, M1)  (15, M2)   (-9, M1)
...
aneroid
  • 12,983
  • 3
  • 36
  • 66
0

You've said "looking for the maximum value of the sum of the positive values" and "sum of the minimum values of the sum of the negative values" but then in your first step, you've applied sum() to the entire groupby without distinguishing between +ve or -ve values. To me, your second step df2 = df1.groupby(...).agg(...) should actually be your first step.

Setting up df as per your original code, then:

# doing your 2nd step as the first step
df1 = df.groupby(['Model', 'Support', 'Group', 'Case']).agg([
    ('sum_pos', lambda x: x[x > 0].sum()),
    ('sum_neg', lambda x: x[x < 0].sum())
])

# Btw, some sum's are `0` in `df1`

# stacking `val1` and `val2` into a column
df2 = df1.stack(level=0)
df2.index.names = df2.index.names[:-1] + ['val']

Create a function which will calculate the model associated with the "max of positive" and "min of negative" values, per val and include that value in the return:

def model_minmax(gr):
    """modifies both rows of the group object passed in"""
    gr[['sum_pos_max', 'Model_max']] = gr.loc[gr['sum_pos'].idxmax(), ['sum_pos', 'Model']]
    gr[['sum_neg_min', 'Model_min']] = gr.loc[gr['sum_neg'].idxmin(), ['sum_neg', 'Model']]
    return gr

Create the groups and apply() the function above:

group = df2.reset_index().groupby(['Support', 'Group', 'Case', 'val'])
df3 = group.apply(model_minmax)
# sort values and drop every alternate row
df3 = df3.sort_values(['Support', 'Group', 'Case', 'val', 'Model'])[::2]

What df3.head(4) looks like:

   Model Support Group Case   val  sum_neg  sum_pos  sum_pos_max Model_max  sum_neg_min Model_min
0     M1      S1    G1  C11  val1        0       12           12        M1           -6        M2
1     M1      S1    G1  C11  val2        0        3            7        M2           -2        M2
2     M1      S1    G1  C21  val1       -5        7            7        M1          -20        M2
3     M1      S1    G1  C21  val2       -4        0            7        M2           -8        M2

Drop the 'Model', sum_neg', and 'sum_pos' columns and some steps to get the data in a similar format to the one you have:

df3.drop(['Model', 'sum_neg', 'sum_pos'], axis=1, inplace=True)

df4 = df3.pivot(index=['Support', 'Group', 'Case'],
                columns='val',
                values=['sum_pos_max', 'Model_max', 'sum_neg_min', 'Model_min'],
)
df4 = df4.swaplevel(0, 1, axis=1).sort_index(axis=1, level=0)

Result:

               val                                        val1                                        val2
                   Model_max Model_min sum_neg_min sum_pos_max Model_max Model_min sum_neg_min sum_pos_max
Support Group Case                                                                                        
S1      G1    C11         M1        M2          -6          12        M2        M2          -2           7
              C21         M1        M2         -20           7        M2        M2          -8           7
              C22         M1        M2         -16          10        M1        M2         -10           0
              C31         M2        M1         -15           5        M2        M1          -9          15
              C32         M2        M1          -8          10        M2        M1          -3          14
        G2    C11         M1        M1         -23           9        M1        M1          -9          14
              C21         M1        M2         -13          22        M1        M1          -9          21
              C22         M2        M1          -6          24        M2        M2         -12          14
              C31         M2        M1         -28          30        M2        M1          -6          28
              C32         M2        M2         -10          11        M2        M1         -16          11
S2      G1    C11         M1        M1          -8           9        M2        M1         -15           1
              C21         M1        M1          -9           6        M2        M1         -13          15
              C22         M2        M1          -3          11        M1        M2          -8           0
              C31         M1        M2          -7           5        M1        M1          -5           4
              C32         M1        M1           0          17        M1        M2         -13           0
        G2    C11         M1        M2         -10          20        M1        M1         -14          11
              C21         M1        M1         -19           5        M2        M2         -18           6
              C22         M1        M2         -23          11        M1        M2         -15          24
              C31         M1        M2         -22          26        M2        M1         -12          10
              C32         M2        M1         -20          11        M2        M1         -11          13

Note that these values will differ from yours due to your first step, as mentioned above. If you're sure those are correct, it can be added before the one's I've given.

aneroid
  • 12,983
  • 3
  • 36
  • 66