If I have a data frame df and a function f() in R that look like this:

df = data.frame(
  group=c("cat","fish","horse","cat","fish","horse","cat","horse"),
  x = c(1,4,7,2,5,8,3,9)
)
f <- function(animal,x) {
  nchar(animal) + mean(x)*(x+1)
}

Applying f() to each group and adding a new column with the result is straightforward:

library(dplyr)
mutate(group_by(df,group),result=f(cur_group(),x))

Output:

  group     x result
  <chr> <dbl>  <dbl>
1 cat       1    7  
2 fish      4   26.5
3 horse     7   69  
4 cat       2    9  
5 fish      5   31  
6 horse     8   77  
7 cat       3   11  
8 horse     9   85  

What is the correct way to do the same in Python if d is a pandas.DataFrame?

import numpy as np
import pandas as pd
d = pd.DataFrame({"group":["cat","fish","horse","cat","fish","horse","cat","horse"], "x":[1,4,7,2,5,8,3,9]})

def f(animal,x):
    return [np.mean(x)*(k+1) + len(animal) for k in x]

I know I can get the "correct" values like this:

d.groupby("group").apply(lambda g: f(g.name,g.x))

I can also flatten that into a single Series with .explode(), but what is the correct way to add the values to the frame as a new column, in the original row order?

Expected Output (python)

   group  x  result
0    cat  1     7.0
1   fish  4    26.5
2  horse  7    69.0
3    cat  2     9.0
4   fish  5    31.0
5  horse  8    77.0
6    cat  3    11.0
7  horse  9    85.0
langtang

2 Answers

Use transform:

d['result'] = d.groupby('group')['x'].transform('mean').mul(d['x'].add(1)) + d['group'].str.len()
d['result']
Out[540]: 
0     7.0
1    26.5
2    69.0
3     9.0
4    31.0
5    77.0
6    11.0
7    85.0
Name: result, dtype: float64
BENY

The pandas approach follows a different logic.

Instead of putting everything in a function and calling apply, keep the operations vectorized. GroupBy.transform broadcasts a per-group scalar (here, the mean of x) to every row of its group:

g = d.groupby('group')

d['result'] = g['x'].transform('mean').mul(d['x'].add(1))+d['group'].str.len()
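
To make the broadcasting step visible, here is a minimal sketch that splits the one-liner above into its parts. The intermediate names group_mean and name_len are introduced only for illustration; the data is the frame from the question.

```python
import pandas as pd

d = pd.DataFrame({
    "group": ["cat", "fish", "horse", "cat", "fish", "horse", "cat", "horse"],
    "x": [1, 4, 7, 2, 5, 8, 3, 9],
})

# transform('mean') returns one value per row, aligned with d's index,
# so every row carries the mean of its own group.
group_mean = d.groupby("group")["x"].transform("mean")

# Length of the group label, also one value per row.
name_len = d["group"].str.len()

# Ordinary element-wise arithmetic; no per-group function call is needed.
d["result"] = group_mean * (d["x"] + 1) + name_len
```

Because group_mean already has the same index as d, the result lands in the original row order without any reordering step.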

If you really want to use apply, use vectorized code inside the function:

def f(g):
    return g['x'].mean()*(g['x']+1)+g['group'].str.len()

d['result'] = d.groupby("group", group_keys=False).apply(f)

Output:

   group  x  result
0    cat  1     7.0
1   fish  4    26.5
2  horse  7    69.0
3    cat  2     9.0
4   fish  5    31.0
5  horse  8    77.0
6    cat  3    11.0
7  horse  9    85.0
mozway
  • Thanks, helpful. I (perhaps mistakenly) gave an example of a function that is too simple. My real function `f()` needs many of the columns of the frame `d` plus the value of the grouping variable, and is much more complicated; however, it returns a simple list of values that is the same length as the subsetted frame it is passed. Perhaps your second option will work. – langtang May 11 '22 at 14:44
  • @langtang yes, the second option should work; use `g.name` to access the group name if it is not a simple column – mozway May 11 '22 at 14:47
  • Thanks again. This second option works for the data shown, so I've selected it. It's not working for me on my actual data, but that is because I'm not sure how to make my `f()` truly vectorized (and that's a completely different question). – langtang May 11 '22 at 19:13
  • @langtang maybe you can ask a follow-up question where you describe the logic and provide your non-working code? – mozway May 11 '22 at 19:14
  • Appreciate your patience, and it's a good idea. Basically, I have a dictionary of models, indexed by group (say `{"horse":horse_model, "fish":fish_model, "dog":dog_model}`), and I want to pass each subset of the test data, say `d`, to a function `f()` that looks up the model in the dictionary and returns the predicted values; each subset should result in a set of predicted values the same length as the subset, so I figured it would be easy to return the result, as in your example. If I can formulate it appropriately, I'll ask a separate question. – langtang May 11 '22 at 19:25
  • I've formulated the question in a way that gets closer to my real problem; see https://stackoverflow.com/questions/72207183/ – langtang May 11 '22 at 20:21
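
For the dictionary-of-models case described in the comments, a rough sketch follows. Everything model-related here is hypothetical: the models dict holds simple callables standing in for real fitted models, and predict_group and the pred column are names made up for illustration. It iterates over the groups and writes each group's predictions back by index, which sidesteps the question of what apply returns.

```python
import pandas as pd

d = pd.DataFrame({
    "group": ["cat", "fish", "horse", "cat", "fish", "horse", "cat", "horse"],
    "x": [1, 4, 7, 2, 5, 8, 3, 9],
})

# Hypothetical stand-ins for fitted models, keyed by group name.
models = {
    "cat": lambda frame: frame["x"] * 10,
    "fish": lambda frame: frame["x"] + 0.5,
    "horse": lambda frame: frame["x"] ** 2,
}

def predict_group(name, frame):
    # Look up the group's model and return one prediction per row, as a list.
    return list(models[name](frame))

# Start with an empty float column, then fill it group by group.
d["pred"] = float("nan")
for name, frame in d.groupby("group"):
    # frame.index holds the original row labels of this group.
    d.loc[frame.index, "pred"] = predict_group(name, frame)
```

A group name missing from the dictionary would raise a KeyError in this sketch; whether to skip such groups or fail loudly depends on the real data.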