3

So i want to calculate standard deviation excluding current group using groupby. Here an example of the data:

import pandas as pd
df = pd.DataFrame ({
                      'group' : ['A','A','A','A','A','A','B','B','B','B','B','B'],
                      'team'  : ['1','1','2','2','3','3','1','1','2','2','3','3',]
                      'value' : [1,2,5,7,2,3,7,8,8,9,6,4]
                  })

For example, for group A team 1, i want to calculate the std dev of team 2 and 3, for group A team 2, i want to calculate the std dev of group 1 and 3, and so on.

I managed to do it using groupby and apply but when using it on real data with literally milion of rows, it takes too long. So i am looking for a solution with vectorization.

def std(row, data):
   data = data.loc[data['group']==row['group]]
   return data.groupby(['team']).filter(lambda x:(x['tool]!=row['team']).all())['value'].std()
df['std_exclude'] = df.apply(lambda x: std(data=df),axis=1)
nirina
  • 33
  • 2

1 Answers1

1

You can use transform after combining group and team as a list:

df['std'] = (df.assign(new=df[['group', 'team']].values.tolist())['new'].transform(
             lambda x: df[df['group'].eq(x[0]) & df['team'].ne(x[1])]['value'].std())) 

Output:

group team value    std
0   A   1   1   2.217356
1   A   1   2   2.217356
2   A   2   5   0.816497
3   A   2   7   0.816497
4   A   3   2   2.753785
5   A   3   3   2.753785
6   B   1   7   2.217356
7   B   1   8   2.217356
8   B   2   8   1.707825
9   B   2   9   1.707825
10  B   3   6   0.816497
11  B   3   4   0.816497

There are some equal std values across different groups but you can verify that their std values are indeed equal.

Nuri Taş
  • 3,828
  • 2
  • 4
  • 22
  • Can you explain why you convert 'group' and 'team' to a list, please? – Nev1111 Nov 20 '22 at 15:19
  • 1
    That's not the fastest way, but to be able to use `transform`, I converted `group` and `team` to one single column, that's why I needed to convert them to a list. – Nuri Taş Nov 20 '22 at 15:30
  • Thank you, this is what I am looking for, i will test it for performance. I also thought of similar solution, like transform all values and all values excluding the current row to list, then use something like np.std((Counter(list_a) - Counter(list_b)).elements()). – nirina Nov 21 '22 at 00:19
  • Update: turns out using list inside the column is very slow. I did another method using groupby and transform the value into list then to Counter and substract them. surprisingly substracting two columns containing counter also very slow even thought it looks like a vectorized method. – nirina Nov 24 '22 at 00:18
  • Thanks for the update. groupby and transform combination are indeed a slow method for large datasets. – Nuri Taş Nov 24 '22 at 07:54