
I have a dataframe with three columns: 10 different stimuli, 16 trials, and a data column containing lists of equal length. I would simply like to get the element-wise mean of the data column, grouped by stimulus. With 10 different stimuli, the result should be 10 arrays, one per stimulus, each being the element-wise mean of the data arrays over the 16 trials.
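To clarify, by element-wise mean I mean averaging the lists position by position, e.g. for one stimulus with two trials:

```python
import numpy as np

# two "trials" for one stimulus, each a list of 4 values
trials = [[1.0, 2.0, 3.0, 4.0],
          [3.0, 4.0, 5.0, 6.0]]

# element-wise mean over trials -> one array of length 4
print(np.mean(trials, axis=0))   # [2. 3. 4. 5.]
```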


I thought about something like this, but it gives me something really weird:

df.groupby('stimulus').apply(np.mean)
>> IndexError: tuple index out of range



Build my dataframe

trial_vec       = np.tile(np.arange(16)+1, 10)     
stimulus_vec    = np.repeat([-2., -1.75, -1., -0.75, -0.5,  0.5,  1.,  1.25,  1.75,  2.5 ], 16)                  
data_vec        = np.random.randint(0, 16, size=160)
df              = pd.DataFrame({'trial': trial_vec, 'stimulus': stimulus_vec, 'data': data_vec}).astype('object')
df["data"]      = [np.random.rand(4).tolist() for i in range(160)]
df
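As a sanity check, reproducing the construction above (the seed is only added here for reproducibility), the frame should have 160 rows and each data cell a list of length 4:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # for reproducibility only
trial_vec    = np.tile(np.arange(16) + 1, 10)
stimulus_vec = np.repeat([-2., -1.75, -1., -0.75, -0.5, 0.5, 1., 1.25, 1.75, 2.5], 16)
df = pd.DataFrame({'trial': trial_vec, 'stimulus': stimulus_vec,
                   'data': [np.random.rand(4).tolist() for _ in range(160)]})

print(df.shape)                 # (160, 3)
print(len(df['data'].iloc[0]))  # 4
```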
Svenno Nito
  • I don't understand what you need. What is the element wise mean? – Ted Petrou Sep 29 '17 at 19:42
  • This seems fundamentally like a bad design... lists in data-frames rarely make sense. If you want element-wise means, use *numpy* arrays. – juanpa.arrivillaga Sep 29 '17 at 19:43
  • How else would you store the data? My original lists have size 1x35000. With 16 trials for each of 10 stimuli, I have 160 of these lists. I thought that would be a convenient way to store them. – Svenno Nito Sep 30 '17 at 10:41

3 Answers


You can convert the data in each group to a 2d list; as long as every cell of the data column holds the same number of elements, this converts cleanly to a 2d numpy array, and taking the mean over axis=0 gives the element-wise (column-wise) mean:

df.groupby('stimulus').data.apply(lambda g: np.mean(g.values.tolist(), axis=0))

#stimulus
#-2.00    [0.641834320107, 0.427639804593, 0.42733812964...
#-1.75    [0.622484839138, 0.529860126072, 0.63310754064...
#-1.00    [0.546323060494, 0.465573022088, 0.54947320390...
#-0.75    [0.431675052484, 0.367636755052, 0.45263194597...
#-0.50    [0.423135952819, 0.544110613089, 0.55496058720...
# 0.50    [0.421858616927, 0.439204977418, 0.43153540636...
# 1.00    [0.612239664017, 0.499305567037, 0.46284515082...
# 1.25    [0.498544756769, 0.481073640317, 0.43564801829...
# 1.75    [0.51821909334, 0.44904063908, 0.358509374567,...
# 2.50    [0.465606275355, 0.516448419224, 0.33715002349...
#Name: data, dtype: object

Or stack data as a 2d array, and then take mean over axis=0:

df.groupby('stimulus').data.apply(lambda g: np.mean(np.stack(g), axis=0))

Edit: if you have NaNs in the data column, you can use np.nanmean to calculate the mean while ignoring them:

df.groupby('stimulus').data.apply(lambda g: np.nanmean(np.stack(g), axis=0))
Psidom
  • Nice and concise. I forgot to mention that I have NaN values inside my data column instead of lists as well. Probably that's why your solution doesn't work for me: ValueError: setting an array element with a sequence. But for others with 2D lists and no NaNs, your answer is very neat. – Svenno Nito Sep 30 '17 at 10:56
  • If you have `nans` in *data*, you can use `np.nanmean` which calculates the average without `nan`. – Psidom Sep 30 '17 at 14:30
  • I think that only works if I have lists of NaNs of the same length, right? In my case it's not a list but a single NaN value in the cell. – Svenno Nito Sep 30 '17 at 17:39
  • Then you would need to drop na from the series before calculating `mean`, i.e. something like `...np.mean(np.stack(g.dropna()), axis=0)`. – Psidom Sep 30 '17 at 17:43
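Following up on the comment above, a minimal sketch of the dropna variant on toy data (made-up values, with a single NaN cell rather than a list):

```python
import numpy as np
import pandas as pd

# toy frame: one stimulus level; one cell is a bare NaN instead of a list
df = pd.DataFrame({
    'stimulus': [-2., -2., -2.],
    'data': [[1.0, 2.0], np.nan, [3.0, 6.0]],
})

# drop the NaN cells per group before stacking
out = df.groupby('stimulus').data.apply(
    lambda g: np.mean(np.stack(g.dropna()), axis=0))
print(out[-2.])   # [2. 4.]
```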

An update

This is actually a rare use case for a grouper that is not in the current DataFrame.

df['data'].apply(pd.Series).groupby(df['stimulus']).mean()
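The line above expands each list into its own column with apply(pd.Series), then groups by a Series that is no longer part of the expanded frame. A toy illustration (made-up values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'stimulus': [-2., -2., 0.5, 0.5],
    'data': [[1.0, 2.0], [3.0, 4.0], [0.0, 2.0], [2.0, 4.0]],
})

# expand lists into columns, group by the external stimulus Series
wide = df['data'].apply(pd.Series).groupby(df['stimulus']).mean()
print(wide.loc[-2.0].tolist())   # [2.0, 3.0]
print(wide.loc[0.5].tolist())    # [1.0, 3.0]
```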

Original

I'm not sure exactly what you are trying to do, but you typically should not have lists in your dataframe. I would properly format your data first and then take the mean of each column by group.

data_proper = df['data'].apply(pd.Series)
df_new = pd.concat([df.drop('data',axis=1), data_proper], axis=1)
df_new.head()

  stimulus trial         0         1         2         3
0       -2     1  0.046361  0.967723  0.707726  0.708462
1       -2     2  0.270566  0.778324  0.638878  0.276983
2       -2     3  0.261356  0.563411  0.639114  0.111150
3       -2     4  0.124745  0.532362  0.869781  0.142513
4       -2     5  0.707596  0.137417  0.493232  0.098975

df_new.groupby('stimulus').mean()

                 0         1         2         3
stimulus                                        
-2.00     0.516795  0.458579  0.527230  0.360560
-1.75     0.418950  0.497287  0.442577  0.518487
-1.00     0.569175  0.350724  0.429025  0.562950
-0.75     0.474533  0.517560  0.472101  0.658333
-0.50     0.481185  0.426829  0.414059  0.571252
 0.50     0.432719  0.563101  0.421617  0.531289
 1.00     0.478947  0.412383  0.458543  0.590503
 1.25     0.596648  0.520953  0.515184  0.513206
 1.75     0.492729  0.524673  0.567336  0.465172
 2.50     0.369798  0.540603  0.499210  0.605297

Or, in one continuous line, inspired by @Scott Boston:

df.drop('data', axis=1)\
  .assign(**df.data.apply(pd.Series).add_prefix('col'))\
  .groupby('stimulus').mean()
Ted Petrou
  • Hey, thanks for your input! Also thanks for visualizing your steps, that helps a lot with understanding. In your last step, when you group by stimulus and then take the mean, what happens to the column 'trials'? – Svenno Nito Sep 30 '17 at 10:54
  • Also, and I know you don't like it, but I would need my old structure back, meaning that all datapoints will be in a list again (I store multiple of those outputs in an excel file later and the code only works, if the data is in a list). Do you know how to reconstruct "df_new" to my original format? – Svenno Nito Sep 30 '17 at 11:31
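Regarding the last comment, one possible way (a sketch with made-up values, not part of the answer above) to collapse the wide mean table back into one list per stimulus:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'stimulus': [-2., -2., 0.5, 0.5],
    'data': [[1.0, 2.0], [3.0, 4.0], [0.0, 2.0], [2.0, 4.0]],
})

# wide per-stimulus means, one column per list position
means_wide = df['data'].apply(pd.Series).groupby(df['stimulus']).mean()

# collapse the columns back into one list per stimulus
means_lists = pd.Series(means_wide.values.tolist(),
                        index=means_wide.index, name='data')
print(means_lists[-2.0])   # [2.0, 3.0]
```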

Using reduce and operator.add:

import numpy as np
import pandas as pd
import operator
from functools import reduce

df.groupby('stimulus').data.apply(
    lambda l: np.array(list(reduce(lambda x, y: map(operator.add, x, y), l))) / len(l))
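What the lambda computes for a single group, checked against the np.stack route from the accepted answer (toy values):

```python
import numpy as np
import operator
from functools import reduce

group = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # the lists for one stimulus

# element-wise running sum via reduce/map, then divide by the trial count
summed = reduce(lambda x, y: map(operator.add, x, y), group)
mean_reduce = np.array(list(summed)) / len(group)

print(mean_reduce)                       # [3. 4.]
print(np.mean(np.stack(group), axis=0))  # [3. 4.]
```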
BENY