
I have a rather large DataFrame (df) in which every cell holds either an array or NaN. The first 3 rows look like this:

df:
                 A                B                C
X  [4, 8, 1, 1, 9]              NaN  [8, 2, 8, 4, 9]
Y  [4, 3, 4, 1, 5]  [1, 2, 6, 2, 7]  [7, 1, 1, 7, 8]
Z              NaN  [9, 3, 8, 7, 7]  [2, 6, 3, 1, 9]

I already know (thanks to piRSquared) how to take the element-wise mean over rows for each column so that I get this:

element_wise_mean:
A                        [4.0, 5.5, 2.5, 1.0, 7.0]
B                        [5.0, 2.5, 7.0, 4.5, 7.0]
C    [5.66666666667, 3.0, 4.0, 4.0, 8.66666666667]

Now I would like to get the corresponding element-wise standard deviation. Any ideas? Also, I don't yet understand what groupby() is doing here. Could someone explain it in more detail?


df

import numpy as np
import pandas as pd

np.random.seed([3,14159])
df = pd.DataFrame(
    np.random.randint(10, size=(3, 3, 5)).tolist(),
    list('XYZ'), list('ABC')
).applymap(np.array)

df.loc['X', 'B'] = np.nan
df.loc['Z', 'A'] = np.nan

element_wise_mean

df2               = df.stack().groupby(level=1)
element_wise_mean = df2.apply(np.mean, axis=0)

element_wise_sd

element_wise_sd   = df2.apply(np.std, axis=0)
TypeError: setting an array element with a sequence.
Svenno Nito

2 Answers


Applying np.std through a lambda that first converts each group to a NumPy array works for me:

element_wise_std = df2.apply(lambda x: np.std(np.array(x), 0))
# np.std's documented default is axis=None, not axis=0;
# omitting the argument also worked for me here:
#element_wise_std = df2.apply(lambda x: np.std(np.array(x)))
print (element_wise_std)
A                            [0.0, 2.5, 1.5, 0.0, 2.0]
B                            [4.0, 0.5, 1.0, 2.5, 0.0]
C    [2.62466929134, 2.16024689947, 2.94392028878, ...
dtype: object

Or solution from comment:

element_wise_std = df2.apply(lambda x: np.std(x.values, 0))
print (element_wise_std)
A                            [0.0, 2.5, 1.5, 0.0, 2.0]
B                            [4.0, 0.5, 1.0, 2.5, 0.0]
C    [2.62466929134, 2.16024689947, 2.94392028878, ...
dtype: object
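
One detail worth knowing about these numbers: np.std uses ddof=0 by default (the population standard deviation), whereas pandas' own std methods default to ddof=1. A minimal sketch using the two non-missing arrays of column A from the question; the variable names are only illustrative:

```python
import numpy as np

# Column A from the question: the arrays in rows X and Y (row Z is NaN).
a = np.stack([np.array([4, 8, 1, 1, 9]), np.array([4, 3, 4, 1, 5])])

# Default ddof=0: divide by n. This reproduces the values printed for A above.
population_std = np.std(a, axis=0)

# ddof=1: divide by n - 1 (sample standard deviation).
sample_std = np.std(a, axis=0, ddof=1)

print(population_std)
print(sample_std)
```

Pass ddof=1 inside the lambda if the sample standard deviation is what you need.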

I'll try to explain in more detail:

First, reshape with stack: the column labels are moved into the index, which creates a MultiIndex (row label, column label).

print (df.stack())
X  A    [4, 8, 1, 1, 9]
   C    [8, 2, 8, 4, 9]
Y  A    [4, 3, 4, 1, 5]
   B    [1, 2, 6, 2, 7]
   C    [7, 1, 1, 7, 8]
Z  B    [9, 3, 8, 7, 7]
   C    [2, 6, 3, 1, 9]
dtype: object
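
The same reshaping can be seen on a tiny frame with plain numbers. This is a minimal sketch with made-up values (`small` is a hypothetical frame, not the one from the question). Whether stack itself drops the NaN cells depends on the pandas version, as far as I know, so the sketch calls dropna() explicitly to get the layout shown above either way:

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 frame with two missing cells.
small = pd.DataFrame(
    {"A": [1.0, 2.0, np.nan], "B": [np.nan, 3.0, 4.0]},
    index=list("XYZ"),
)

# Move the column labels into the index, then drop the missing cells.
stacked = small.stack().dropna()

print(stacked.index.nlevels)  # number of index levels after stacking
print(stacked)
```

Level 0 of the resulting index holds the row labels (X, Y, Z) and level 1 holds the former column labels (A, B).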

Then groupby(level=1) groups by the second level of the MultiIndex (levels are numbered from 0), i.e. by the values A, B, C, and apply runs a function on each group. Here that function is np.std.

pandas does not handle cells that contain arrays or lists very well, so the explicit conversion is necessary. (It looks like a bug.)
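
The conversion can also be made fully explicit, without relying on how pandas hands each group to NumPy. A minimal sketch, assuming every non-missing cell holds a 1-D array of the same length; `columnwise_std` is a hypothetical helper name, not a pandas or NumPy function, and the frame is rebuilt by hand from the values shown in the question:

```python
import numpy as np
import pandas as pd

# Frame rebuilt from the values in the question; NaN marks a missing cell.
df = pd.DataFrame(
    {
        "A": [np.array([4, 8, 1, 1, 9]), np.array([4, 3, 4, 1, 5]), np.nan],
        "B": [np.nan, np.array([1, 2, 6, 2, 7]), np.array([9, 3, 8, 7, 7])],
        "C": [
            np.array([8, 2, 8, 4, 9]),
            np.array([7, 1, 1, 7, 8]),
            np.array([2, 6, 3, 1, 9]),
        ],
    },
    index=list("XYZ"),
)


def columnwise_std(frame):
    """Hypothetical helper: element-wise std over the rows of each column."""
    result = {}
    for name, column in frame.items():
        # Keep only the cells that actually hold an array (skips NaN).
        arrays = [cell for cell in column if isinstance(cell, np.ndarray)]
        # np.stack builds a real 2-D array of shape (n_rows, array_length).
        result[name] = np.std(np.stack(arrays), axis=0)
    return result


element_wise_std = columnwise_std(df)
for name, values in element_wise_std.items():
    print(name, values)
```

Because np.stack produces a genuine 2-D numeric array, axis=0 unambiguously means "across rows" here.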

jezrael
  • A pandas column is a sequence, and in this case each sequence is an array. It looks like the pandas implementation is not playing nice with using this sequence of arrays. By doing `x.values` or `np.array(x)` the column is explicitly converted to a 2D array and so it works thereafter. Weird it works with `mean` and not `std` - would probably raise an issue on the pandas GitHub to see what else could be going on – Ken Syme Sep 18 '17 at 11:49
  • @KenSyme - Nice idea - I posted it [here](https://github.com/pandas-dev/pandas/issues/17571). – jezrael Sep 18 '17 at 12:57
  • Amazing, thanks! It is counterintuitive to me that np.mean and np.std should behave differently on the same dataset, but it really works this way. Would love to hear from you again once you find out why that is. – Svenno Nito Sep 18 '17 at 13:55

Jezrael beat me to this:

To answer your question about .groupby(): try .apply(print). It prints each group in turn, which shows exactly what gets passed to the function you give to apply:

df2 = df.stack().groupby(level=1)  # groups by the second index level of df.stack()
df2.apply(print)
X  A    [4, 8, 1, 1, 9]
Y  A    [4, 3, 4, 1, 5]
Name: A, dtype: object
Y  B    [1, 2, 6, 2, 7]
Z  B    [9, 3, 8, 7, 7]
Name: B, dtype: object
X  C    [8, 2, 8, 4, 9]
Y  C    [7, 1, 1, 7, 8]
Z  C    [2, 6, 3, 1, 9]
Name: C, dtype: object

Conversely, try:

df3 = df.stack().groupby(level=0)  # groups by the first index level of df.stack()
df3.apply(print)
X  A    [4, 8, 1, 1, 9]
   C    [8, 2, 8, 4, 9]
Name: X, dtype: object
Y  A    [4, 3, 4, 1, 5]
   B    [1, 2, 6, 2, 7]
   C    [7, 1, 1, 7, 8]
Name: Y, dtype: object
Z  B    [9, 3, 8, 7, 7]
   C    [2, 6, 3, 1, 9]
Name: Z, dtype: object
Tony