
I'm wondering if this is a bug, or possibly I don't understand how nanmean should work with a DataFrame. It seems to work if I convert the DataFrame to an array, but not directly on the DataFrame, and no exception is raised either. Originally noticed here: Fill data gaps with average of data from adjacent days

import numpy as np
from pandas import DataFrame

df1 = DataFrame({'x': [1, 3, np.nan]})
df2 = DataFrame({'x': [2, np.nan, 5]})

df1:
    x
0   1
1   3
2 NaN

df2:
    x
0   2
1 NaN
2   5

In [1503]: np.nanmean( [df1,df2], axis=0 )
Out[1503]: 
     x
0  1.5
1  NaN
2  NaN

In [1504]: np.nanmean( [df1.values, df2.values ], axis=0 )
Out[1504]: 
array([[ 1.5],
       [ 3. ],
       [ 5. ]])
JohnE
  • This looks like a bug to me, but it's unclear whether pandas or numpy is at fault; historically there have been a few problems where the conversion to a numpy array is not made. I've encountered this in scikit-learn a lot: http://stackoverflow.com/questions/21390084/valueerror-array-contains-nan-or-infinity-in-assert-all-finite-during-linearsv/21410340#21410340 and http://stackoverflow.com/questions/23095725/getting-scikit-learn-to-work-with-pandas – EdChum Sep 18 '14 at 20:05
  • Also this: http://stackoverflow.com/questions/22669208/attributeerror-series-object-has-no-attribute-searchsorted-pandas/22669229#22669229. This could be an issue with numpy not calling `__array__`, so I don't know if this is really a bug with pandas – EdChum Sep 18 '14 at 20:09
  • I guess the lesson is to not generally assume numpy will translate a DataFrame or Series the way you think it will. Just use .values when there is any doubt (a sketch of that workaround is below)... – JohnE Sep 19 '14 at 14:53
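
For reference, a minimal sketch of that `.values` workaround. The nanmean-on-values call is straight from the question; wrapping the result back into a DataFrame (reusing df1's index and columns) is an added assumption, and it presumes both frames are aligned on the same index and columns.

import numpy as np
from pandas import DataFrame

df1 = DataFrame({'x': [1, 3, np.nan]})
df2 = DataFrame({'x': [2, np.nan, 5]})

# nanmean on the underlying ndarrays behaves as expected (a (3, 1) result);
# rebuilding a DataFrame with df1's labels is an assumed convenience step.
result = DataFrame(np.nanmean([df1.values, df2.values], axis=0),
                   index=df1.index, columns=df1.columns)

print(result)
#      x
# 0  1.5
# 1  3.0
# 2  5.0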

1 Answer


It's definitely strange behavior. I don't have all the answers, but the root of it seems to be that entire pandas DataFrames can be elements of numpy arrays, and that leads to surprising results. I'm guessing this should be avoided as much as possible, and I'm not sure why DataFrames are valid numpy elements at all.

np.nanmean probably converts the arguments into an np.array before applying operations. So let's look at

a = np.array([df1, df2])

First, note that this is not a 3-d array like you might think; it's actually a 1-d object array where each element is a DataFrame.

print(a.shape)
# (2,)

print(type(a[0]))
# <class 'pandas.core.frame.DataFrame'>

So nanmean is taking the mean over the two DataFrame objects themselves, not over the values inside them. This also means the axis argument isn't actually doing anything, and if you try axis=1 you'll get an error because it's a 1-d array.

np.nanmean(a, axis=1)
# IndexError: tuple index out of range

print(np.nanmean(a))
#      x
# 0  1.5
# 1  NaN
# 2  NaN

That's why you're getting a different answer than when you create the array from `.values`. Using `.values` properly creates a 3-d array of numbers, rather than the weird 1-d array of DataFrames.

b = np.array([df1.values, df2.values ])

print(b.shape)
# (2, 3, 1)

print(type(b[1]))
# <class 'numpy.ndarray'>

print(type(b[0,0,0]))
# <class 'numpy.float64'>

These arrays of DataFrames have some especially weird behavior, though. Say we make a length-3 array where the third element is np.nan. You might expect nanmean to give the same answer as it did with `a` before, since it should exclude the nan value, right?

print(np.nanmean(np.array([df1, df2, np.nan])))
#     x
# 0 NaN
# 1 NaN
# 2 NaN

Yeah, so I'm not sure. Best to avoid making these object arrays of DataFrames at all.
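
As a side note (not part of the original answer), here's a pandas-only sketch that avoids building an object array of DataFrames entirely: concatenate the frames and take the mean per row label, which skips NaN by default. This assumes the frames share the same index, as in the question.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x': [1, 3, np.nan]})
df2 = pd.DataFrame({'x': [2, np.nan, 5]})

# Stack the two frames and average per original row label; mean() skips NaN
# by default, so a gap in one frame is filled from the other.
avg = pd.concat([df1, df2]).groupby(level=0).mean()

print(avg)
#      x
# 0  1.5
# 1  3.0
# 2  5.0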

Roger Fan
  • `I'm not sure why DataFrames are valid numpy elements at all` — pandas is built on numpy arrays, but the key thing here is not to expect it to always behave the way you think, especially with DataFrames: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe-interoperability-with-numpy-functions – EdChum Sep 19 '14 at 15:29