I'm going through the basics of data manipulation with pandas, and while working on one of the exercises I noticed some strange behavior of the max()
method when there are missing values in the data. Here is a toy example.
First, create some toy data:
df = pd.DataFrame({'A': [1, np.nan], 'B': [np.nan, 1]})
It is a 2x2 DataFrame. The only difference between the columns is the position of the missing value: it is in the second row of the first column and in the first row of the second column.
     A    B
0  1.0  NaN
1  NaN  1.0
Now I try to find the maximum value in each column in different ways.
Applying the DataFrame.max() method:
df.max()
It gives the results I expected:
A    1.0
B    1.0
dtype: float64
Using the DataFrame.apply() method with max as its argument:
df.apply(max)
The result is
A    1.0
B    NaN
dtype: float64
What is unexpected here is that the maximum of column B is reported to be NaN. I assume that the cause is the NaN value in the first row.
Using the DataFrame.apply() method with 'max' as its argument:
df.apply('max')
Here the results are as expected:
A    1.0
B    1.0
dtype: float64
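To reproduce this, the three calls can be run together as in the sketch below. It assumes numpy and pandas are imported under the usual aliases np and pd; the names by_method, by_builtin, and by_string are only illustrative, and the output of the second call may depend on the pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan], 'B': [np.nan, 1]})

# DataFrame.max() skips missing values by default (skipna=True)
by_method = df.max()

# The built-in max function is passed to apply()
by_builtin = df.apply(max)

# The string 'max' is passed to apply()
by_string = df.apply('max')

print(by_method, by_builtin, by_string, sep='\n\n')
```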
Why is the result of the second approach different from the other two?
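The assumption about the position of the NaN can be checked without pandas. The built-in max shows the same order dependence on plain Python lists. This is a minimal sketch (the list names are illustrative and mirror the two columns), offered as a possibly related observation rather than a confirmed description of what apply() does internally:

```python
import math

col_a = [1.0, math.nan]  # NaN in the second position, like column A
col_b = [math.nan, 1.0]  # NaN in the first position, like column B

# Every ordering comparison with NaN is False, so max() keeps the
# candidate it saw first and never replaces a leading NaN.
max_a = max(col_a)
max_b = max(col_b)

print(max_a)  # 1.0
print(max_b)  # nan
```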