I'm going through the basics of data manipulation with pandas, and while working on one of the exercises I noticed some strange behavior of the max() method when the data contains missing values. Here is a toy example.

First, create a toy DataFrame:

    df = pd.DataFrame({'A': [1, np.nan], 'B': [np.nan, 1]})

It is a 2x2 DataFrame. The two columns differ only in where the missing value sits: column A has NaN in the second row, and column B has NaN in the first row.

    A   B
0   1.0 NaN
1   NaN 1.0

Now I try to find the maximum value in each column in three different ways:

  1. Applying the DataFrame.max() method.

    df.max()        
    

    It gives the result I expected:

    A    1.0
    B    1.0
    dtype: float64
    
  2. Using the DataFrame.apply() method with the built-in max function as its argument.

    df.apply(max)
    

    The result is:

    A    1.0
    B    NaN
    dtype: float64
    

    What is unexpected here is that the maximum of column B is reported as NaN. I assume the cause is the NaN value in the first row of that column.

  3. Using the DataFrame.apply() method with the string 'max' as its argument.

    df.apply('max')
    

    Here the result is as expected:

    A    1.0
    B    1.0
    dtype: float64
    

Why is the result of the second approach different from the other two?
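
One way to probe this is to take apply() out of the picture and call both kinds of max on the columns directly. The sketch below assumes that df.apply(max) hands each column to Python's built-in max, while df.max() and df.apply('max') end up in the pandas method, which skips NaN by default (skipna=True). The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan], 'B': [np.nan, 1]})

# Any ordering comparison that involves NaN evaluates to False.
print(1.0 > np.nan, np.nan > 1.0)  # False False

# Built-in max keeps its first item unless a later item compares greater,
# so the position of NaN decides the outcome.
builtin_a = max(df['A'])  # starts at 1.0; NaN never compares greater
builtin_b = max(df['B'])  # starts at NaN; 1.0 never compares greater than NaN

# Series.max skips NaN by default (skipna=True).
pandas_a = df['A'].max()
pandas_b = df['B'].max()

print(builtin_a, builtin_b)  # 1.0 nan
print(pandas_a, pandas_b)    # 1.0 1.0
```

If this assumption holds, the NaN for column B in the second approach would come from the order of the values rather than from pandas itself.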

mskoryk