13

I've just found out about this strange behaviour of mask, could someone explain this to me?

A) [input]

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] = 'hi'
df.mask(df[['A', 'B']]<3, inplace=True)

[output]

     A    B   C
0  NaN  NaN  hi
1  NaN  3.0  hi
2  4.0  5.0  hi
3  6.0  7.0  hi
4  8.0  9.0  hi

B) [input]

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] ='hi'
df.mask(df[['A', 'B']]<3)

[output]

     A    B    C
0  NaN  NaN  NaN
1  NaN  3.0  NaN
2  4.0  5.0  NaN
3  6.0  7.0  NaN
4  8.0  9.0  NaN

Thank you in advance

JeB
  • Welcome to SO JeB. Good observation! – SeaBean Mar 04 '21 at 11:47
  • Maybe you should do `df = ...` if you don't use `inplace`. Or maybe you should first read the documentation for `df.mask()`; it may explain this. – furas Mar 04 '21 at 13:56
  • All functions that have an `inplace` option behave differently with and without `inplace=True`. The option exists for a reason: to work in a different way. – furas Mar 04 '21 at 13:59
  • The main point is that column C is treated differently with and without inplace=True. Without it, the values in column C are treated as if they also met the condition and are changed to NaN (since the `other` parameter of the mask function defaults to NaN). This has nothing to do with whether we re-assign to the original df. – SeaBean Mar 04 '21 at 14:02
  • @furas yes, but if I do df=... with inplace=False, the output differs from what I get with inplace=True, and that is suspicious. – JeB Mar 04 '21 at 14:04

2 Answers

5

The root cause of the different results is that you pass a boolean DataFrame that does not have the same shape as the DataFrame you want to mask. Internally, pandas aligns the condition to the DataFrame and fills the missing part of the condition with the value of inplace.

From the source code, you can see that pandas.DataFrame.mask() calls pandas.DataFrame.where() internally, with the condition inverted. pandas.DataFrame.where() then calls a _where() method that replaces values where the condition is False.
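As a rough illustration of that relationship (a sketch, not the pandas source): when the condition already has the same shape as the frame, df.mask(cond) and df.where(~cond) should give the same result. The names full_cond, masked and where_inverted are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(-1, 3), columns=['A', 'B', 'C'])

# A condition with the same shape as df, so no alignment or filling is involved.
full_cond = df < 3

# mask() replaces values where the condition is True ...
masked = df.mask(full_cond)
# ... while where() replaces values where the condition is False.
where_inverted = df.where(~full_cond)

# Both calls should blank out exactly the same cells.
same = masked.equals(where_inverted)
print(same)
```

With a full-shape condition there is nothing left for pandas to fill in, so the inplace argument should not affect which cells are replaced.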

I'll use df.where() as the example; here is the code:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(-1, 3), columns=['A', 'B', 'C'])

df1 = df.where(df[['A', 'B']]<3)

df.where(df[['A', 'B']]<3, inplace=True)

In this example, the df is

   A   B   C
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

df[['A', 'B']]<3, the value of cond argument, is

       A      B
0   True   True
1  False  False
2  False  False
3  False  False

Digging into the _where() method, the following lines are the key part:

    def _where(...):
        # align the cond to same shape as myself
        cond = com.apply_if_callable(cond, self)
        if isinstance(cond, NDFrame):
            cond, _ = cond.align(self, join="right", broadcast_axis=1)
        ...
        # make sure we are boolean
        fill_value = bool(inplace)
        cond = cond.fillna(fill_value)

Since cond and df have different shapes, cond.align() fills the missing column with NaN values. After that, cond looks like this:

       A      B   C
0   True   True NaN
1  False  False NaN
2  False  False NaN
3  False  False NaN

Then, with cond.fillna(fill_value), the NaN values are replaced with the value of inplace. So the C column of cond takes the same value as inplace.
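The two steps can be reproduced by hand to see what happens to column C. This is only an emulation under my own assumptions, not the pandas internals: I call align() without broadcast_axis, and the names aligned, as_copy and as_inplace are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(-1, 3), columns=['A', 'B', 'C'])
cond = df[['A', 'B']] < 3

# Step 1: align cond to df's labels. Column C does not exist in cond,
# so it shows up in the aligned condition filled with NaN.
aligned, _ = cond.align(df, join="right")

# Step 2: fill the NaN values with bool(inplace).
# inplace=False -> the missing part becomes False (values get replaced).
as_copy = aligned.fillna(False).astype(bool)
# inplace=True -> the missing part becomes True (values are kept).
as_inplace = aligned.fillna(True).astype(bool)

print(aligned)
print(as_copy)
print(as_inplace)
```

The only difference between as_copy and as_inplace should be column C, which is the column the condition never mentioned.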

There is more code (L9048 and L9124-L9145) related to inplace, but we needn't care about the details, since the aim of those lines is to replace values where the condition is False.

Recall that the df is

   A   B   C
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
  • df1=df.where(df[['A', 'B']]<3): The C column of cond is False, since inplace defaults to False. After df.where(), the C column of the result is set to the value of the other argument, which is NaN by default.
  • df.where(df[['A', 'B']]<3, inplace=True): The C column of cond is True. After df.where(), the C column of df is left unchanged.
# print(df1)
     A    B   C
0  0.0  1.0 NaN
1  NaN  NaN NaN
2  NaN  NaN NaN
3  NaN  NaN NaN

# print(df) after df.where(df[['A', 'B']]<3, inplace=True)
     A    B   C
0  0.0  1.0   2
1  NaN  NaN   5
2  NaN  NaN   8
3  NaN  NaN  11
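One way to avoid depending on this fill is to pass a condition that already covers every column. The following is a minimal sketch assuming you want the columns missing from the condition to be left alone; full_cond and result are names made up for the example, and reindex() with fill_value is my choice here, not something pandas does for you.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(-1, 3), columns=['A', 'B', 'C'])

# Extend the condition to all of df's columns. For where(), True means
# "keep the value", so the added column C is filled with True.
full_cond = (df[['A', 'B']] < 3).reindex(columns=df.columns, fill_value=True)

# Nothing is missing from the condition any more, so column C is kept.
result = df.where(full_cond)
print(result)
```

For mask() the logic is reversed (True means "replace"), so the same idea would use fill_value=False.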
Ynjxsjmh
0

Think of it simply.

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] ='hi'
df.mask(df[['A', 'B']]<3)

The last line operates on the full DataFrame (df). The condition covers only columns ['A', 'B'], so, since column 'C' is not part of the condition, NaN is returned for column C.

The following is equivalent to df.mask(df[['A', 'B']]<3):

>>> df[["A","B","C"]].mask(df[['A', 'B']]<3)
     A    B    C
0  NaN  NaN  NaN
1  NaN  3.0  NaN
2  4.0  5.0  NaN
3  6.0  7.0  NaN
4  8.0  9.0  NaN
>>>

And df.mask(df[['A', 'B', 'C']]<3) raises an error, because column 'C' holds strings:

TypeError: '<' not supported between instances of 'str' and 'int'

Finally, to return only columns "A" and "B":

>>> df[["A","B"]].mask(df[['A', 'B']]<3)
     A    B
0  NaN  NaN
1  NaN  3.0
2  4.0  5.0
3  6.0  7.0
4  8.0  9.0

When you run the command with inplace=True, nothing happens to column C: it is missing from the condition, and with inplace=True the mask method treats that missing part as 'do nothing'.
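If the goal is to mask A and B while keeping C, one option is to mask only those two columns and write them back. This is just a sketch of that idea; the names cols and out are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] = 'hi'

cols = ['A', 'B']

# Work on a copy so the original df stays as it was.
out = df.copy()
# Mask only the columns the condition is about, then assign them back.
out[cols] = df[cols].mask(df[cols] < 3)
print(out)
```

Because the condition and the masked frame have the same columns here, no part of the condition is missing, so column C is never touched.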

Paulo Marques