0

while I was scripting a column, I came into something very interesting. There are two ways in which I was using pd.DataFrame.isna for single and multiple columns. While I am scripting in multiple brackets pd.df.isna is returning the entire code back to me.

override[override.ORIGINAL_CREDITOR_ID.notna()].shape

override[override[['ORIGINAL_CREDITOR_ID']].notna()].shape

So the first line returns me 3880 rows and runs in 2.5ms whereas the second one returns me all the rows present in the override data frame and that too takes 3.08s. Is there a reason why that is happening? How can I avoid this because I have to make it configurable for passing multiple columns in the second query?

  • 3
    `override.ORIGINAL_CREDITOR_ID` is a `pandas.Series` and `override[['ORIGINAL_CREDITOR_ID']]` is a `pandas.DataFrame` – not_speshal Jul 30 '21 at 15:23

2 Answers2

0

The first line of code is selection with a Boolean Series, while the second is selection with a Boolean DataFrame, and these are handled very differently as DataFrames are 2D and there are 2 axes to align. There's a section dedicated to illustrating this difference in the pandas docs.

In the first case, selection with a Boolean Series, you return all columns for only the rows that are True in the Boolean Series.

In the case of selection with a Boolean DataFrame, you return an object the same shape as the original where the True values in the Boolean DataFrame are kept and any False values are replaced with NaN. (It's actually implemented as DataFrame.where) For rows and columns that don't appear in your Boolean DataFrame mask, those become NaN by default.

import pandas as pd
df = pd.DataFrame({'a': [1, 2, np.NaN, 4],
                   'b': [10, 11, 12, 13]})

# Boolean Series, return all columns only for for rows where condition is True
df[df['a'] == 2]
#     a   b
#1  2.0  11


# Boolean DataFrame, equivalent to df.where(df[['a']] == 2)
df[df[['a']] == 2]
#     a   b
#0  NaN NaN
#1  2.0 NaN
#2  NaN NaN
#3  NaN NaN
ALollz
  • 57,915
  • 7
  • 66
  • 89
  • Got it, this is really helpful. So is there a way I can filter nans on the basis of two columns or do I have to filter by getting rid of nans value? – thisismihir Jul 30 '21 at 15:35
  • @thisismihir You could look for `NaN` in both columns and then still collapse to a Series. So if you wanted rows with a `NaN` in either column you could do: `df[df[['col1', 'col2']].isnull().any(1)]`, using `.any(axis=1)` to collapse the DataFrame mask to a Series mask, asking whether for that row there is a null in any of those columns. – ALollz Jul 30 '21 at 15:37
0

So, I found a way where once I've got the data frame of True and False, I then take a bitwise operation by using all or any. You can refer to:

override[override[['ORIGINAL_CREDITOR_ID']].notna().all(1)].shape

This would help me in filtering the results I want and that too much faster i.e. in 8ms. I found this answer on here. So hope you find that useful. Let me know if you need more understanding.