Drop rows in pandas dataframe depending on order and NaN

Question

I am using pandas to import a dataframe, and want to drop certain rows before grouping the information.

How do I go from the following (example):

    Name1   Name2   Name3
0   A1  B1  1
1   NaN NaN 2
2   NaN NaN 3
3   NaN B2  4
4   NaN NaN 5   
5   NaN NaN 6
6   NaN B3  7
7   NaN NaN 8
8   NaN NaN 9
9   A2  B4  1
10  NaN NaN 2
11  NaN NaN 3
12  NaN B5  4
13  NaN NaN 5
14  NaN NaN 6
15  NaN B6  7
16  NaN NaN 8
17  NaN NaN 9

to:

    Name1   Name2   Name3
0   A1  B1  1
3   NaN B2  4
6   NaN B3  7
8   NaN NaN 9
9   A2  B4  1
12  NaN B5  4
15  NaN B6  7
17  NaN NaN 9

(My actual case consists of several thousand lines with the same structure as the example)

I tried removing the rows with NaN in Name2 using df = df[df['Name2'].notna()], but then I get this:

    Name1   Name2   Name3
0   A1  B1  1
3   NaN B2  4
6   NaN B3  7
9   A2  B4  1
12  NaN B5  4
15  NaN B6  7

I also need to keep rows 8 and 17 from the example above.

When you say that you need to keep 8 and 17, is this fixed? Or do you have a logic for that (e.g. the last row before a non-NA Name1, or the end)? - mozway, Nov 24 '22 at 14:16

score 1 · Accepted Answer · answered Nov 24 '22 at 14:19

Assuming you want to keep the rows that are either:

- not NA in column "Name2"
- or the last row before a non-NA "Name1", or the last row of the data

You can use boolean indexing:

# is the row not-NA in Name2?
m1 = df['Name2'].notna()
# is it the last row of a group (the next row starts a new Name1, or there is no next row)?
m2 = df['Name1'].notna().shift(-1, fill_value=True)

# keep the row if either condition is True
out = df[m1 | m2]

Output:

   Name1 Name2  Name3
0     A1    B1      1
3    NaN    B2      4
6    NaN    B3      7
8    NaN   NaN      9
9     A2    B4      1
12   NaN    B5      4
15   NaN    B6      7
17   NaN   NaN      9

Intermediates:

   Name1 Name2  Name3     m1     m2  m1|m2
0     A1    B1      1   True  False   True
1    NaN   NaN      2  False  False  False
2    NaN   NaN      3  False  False  False
3    NaN    B2      4   True  False   True
4    NaN   NaN      5  False  False  False
5    NaN   NaN      6  False  False  False
6    NaN    B3      7   True  False   True
7    NaN   NaN      8  False  False  False
8    NaN   NaN      9  False   True   True
9     A2    B4      1   True  False   True
10   NaN   NaN      2  False  False  False
11   NaN   NaN      3  False  False  False
12   NaN    B5      4   True  False   True
13   NaN   NaN      5  False  False  False
14   NaN   NaN      6  False  False  False
15   NaN    B6      7   True  False   True
16   NaN   NaN      8  False  False  False
17   NaN   NaN      9  False   True   True
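
For anyone who wants to reproduce this, here is a minimal self-contained sketch. The DataFrame construction is my own reconstruction of the question's example (the original data is imported from elsewhere), so treat it as an assumption; the two masks are the same as above.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's example: two groups of nine rows.
df = pd.DataFrame({
    "Name1": ["A1"] + [np.nan] * 8 + ["A2"] + [np.nan] * 8,
    "Name2": ["B1", np.nan, np.nan, "B2", np.nan, np.nan, "B3", np.nan, np.nan,
              "B4", np.nan, np.nan, "B5", np.nan, np.nan, "B6", np.nan, np.nan],
    "Name3": list(range(1, 10)) * 2,
})

# True where Name2 holds a value.
m1 = df["Name2"].notna()
# True where the NEXT row starts a new group (non-NA Name1);
# fill_value=True marks the final row, which has no next row.
m2 = df["Name1"].notna().shift(-1, fill_value=True)

out = df[m1 | m2]
kept = out.index.tolist()
print(kept)  # [0, 3, 6, 8, 9, 12, 15, 17]
```

Because the masks are computed column-wise, this should scale to several thousand rows without an explicit loop.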

score 0 · Answer 2 · answered Nov 24 '22 at 14:05

You can use the thresh argument in df.dropna.

import numpy as np
import pandas as pd

# toy data
data = {'name1': [np.nan, np.nan, np.nan, np.nan], 'name2': [np.nan, 1, 2, np.nan], 'name3': [1, 2, 3, 4]}
df = pd.DataFrame(data)

   name1  name2  name3
0    NaN    NaN      1
1    NaN    1.0      2
2    NaN    2.0      3
3    NaN    NaN      4

thresh=2 keeps only the rows that have at least two non-NA values. With three columns, that removes every row with two or more NaN:

df.dropna(thresh=2)

   name1  name2  name3
1    NaN    1.0      2
2    NaN    2.0      3

If you want to keep rows 8 and 17, you can first save them in a separate variable, add them back afterwards with pd.concat (df.append was deprecated and then removed in pandas 2.0), and then re-sort by index with sort_index.
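
A rough sketch of that save-and-restore idea, applied to the question's example. The DataFrame is an assumed reconstruction of the question's data, and the row labels 8 and 17 are hard-coded purely for illustration; on real data you would need some rule to find them.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's example data.
df = pd.DataFrame({
    "Name1": ["A1"] + [np.nan] * 8 + ["A2"] + [np.nan] * 8,
    "Name2": ["B1", np.nan, np.nan, "B2", np.nan, np.nan, "B3", np.nan, np.nan,
              "B4", np.nan, np.nan, "B5", np.nan, np.nan, "B6", np.nan, np.nan],
    "Name3": list(range(1, 10)) * 2,
})

# Hard-coded labels (illustrative assumption): the rows to rescue.
saved = df.loc[[8, 17]]

# Keep rows with at least two non-NA values (here: Name2 and Name3 present).
filtered = df.dropna(thresh=2)

# Put the saved rows back and restore the original order.
result = pd.concat([filtered, saved]).sort_index()
print(result.index.tolist())  # [0, 3, 6, 8, 9, 12, 15, 17]
```

This only works when the rows to rescue are known in advance, which is why a mask-based rule is usually easier on thousands of rows.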
