
Given a dataframe as follows:

  city district  year  price
0   bj       cy  2018    NaN
1   bj       cy  2019    6.0
2   sh       hp  2018    4.0
3   sh       hp  2019    3.0
4   bj       hd  2018    7.0
5   bj       hd  2019    NaN

How can I group by `city` and `district`, and keep the rows of every group that contains at least one `NaN` price? Thank you.

The output I need looks like this:

  city district  year  price
0   bj       cy  2018    NaN
1   bj       cy  2019    6.0
2   bj       hd  2018    7.0
3   bj       hd  2019    NaN

I have tried `df.groupby(['city', 'district']).filter(lambda df: df[df['price'].isnull()])`, but it doesn't work.

ah bon

  • your filter could work with this: `df.groupby(['city','district']).filter(lambda x: x.price.isna().any())` – sammywemmy Apr 23 '20 at 09:15

1 Answer


Use GroupBy.transform to test whether a group contains at least one NaN, i.e. at least one True per group:

# flag rows with a NaN price, then broadcast the group-level test back
# to every row: transform('any') is True for all rows of any group that
# contains at least one NaN
mask = (df.assign(test = df['price'].isnull())
          .groupby(['city', 'district'])['test']
          .transform('any'))
df = df[mask]

print (df)
  city district  year  price
0   bj       cy  2018    NaN
1   bj       cy  2019    6.0
4   bj       hd  2018    7.0
5   bj       hd  2019    NaN

If performance is not important, or the DataFrame is small, it is possible to use DataFrameGroupBy.filter:

df = df.groupby(['city','district']).filter(lambda x: x['price'].isnull().any())
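
Note that `filter` must return a single boolean per group, which is why the original attempt (a lambda returning a DataFrame) failed. Both solutions keep the original index (`0, 1, 4, 5`); if the fresh `0`–`3` index shown in the expected output matters, one possible follow-up:

df = df.reset_index(drop=True)   # discard the old 0, 1, 4, 5 labels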
jezrael

  • and how to filter the grouped rows without `NaN`s? – ah bon Apr 23 '20 at 09:20
  • @ahbon - Simplest is to invert the mask: `df1 = df[~mask]` (see the sketch after these comments) – jezrael Apr 23 '20 at 09:21
  • Sorry, one more question: what if I need to exclude groups whose `year` is `2019` and `price` is `NaN`? – ah bon Apr 23 '20 at 10:12
  • @ahbon - What about changing `df.assign(test = df['price'].isnull())` to `df.assign(test = df['price'].isnull() & df['year'].eq(2019))`? (sketch below) – jezrael Apr 23 '20 at 10:13
  • Is it possible to do it with your second solution? – ah bon Apr 23 '20 at 10:14
  • @ahbon - yep, `lambda x: x['price'].isnull().any() and df['year'].eq(2019).any()`? There is `and` because these are scalars – jezrael Apr 23 '20 at 10:15
  • It raises an error: `TypeError: bad operand type for unary ~: 'float'`. – ah bon Apr 23 '20 at 10:27
  • @ahbon - One idea, can you test `df = df.groupby(['city','district']).filter(lambda x: (x['price'].isnull() & x['year']).any())` ? – jezrael Apr 23 '20 at 10:35
  • No missing values for that column, I have checked. In fact, in the real data it's a date column that I converted to string, so I tried `df['price'].isnull() & df['date'].str.contains('2019-04-01')` and `df['price'].isnull() & df['date'].isin(['2019-04-01'])`; neither works. – ah bon Apr 23 '20 at 10:38
  • I put another question here: https://stackoverflow.com/questions/61385609/filter-grouped-rows-based-on-multiple-conditions-in-pandas – ah bon Apr 23 '20 at 11:01
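
A minimal, self-contained sketch of the two variants discussed in the comments above (keeping only groups without NaN, and excluding groups whose price is NaN in 2019), assuming `year` is an integer column as in the question; the names `no_nan`, `bad`, `drop` and `kept` are illustrative only:

import numpy as np
import pandas as pd

df = pd.DataFrame({'city': ['bj', 'bj', 'sh', 'sh', 'bj', 'bj'],
                   'district': ['cy', 'cy', 'hp', 'hp', 'hd', 'hd'],
                   'year': [2018, 2019, 2018, 2019, 2018, 2019],
                   'price': [np.nan, 6.0, 4.0, 3.0, 7.0, np.nan]})

# 1) Keep only groups WITHOUT any NaN price: invert the mask.
mask = df['price'].isnull().groupby([df['city'], df['district']]).transform('any')
no_nan = df[~mask]                       # sh/hp rows only

# 2) Exclude groups that have a NaN price in year 2019: build the
#    row-wise condition first, then test it per group.
bad = df['price'].isnull() & df['year'].eq(2019)
drop = bad.groupby([df['city'], df['district']]).transform('any')
kept = df[~drop]                         # drops bj/hd, keeps bj/cy and sh/hp

# Equivalent filter version; note x['price'] and x['year'] (not df[...]),
# and the row-wise & inside a single .any():
kept2 = df.groupby(['city', 'district']).filter(
    lambda x: not (x['price'].isnull() & x['year'].eq(2019)).any())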