Pandas: remove group from dataframe if the last value in the group meets a required condition

Question

Similar question found here but a little twist that I cant solve, and an error that I keep getting. Is it possible for the code to look at the last item in the group and depending on if a condition exits, it will drop the whole group? The condition is: drop entire group if the last row has a 'Status' between 30 and 39 inclusive.

I used a similar approach as found in the link above using apply() and filter(). When I use the apply method, this code removes just that row that has that condition but not the complete group:

df.groupby('Sort_Key').apply(lambda x: x[~x['Status'].between(30,39,inclusive=True)])

but when I use filter method, as suggested in the link above, I get an error

df.groupby('Sort_Key').filter(lambda x: x[~x['Status'].between(30,39,inclusive=True)])

TypeError: filter function returned a Series, but expected a scalar bool

So 1st, what is wrong with my filter approach? Is this the best approach for this problem?

2nd. Is there a way to base the conditional on the final row in that series.

Data Set:

df = pd.DataFrame({'Sort_Key': ['100000003', '100000009', '100000009', '100000009', '100000034','100000034', '100000034', '100000048'], 
               'Claim_no': [40000, 50000, 42000, 50000, 42000, 50000, 42000, 5000],
              'Discharge': [456435, 50345345, 4345435,345756,34557,6867456,345435,346546],
              'Admit': [678678, 67867867, 678678,678679,8989,67867,89697,9678678],
              'Status': [12, 12, 52,30,30,18,19,30]})

Original dataframe:

                     Admit      Claim_no    Discharge   Sort_Key    Status
Sort_Key                        
100000003   0       678678      40000       456435      100000003   12
100000009   1       67867867    50000       50345345    100000009   12
            2       678678      42000       4345435     100000009   52
            3       678679      50000       345756      100000009   30
100000034   4       8989        42000       34557       100000034   30
            5       67867       50000       6867456     100000034   18
            6       89697       42000       345435      100000034   19
100000048   7       9678678     5000        346546      100000048   30

Final dataframe:

                     Admit      Claim_no    Discharge   Sort_Key    Status
Sort_Key                        
100000003   0       678678      40000       456435      100000003   12
100000034   4       8989        42000       34557       100000034   30
            5       67867       50000       6867456     100000034   18
            6       89697       42000       345435      100000034   19

score 3 · Accepted Answer · answered Jan 10 '18 at 16:39

To fix your code

df.groupby('Sort_Key').filter(lambda x: ~pd.Series(x['Status'].iloc[-1]).between(30,39,inclusive=True).any())
Out[325]: 
    Admit  Claim_no  Discharge   Sort_Key  Status
0  678678     40000     456435  100000003      12
4    8989     42000      34557  100000034      30
5   67867     50000    6867456  100000034      18
6   89697     42000     345435  100000034      19

What I will do

s=df.groupby('Sort_Key').Status.last()
df.loc[~df.Sort_Key.isin(s[s!=30].index)]
Out[333]: 
      Admit  Claim_no  Discharge   Sort_Key  Status
1  67867867     50000   50345345  100000009      12
2    678678     42000    4345435  100000009      52
3    678679     50000     345756  100000009      30
7   9678678      5000     346546  100000048      30

Wow. Thank you for the fast response! Can I ask what the advantage and disadvantage of each is? — CandleWax, Jan 10 '18 at 16:52

score 0 · Answer 2 · answered Jan 10 '18 at 17:22

I'd do it this way:

In [25]: df[df.groupby('Sort_Key')['Status']
              .transform(lambda x: not x.tail(1).between(30,39).any())]
Out[25]:
    Admit  Claim_no  Discharge   Sort_Key  Status
0  678678     40000     456435  100000003      12
4    8989     42000      34557  100000034      30
5   67867     50000    6867456  100000034      18
6   89697     42000     345435  100000034      19

Pandas: remove group from dataframe if the last value in the group meets a required condition

2 Answers2