
I have a dataframe that looks like this, but with a larger number of rows:

id         status       year
1           yes          2013
1           no           2013
1           yes          2014
3           no           2012
4           yes          2014
6           no           2014

I'd like to filter the dataframe so that if two rows share the same id and year but differ in status, only the row with the 'yes' status remains. If an id and year combination has only a 'no' row, with no matching 'yes', I'd still like to keep it.

This leads me to the issue of not being able to simply filter the status column down to rows with 'yes'.

The resulting data frame should look like this: the second row of the first data frame is removed because id 1 and year 2013 have a 'yes' associated with them. However, the rows with ids 3 and 6 remain because there is no 'yes' for those id and year combinations:

id         status       year
1           yes          2013
1           yes          2014
3           no           2012
4           yes          2014
6           no           2014
Priya

2 Answers


You can compute two conditions:

  1. One using groupby, transform and nunique, flagging (id, year) groups with only a single distinct status, and
  2. The other checking whether status equals 'yes'

OR the two masks together, then use the result to filter df:

m1 = df.groupby(['id','year']).status.transform('nunique').eq(1) 
m2 = df.status.eq('yes')
df[m1 | m2]

   id status  year
0   1    yes  2013
2   1    yes  2014
3   3     no  2012
4   4    yes  2014
5   6     no  2014
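For a self-contained version, here is a sketch that rebuilds the question's sample frame and also exposes the two intermediate masks, so you can see which rows each condition keeps:

```python
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    'id':     [1, 1, 1, 3, 4, 6],
    'status': ['yes', 'no', 'yes', 'no', 'yes', 'no'],
    'year':   [2013, 2013, 2014, 2012, 2014, 2014],
})

# m1: True where the (id, year) group has only one distinct status,
# i.e. the 'no'-only combinations that must be kept.
m1 = df.groupby(['id', 'year']).status.transform('nunique').eq(1)

# m2: True where status is 'yes'.
m2 = df.status.eq('yes')

# Keep a row if either condition holds.
res = df[m1 | m2]
print(res)
```

Only the (1, 2013) group has both statuses, so its 'no' row is the single row where both masks are False.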
cs95
  • To add, even with this solution, categorising `status` is a good idea and will give performance improvements. – jpp Dec 19 '18 at 18:26
  • @jpp (unrelated, but...) I am quite pleased with how pandas handles the behaviour of various functions with categorical data. See [case 1](https://stackoverflow.com/questions/49593222/how-do-i-pivot-one-dataframe-column-to-a-truth-table-with-columns-based-on-anoth/49593274#49593274), crosstab auto-reindexes on categories, and [case 2](https://stackoverflow.com/questions/53719607/groupby-two-columns-with-margins-for-first-level/53719867#53719867) as done by groupby in a similar fashion. – cs95 Dec 19 '18 at 18:34

sort_values + drop_duplicates

This is a good opportunity to use Categorical Data. Defining an ordered categorical with 'yes' before 'no' means sorting by status puts each 'yes' row first, so removing duplicates by id and year keeps it:

df['status'] = pd.Categorical(df['status'], ordered=True, categories=['yes', 'no'])

res = df.sort_values('status').drop_duplicates(['id', 'year']).sort_index()

print(res)

   id status  year
0   1    yes  2013
2   1    yes  2014
3   3     no  2012
4   4    yes  2014
5   6     no  2014

Depending on your use case, the final sort by index may be unnecessary.
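As a self-contained sketch (rebuilding the question's sample data), the whole approach fits in a few lines:

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    'id':     [1, 1, 1, 3, 4, 6],
    'status': ['yes', 'no', 'yes', 'no', 'yes', 'no'],
    'year':   [2013, 2013, 2014, 2012, 2014, 2014],
})

# Ordered categorical: 'yes' sorts before 'no'.
df['status'] = pd.Categorical(df['status'], ordered=True,
                              categories=['yes', 'no'])

# Sorting by status places every 'yes' row ahead of every 'no' row,
# so drop_duplicates keeps the 'yes' row of each (id, year) pair;
# sort_index restores the original row order.
res = (df.sort_values('status')
         .drop_duplicates(['id', 'year'])
         .sort_index())
print(res)
```

Note that correctness here does not depend on sort stability: because 'yes' < 'no' in the category order, any valid sort puts the 'yes' row of a pair before its 'no' row.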

jpp