
I want to find non-duplicates within groups and return all of them, but only when the group key appears multiple times. For example,

Var  Code   Description
1    A      ABC
2    B      EF
1    A      ABCD
2    B      EF
1    B      ABC
3    C      DD

I expect to get some result like this

Var  Code Description
1    A    ABC
1    A    ABCD

The reason I want this table is to detect which (var, code) pairs have multiple descriptions, because in my case a pair with multiple descriptions means those descriptions should be modified to be unique.

This table returns rows that have the same variable name and the same code name but different descriptions, because in my data each code of a variable should have exactly one description.

`3 C DD` is not returned because there is only one record for the (3, C) pair, and DD is its only description.

`2 B EF` is not returned because, although two rows contain the (2, B) pair, they share the same description, which is normal.

`1 B ABC` is not returned because, although var 1 appears three times, the (1, B) pair appears only once and has only one description, ABC.
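
For reproducibility, the sample frame above can be constructed as follows (a sketch; the dtypes are an assumption based on the table):

```python
import pandas as pd

# Sample data exactly as shown in the table above
df = pd.DataFrame({
    'Var': [1, 2, 1, 2, 1, 3],
    'Code': ['A', 'B', 'A', 'B', 'B', 'C'],
    'Description': ['ABC', 'EF', 'ABCD', 'EF', 'ABC', 'DD'],
})
```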

Any solutions?

Thank you!

Eleanor
  • Why doesn't C DD get returned? – Scott Boston Nov 19 '18 at 20:49
  • And why doesn't `1 B ABC` get returned (as that's the only ABC in its group)? – fuglede Nov 19 '18 at 20:50
  • Possible duplicate of [Better way to identify duplicates in a group in a Pandas dataframe?](https://stackoverflow.com/questions/46217510/better-way-to-identify-duplicates-in-a-group-in-a-pandas-dataframe) – Thelouras Nov 19 '18 at 20:51
  • @ScottBoston because DD is the only description for 3 C, so I don't need to look into it. It is normal to have one description for one (var,code) pair. – Eleanor Nov 19 '18 at 20:51
  • @fuglede because ABC is the only description for 1 B, so I don't need to look into it. It is normal to have one description for one (var,code) pair. – Eleanor Nov 19 '18 at 20:53
  • pretty confusing question. Maybe you want this: `df.drop_duplicates().sort_values(df.columns.tolist())`? – robertwest Nov 19 '18 at 20:54
  • By that logic, wouldn't you want 2 B EF in your output? Or are you removing all entries that are duplicated, *then* removing all groups for which there is only a single element? – fuglede Nov 19 '18 at 20:56
  • oh, you only want to see the sets that have more than one description? – robertwest Nov 19 '18 at 20:57
  • first get rid of duplicate rows and then create a dictionary or list, and use isin to filter the data. – pizza lover Nov 19 '18 at 20:57
  • Umm... are you after `df[df.groupby(['Var', 'Code']).Description.transform('nunique').gt(1)]` maybe? That seems to match your example output but doesn't seem to match your description... I guess it's possible you'd also want to throw another `.drop_duplicates(['Var', 'Code'])` on that as well... – Jon Clements Nov 19 '18 at 20:57
  • @JonClements: If there were a 2 B ABCD row, then that would return the duplicated 2 B EF rows. – fuglede Nov 19 '18 at 20:59
  • @fuglede which might be fine... I don't think anyone's quite got what's being asked here so far :) – Jon Clements Nov 19 '18 at 21:00
  • If you need to modify them to be unique, perhaps just `df.groupby(['Var', 'Code']).Description.transform('first')`. Unless you want to pick something totally different than either value that appears. – ALollz Nov 19 '18 at 21:11

2 Answers


You can use `groupby` and filter the rows using `nunique`:

new_df = df.groupby(['Var', 'Code']).filter(lambda x: x['Description'].nunique() > 1) 

    Var Code    Description
0   1   A       ABC
2   1   A       ABCD
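
For reference, a minimal runnable sketch of this answer against the sample data (the DataFrame construction is an assumption based on the question's table):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Var': [1, 2, 1, 2, 1, 3],
    'Code': ['A', 'B', 'A', 'B', 'B', 'C'],
    'Description': ['ABC', 'EF', 'ABCD', 'EF', 'ABC', 'DD'],
})

# Keep only rows of (Var, Code) groups that have more than one
# distinct Description
new_df = df.groupby(['Var', 'Code']).filter(
    lambda x: x['Description'].nunique() > 1
)
print(new_df)  # rows 0 and 2: (1, A, ABC) and (1, A, ABCD)
```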
Vaishali
msk1 = df.duplicated(subset=['Code', 'Var'], keep=False)
msk2 = df.duplicated(subset=['Code', 'Var', 'Description'], keep=False)
msk3 = msk1 & ~msk2
df[msk3]

# msk1 filters to all rows where both Var and Code are duplicated
print(df[msk1])
Var  Code Description
1    A    ABC
1    A    ABCD
2    B    EF
2    B    EF

# msk2 further filters to rows where Var, Code, and Description are all duplicated
print(df[msk2])
Var Code Description 
2   B    EF
2   B    EF

# so then finally msk3 is msk1 without msk2 
# give me all rows where Var and Code are duplicated 
# except for the ones where the Description is also duplicated
# - the desired result 
print(df[msk3])
Var  Code Description
1    A    ABC
1    A    ABCD
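
One caveat, mirroring fuglede's comment above: if a group contains both a duplicated description and a distinct one, the mask approach drops the duplicated rows, while the `nunique` filter keeps the whole group. A sketch of that edge case, assuming an extra `2 B ABCD` row is added to the sample data:

```python
import pandas as pd

# Sample data plus an extra (2, B, ABCD) row to expose the edge case
df = pd.DataFrame({
    'Var': [1, 2, 1, 2, 1, 3, 2],
    'Code': ['A', 'B', 'A', 'B', 'B', 'C', 'B'],
    'Description': ['ABC', 'EF', 'ABCD', 'EF', 'ABC', 'DD', 'ABCD'],
})

msk1 = df.duplicated(subset=['Code', 'Var'], keep=False)
msk2 = df.duplicated(subset=['Code', 'Var', 'Description'], keep=False)
masked = df[msk1 & ~msk2]   # drops the two duplicated '2 B EF' rows

nuniq = df.groupby(['Var', 'Code']).filter(
    lambda x: x['Description'].nunique() > 1
)                           # keeps every row of the (2, B) group
```

Which behavior is correct depends on whether you want to see every row of a conflicting (Var, Code) pair or only the descriptions that appear once.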
robertwest
  • While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value. – Nic3500 Nov 20 '18 at 01:23
  • Thanks. I expanded my answer to describe how it works. – robertwest Nov 20 '18 at 02:21