
Say I have a pandas dataframe like this:

Doctor Patient Days
Aaron Jeff 23
Aaron Josh 46
Aaron Josh 71
Jess Manny 55
Jess Manny 85
Jess Manny 46

I want to extract the dataframes where a combination of a doctor and a patient occurs more than once. I will be doing further work on the extracted dataframes.
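For reference, the example above can be rebuilt as a small frame like this (a sketch; column names taken from the table):

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "Doctor":  ["Aaron", "Aaron", "Aaron", "Jess", "Jess", "Jess"],
    "Patient": ["Jeff", "Josh", "Josh", "Manny", "Manny", "Manny"],
    "Days":    [23, 46, 71, 55, 85, 46],
})

# Size of each (Doctor, Patient) group; pairs with size > 1 are the ones to extract.
counts = df.groupby(["Doctor", "Patient"]).size()
print(counts[counts > 1])
```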

So, for instance, in this example, dataframe

Doctor Patient Days
Aaron Josh 46
Aaron Josh 71

would be extracted AND dataframe

Doctor Patient Days
Jess Manny 55
Jess Manny 85
Jess Manny 46

would be extracted.

Under this condition, the dataframe

Doctor Patient Days
Aaron Jeff 23

will not be extracted because the combination of Aaron and Jeff occurs only once.

Now, I have a dataframe that has 400000 rows and the code I have written so far is, I think, inefficient in procuring the dataframes that I want. Here is the code:

    doctors = list(df_1.Doctor.unique())  # df_1 being the dataframe with 400K rows
    for doctor in doctors:
        df_2 = df_1[df_1['Doctor'] == doctor]  # extract one sub-dataframe per doctor
        patients = list(df_2.Patient.unique())
        for patient in patients:
            df_3 = df_2[df_2['Patient'] == patient]  # extract one sub-sub-dataframe per doctor-patient pair
            if len(df_3) >= 2:
                # do something

As you can see, this is already verging on O(n^2) runtime (I say "verging" because there are not 400K unique values in each column). Is there a way to minimize the runtime? If so, how can my code be improved?

Thanks!

Umesh

3 Answers


You can do this with groupby:

d = {x : y  for x, y in df.groupby(['Doctor','Patient']) if len(y) > 1}
d
Out[36]: 
{('Aaron', 'Josh'):   Doctor Patient  Days
 1  Aaron    Josh    46
 2  Aaron    Josh    71, ('Jess', 'Manny'):   Doctor Patient  Days
 3   Jess   Manny    55
 4   Jess   Manny    85
 5   Jess   Manny    46}
BENY
  • Had the same in mind ;) – mozway Jun 16 '22 at 19:59
  • This is the best answer. I have one more question for you. I see that the variable 'd' is a dictionary where the key is ('Aaron', 'Josh') and the value is the rest. Is there a way now to extract the numbers of the column 'Days' and put it into a list? Thank you. – Umesh Shreeman Jun 16 '22 at 20:37
  • @UmeshShreeman `[y['Days'].tolist() for x, y in df.groupby(['Doctor', 'Patient']) if len(y) > 1]` – BENY Jun 16 '22 at 20:58
  • @BENY, thank you for the response brother. Much love! – Umesh Shreeman Jun 16 '22 at 22:07
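Putting the comment's suggestion into a runnable sketch (using the example frame from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "Doctor":  ["Aaron", "Aaron", "Aaron", "Jess", "Jess", "Jess"],
    "Patient": ["Jeff", "Josh", "Josh", "Manny", "Manny", "Manny"],
    "Days":    [23, 46, 71, 55, 85, 46],
})

# One list of Days per (Doctor, Patient) pair that occurs more than once.
days_lists = [g["Days"].tolist()
              for _, g in df.groupby(["Doctor", "Patient"])
              if len(g) > 1]
print(days_lists)  # [[46, 71], [55, 85, 46]]
```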

You can use pd.DataFrame.duplicated, like so: df.loc[df.duplicated()].

This selects rows where all values are duplicated; to check only specific columns, set the subset parameter (note the column names are capitalized in your frame):

rows = df.loc[df.duplicated(subset=['Doctor', 'Patient'])]
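One detail worth noting (sketched here with the question's example data): duplicated() defaults to keep='first', which marks only the second and later occurrences, so the first row of each repeated pair would be dropped. Passing keep=False marks every row of a repeated pair:

```python
import pandas as pd

df = pd.DataFrame({
    "Doctor":  ["Aaron", "Aaron", "Aaron", "Jess", "Jess", "Jess"],
    "Patient": ["Jeff", "Josh", "Josh", "Manny", "Manny", "Manny"],
    "Days":    [23, 46, 71, 55, 85, 46],
})

# Default keep='first': only the 2nd occurrence onward is marked.
partial = df.loc[df.duplicated(subset=["Doctor", "Patient"])]

# keep=False: every row of a repeated pair is marked.
full = df.loc[df.duplicated(subset=["Doctor", "Patient"], keep=False)]
print(len(partial), len(full))  # 3 5
```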
Ach113
  • Unless there is more that I need to add to your code, this code will only give me a singular dataframe with rows that have duplicates as opposed to giving multiple dataframes which contain the duplicates that I want. Thank you for the answer though! – Umesh Shreeman Jun 16 '22 at 20:06
  • But you can then additionally apply a groupby operation to extract separate dataframes. I like this answer. It is very explicit. – Thomas Hilger Jun 16 '22 at 20:09
  • I see what you are saying but I have a question. From what I have seen, a groupby operation produces a dictionary. What groupby code would produce separate dataframes? Thank you. – Umesh Shreeman Jun 16 '22 at 20:39
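To answer the follow-up in the comments, a sketch combining duplicated() with groupby to get one separate dataframe per repeated pair (groupby itself yields (key, sub-dataframe) pairs when iterated, not a dict):

```python
import pandas as pd

df = pd.DataFrame({
    "Doctor":  ["Aaron", "Aaron", "Aaron", "Jess", "Jess", "Jess"],
    "Patient": ["Jeff", "Josh", "Josh", "Manny", "Manny", "Manny"],
    "Days":    [23, 46, 71, 55, 85, 46],
})

# Keep every row whose (Doctor, Patient) pair repeats...
dupes = df.loc[df.duplicated(subset=["Doctor", "Patient"], keep=False)]

# ...then split into one dataframe per pair.
frames = {pair: g for pair, g in dupes.groupby(["Doctor", "Patient"])}
for pair, g in frames.items():
    print(pair, len(g))
```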

Here is one way to do it:

df2 = (df.groupby(['Doctor','Patient'])['Days'].count() > 1).reset_index()
df2 = df2.drop(df2[df2['Days']==False].index)
df.merge(df2, on=['Doctor','Patient'], suffixes=('','_y')).drop(columns='Days_y')
    Doctor  Patient     Days
0   Aaron   Josh        46
1   Aaron   Josh        71
2   Jess    Manny       55
3   Jess    Manny       85
4   Jess    Manny       46
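The same rows can also be obtained in one step with groupby(...).filter; a sketch equivalent to the count-and-merge above, though not necessarily faster on 400K rows:

```python
import pandas as pd

df = pd.DataFrame({
    "Doctor":  ["Aaron", "Aaron", "Aaron", "Jess", "Jess", "Jess"],
    "Patient": ["Jeff", "Josh", "Josh", "Manny", "Manny", "Manny"],
    "Days":    [23, 46, 71, 55, 85, 46],
})

# Keep only the rows belonging to (Doctor, Patient) groups with more than one row.
result = df.groupby(["Doctor", "Patient"]).filter(lambda g: len(g) > 1)
print(result)
```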
Naveed