
Say I have a pandas dataframe like this:

Doctor Patient Days
Aaron Jeff 23
Aaron Josh 46
Aaron Josh 71
Jess Manny 55
Jess Manny 85
Jess Manny 46

I want to extract the dataframes where a combination of a doctor and a patient occurs more than once. I will be doing further work on the extracted dataframes.
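For reference, the example above can be rebuilt as a small frame like this (a sketch; column names taken from the table):

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "Doctor":  ["Aaron", "Aaron", "Aaron", "Jess", "Jess", "Jess"],
    "Patient": ["Jeff", "Josh", "Josh", "Manny", "Manny", "Manny"],
    "Days":    [23, 46, 71, 55, 85, 46],
})

# Size of each (Doctor, Patient) group; pairs with size > 1 are the ones to extract.
counts = df.groupby(["Doctor", "Patient"]).size()
print(counts[counts > 1])
```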

So, for instance, in this example, dataframe

Doctor Patient Days
Aaron Josh 46
Aaron Josh 71

would be extracted AND dataframe

Doctor Patient Days
Jess Manny 55
Jess Manny 85
Jess Manny 46

would be extracted.

Under this condition, the dataframe

Doctor Patient Days
Aaron Jeff 23

will not be extracted because the combination of Aaron and Jeff occurs only once.

Now, I have a dataframe that has 400000 rows and the code I have written so far is, I think, inefficient in procuring the dataframes that I want. Here is the code:

    doctors = list(df_1.Doctor.unique())  # df_1 being the dataframe with 400K rows
    for doctor in doctors:
        df_2 = df_1[df_1['Doctor'] == doctor]  # extract one sub-dataframe per doctor
        patients = list(df_2.Patient.unique())
        for patient in patients:
            df_3 = df_2[df_2['Patient'] == patient]  # extract one sub-sub-dataframe per doctor-patient pair
            if len(df_3) >= 2:
                # do something

As you can see, this is already verging on O(n^2) runtime (I say "verging" because there are not 400K unique values in each column). Is there a way to minimize the runtime? If so, how can my code be improved?

Thanks!

Umesh

3 Answers


You can do this with groupby:

d = {x : y  for x, y in df.groupby(['Doctor','Patient']) if len(y) > 1}
d
Out[36]: 
{('Aaron', 'Josh'):   Doctor Patient  Days
 1  Aaron    Josh    46
 2  Aaron    Josh    71, ('Jess', 'Manny'):   Doctor Patient  Days
 3   Jess   Manny    55
 4   Jess   Manny    85
 5   Jess   Manny    46}
BENY
  • Had the same in mind ;) – mozway Jun 16 '22 at 19:59
  • This is the best answer. I have one more question for you. I see that the variable 'd' is a dictionary where the key is ('Aaron', 'Josh') and the value is the rest. Is there a way now to extract the numbers of the column 'Days' and put it into a list? Thank you. – Umesh Shreeman Jun 16 '22 at 20:37
  • @UmeshShreeman `[y['Days'].tolist() for x, y in df.groupby(['Doctor', 'Patient']) if len(y) > 1]` – BENY Jun 16 '22 at 20:58
  • @BENY, thank you for the response brother. Much love! – Umesh Shreeman Jun 16 '22 at 22:07
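Putting the comment's suggestion into a runnable sketch (using the example frame from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "Doctor":  ["Aaron", "Aaron", "Aaron", "Jess", "Jess", "Jess"],
    "Patient": ["Jeff", "Josh", "Josh", "Manny", "Manny", "Manny"],
    "Days":    [23, 46, 71, 55, 85, 46],
})

# One list of Days per (Doctor, Patient) pair that occurs more than once.
days_lists = [g["Days"].tolist()
              for _, g in df.groupby(["Doctor", "Patient"])
              if len(g) > 1]
print(days_lists)  # [[46, 71], [55, 85, 46]]
```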

You can use pd.DataFrame.duplicated, like so: df.loc[df.duplicated()].

This selects rows where all values are duplicated; to check only specific columns, set the subset parameter (note the column names are capitalized in your frame):

rows = df.loc[df.duplicated(subset=['Doctor', 'Patient'])]
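One detail worth noting (sketched here with the question's example data): duplicated() defaults to keep='first', which marks only the second and later occurrences, so the first row of each repeated pair would be dropped. Passing keep=False marks every row of a repeated pair:

```python
import pandas as pd

df = pd.DataFrame({
    "Doctor":  ["Aaron", "Aaron", "Aaron", "Jess", "Jess", "Jess"],
    "Patient": ["Jeff", "Josh", "Josh", "Manny", "Manny", "Manny"],
    "Days":    [23, 46, 71, 55, 85, 46],
})

# Default keep='first': only the 2nd occurrence onward is marked.
partial = df.loc[df.duplicated(subset=["Doctor", "Patient"])]

# keep=False: every row of a repeated pair is marked.
full = df.loc[df.duplicated(subset=["Doctor", "Patient"], keep=False)]
print(len(partial), len(full))  # 3 5
```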
Ach113
  • Unless there is more that I need to add to your code, this code will only give me a singular dataframe with rows that have duplicates as opposed to giving multiple dataframes which contain the duplicates that I want. Thank you for the answer though! – Umesh Shreeman Jun 16 '22 at 20:06
  • But you can then additionally apply a groupby operation to extract separate dataframes. I like this answer. It is very explicit. – Thomas Hilger Jun 16 '22 at 20:09
  • I see what you are saying but I have a question. From what I have seen, a groupby operation produces a dictionary. What groupby code would produce separate dataframes? Thank you. – Umesh Shreeman Jun 16 '22 at 20:39
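To answer the follow-up in the comments, a sketch combining duplicated() with groupby to get one separate dataframe per repeated pair (groupby itself yields (key, sub-dataframe) pairs when iterated, not a dict):

```python
import pandas as pd

df = pd.DataFrame({
    "Doctor":  ["Aaron", "Aaron", "Aaron", "Jess", "Jess", "Jess"],
    "Patient": ["Jeff", "Josh", "Josh", "Manny", "Manny", "Manny"],
    "Days":    [23, 46, 71, 55, 85, 46],
})

# Keep every row whose (Doctor, Patient) pair repeats...
dupes = df.loc[df.duplicated(subset=["Doctor", "Patient"], keep=False)]

# ...then split into one dataframe per pair.
frames = {pair: g for pair, g in dupes.groupby(["Doctor", "Patient"])}
for pair, g in frames.items():
    print(pair, len(g))
```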

Here is one way to do it:

df2 = (df.groupby(['Doctor','Patient'])['Days'].count() > 1).reset_index()
df2 = df2.drop(df2[df2['Days']==False].index)
df.merge(df2, on=['Doctor','Patient'], suffixes=('','_y')).drop(columns='Days_y')
    Doctor  Patient     Days
0   Aaron   Josh        46
1   Aaron   Josh        71
2   Jess    Manny       55
3   Jess    Manny       85
4   Jess    Manny       46
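The same rows can also be obtained in one step with groupby(...).filter; a sketch equivalent to the count-and-merge above, though not necessarily faster on 400K rows:

```python
import pandas as pd

df = pd.DataFrame({
    "Doctor":  ["Aaron", "Aaron", "Aaron", "Jess", "Jess", "Jess"],
    "Patient": ["Jeff", "Josh", "Josh", "Manny", "Manny", "Manny"],
    "Days":    [23, 46, 71, 55, 85, 46],
})

# Keep only the rows belonging to (Doctor, Patient) groups with more than one row.
result = df.groupby(["Doctor", "Patient"]).filter(lambda g: len(g) > 1)
print(result)
```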
Naveed