
I have a large data frame (more than 100 columns and several hundred thousand rows) with a number of rows that contain duplicate data. I am trying to remove the duplicate rows, keeping the one with the largest value in a different column.

Essentially, I am sorting the data into individual bins based on time period, so across periods, one would expect to find a lot of duplication, as most entities exist across all time periods. What can't be allowed, however, is for the same entity to appear more than once in a given time period.

I tried the approach from "python pandas: Remove duplicates by columns A, keeping the row with the highest value in column B" on a subset of the data, with the plan to recombine the result with the original dataframe, df.

Example data subset:

          unique_id  period_id       liq
index
19         CAN00CE0     199001  0.017610
1903       USA07WG0     199001  1.726374
12404      USA07WG0     199001  0.090525
13330      USA08DE0     199001  1.397143
14090      USA04U80     199001  2.000716
12404      USA07WG0     199002  0.090525
13330      USA08DE0     199002  1.397143
14090      USA04U80     199002  2.000716

In the example above, USA07WG0 appears twice in period 199001: I would like to keep the first instance (index 1903, where liq is higher at 1.726374) and discard the second (index 12404, where liq is 0.090525). Note that there can be more than two duplicates in a given period_id.
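
For anyone who wants to reproduce this, the subset above can be rebuilt with something like the following (a sketch only; the real frame has 100+ columns):

import pandas as pd

# Rebuild the example subset shown above.
df = pd.DataFrame(
    {"unique_id": ["CAN00CE0", "USA07WG0", "USA07WG0", "USA08DE0", "USA04U80",
                   "USA07WG0", "USA08DE0", "USA04U80"],
     "period_id": [199001, 199001, 199001, 199001, 199001,
                   199002, 199002, 199002],
     "liq": [0.017610, 1.726374, 0.090525, 1.397143, 2.000716,
             0.090525, 1.397143, 2.000716]},
    index=pd.Index([19, 1903, 12404, 13330, 14090, 12404, 13330, 14090],
                   name="index"),
)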

I tried the following, but it was very slow for me (I stopped it after more than 5 minutes):

def h(x):
    x = x.dropna()  # idxmax fails on NaNs, and I'm happy to throw out rows where liq is NaN.
    return x.loc[x.liq.idxmax()]

df.groupby(['unique_id', 'period_id'], group_keys=False).apply(h)

I ultimately did the below, which is more verbose and ugly and simply throws out all but one duplicate, but this is also very slow! Given the speed of other operations of similar complexity, I thought I would ask here for a better solution.

So my request is really to fix the above code so that it is fast; the code below is given as guidance. If a solution in the vein of the below is offered, perhaps the duplicates could also be discarded based on index, rather than the reset_index/set_index approach that I have employed:

def do_remove_duplicates(df):
    sub_df = df[['period_id', 'unique_id']]
    grp = sub_df.groupby(['period_id', 'unique_id'], as_index=False)
    cln = grp.apply(lambda x: x.drop_duplicates(subset='unique_id'))   # apply drop_duplicates. This line is the slow bit!
    cln = cln.reset_index()                  # remove the MultiIndex that apply has added
    del cln['level_0']                       # drop the group-number level
    cln.set_index('level_1', inplace=True)   # set the index back to the original (same as df)
    df_cln = cln.join(df, how='left', rsuffix='_right')   # left join with the original, discarding the duplicate rows
    return df_cln
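
For completeness, this is how I call it on the subset rebuilt above, together with a quick sanity check (a sketch; as noted in the comment above, it is the apply line that is the slow bit on the full frame):

df_cln = do_remove_duplicates(df)

# Hypothetical sanity check: each (unique_id, period_id) pair should now appear exactly once.
assert not df_cln.duplicated(subset=['unique_id', 'period_id']).any()
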
Carl

1 Answer


How's about this:

  • update all your columns with the max data.
  • pick a row (say the first).

This should be much faster as it's vectorized.

In [11]: g = df.groupby(["unique_id", "period_id"], as_index=False)

In [12]: g.transform("max")
Out[12]:
            liq
index
19     0.017610
1903   1.726374
12404  1.726374
13330  1.397143
14090  2.000716
12404  0.090525
13330  1.397143
14090  2.000716

In [13]: df.update(g.transform("max"))

In [14]: g.nth(0)
Out[14]:
          unique_id  period_id       liq
index
19         CAN00CE0     199001  0.017610
1903       USA07WG0     199001  1.726374
13330      USA08DE0     199001  1.397143
14090      USA04U80     199001  2.000716
12404      USA07WG0     199002  0.090525
13330      USA08DE0     199002  1.397143
14090      USA04U80     199002  2.000716

Note: I'd like to use groupby first or last here, but I think there's a bug where they throw away your old index; I don't think they should... nth works, however.


An alternative is to first slice out the rows whose liq doesn't equal the group max:

(df[df["liq"] == g["liq"].transform("max")]  #  keep only max liq rows
 .groupby(["unique_id", "period_id"])
 .nth(0)
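
If ties can be broken arbitrarily, here is another sketch of the same idea (assuming a pandas version with sort_values): sort on liq and drop duplicates.

# Sort so the largest liq comes first, then keep the first row seen for each
# (unique_id, period_id) pair; ties are broken arbitrarily.
(df.sort_values("liq", ascending=False)
   .drop_duplicates(subset=["unique_id", "period_id"], keep="first"))
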
Andy Hayden
  • Hi Andy, interesting. Are you saying transform the subset of the data (as per code snippet two), and then update df with this? Or transform the entire dataset? More than two mins in, it's still executing g.transform("max"). More than 750k rows and more than 100 columns ... – Carl Dec 22 '15 at 17:42
  • @Carl The transform is the expensive part. BUT looking at the title again that may not be completely necessary... Do you just want the row with the highest liq field in each (unique_id, period_id) group? – Andy Hayden Dec 22 '15 at 18:00
  • That sounds right: (i) In case where duplicate fields: keep max(liq) field, discard the others. (ii) In case where no duplicate fields: keeping max should just keep the single row that is already there. – Carl Dec 22 '15 at 18:12
  • @Carl added an alternative as the last line. Will be a lot faster if that's all you need. – Andy Hayden Dec 22 '15 at 18:22
  • This still seems very slow: g = df.groupby(["unique_id", "period_id"], as_index=False); df[df["liq"] == g["liq"].transform("max")].groupby(["unique_id", "period_id"]).nth(0). Indeed, it failed with a memory error! It occurred to me that it might be much faster to identify those groups that have duplicates using count. Then I can apply the max transformation to that grouping, and then recombine the two into one. Indeed, I can use max, or simply sort by value (descending) on these and take the first element using g.nth(0)? – Carl Dec 22 '15 at 18:46
  • I think the problem is that g["liq"].transform("max") resets the index, losing the original index in the process? Certainly df[df["liq"] == g["liq"].transform("max")] results in a memory error ... I'm still struggling with this. Doing g.size() produces an effective count of the number of duplicates, and is very, very fast, so I am trying to use this to get the unique_id and period_id pairs where size >2, then apply the max idea above to those, which should be much faster, and then I'll need to recombine with the original data frame. – Carl Dec 22 '15 at 20:50
  • How's about you reset the index, then return the index of the maximum row for each group (for the groups with >1), and then iloc based on that (see the sketch after these comments). – Andy Hayden Dec 22 '15 at 21:51
  • Thanks Andy, still grappling with pandas. Any chance you could give me code? – Carl Dec 22 '15 at 23:12
  • Does this do it faster? If so will append to answer. https://gist.github.com/hayd/5aae84db3ce345e21190 – Andy Hayden Dec 23 '15 at 00:14
  • Thanks Andy, this is what I was trying to do, so please add it (apologies for the lateness, tied up with other debugging!). It is faster, but still slow: is there a vectorized solution? It is precisely max_mores = g2.agg(lambda x: x.loc[x.idxmax()]) that is slow. I'm not familiar with transform, but I used .size(); is there a drawback with that? – Carl Dec 24 '15 at 10:49
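
For reference, a sketch of the reset-index / idxmax / iloc approach described in the comments above (the linked gist is not reproduced here; the sketch assumes rows with NaN liq have already been dropped):

# Reset the index so row positions are explicit, take the positional index of
# the max-liq row in each (unique_id, period_id) group, then pull those rows
# out with iloc and restore the original index.
tmp = df.reset_index()
pos = tmp.groupby(["unique_id", "period_id"])["liq"].idxmax()
result = tmp.iloc[pos].set_index("index")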