pandas' drop_duplicates function accepts keep="first", keep="last", or keep=False. I want to be able to keep N duplicates instead: rather than keeping just one (with "first" or "last") or none (with False), I want to keep a set number of the duplicates.

Any help is appreciated!

You_Donut
    I suspect what you're looking for can be handled with [GroupBy.head()](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.head.html). If you want anything more specific than that, I'd need a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). – Ben Grossmann Apr 04 '22 at 18:30

2 Answers

Something like this could work, but you haven't specified whether you are using one or more column(s) to deduplicate:

n = 3
df.groupby('drop_dup_col').head(n)

This can be used to keep the first three duplicates based on a column value from the top (head) of the dataframe. If you want to start from the bottom of the df, you can use .tail(n) instead.

Change n to the number of rows you want to keep, and change 'drop_dup_col' to the name of the column you are using to deduplicate your df.
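As a minimal sketch of the above, assuming a small made-up DataFrame (the data and the 'value' column are hypothetical, only there for illustration):

```python
import pandas as pd

# Hypothetical sample data; 'drop_dup_col' is the placeholder column name from the answer.
df = pd.DataFrame({
    "drop_dup_col": ["a", "a", "a", "a", "b", "b", "c"],
    "value": [1, 2, 3, 4, 5, 6, 7],
})

n = 3
# Keep at most the first n rows of each group.
# The surviving rows keep their original order and index labels.
kept = df.groupby("drop_dup_col").head(n)
print(kept)
```

Here the fourth "a" row is dropped, while the "b" and "c" groups are left alone because they have fewer than n rows.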

Multiple columns can be specified in groupby using:

df.groupby(['col1','col5'])
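A rough example of the multi-column case, assuming made-up data (the 'other' column and all values are hypothetical). Rows count as duplicates only when both 'col1' and 'col5' match:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "col1": ["x", "x", "x", "x", "y"],
    "col5": [1, 1, 1, 2, 1],
    "other": [10, 20, 30, 40, 50],
})

n = 2
# Each distinct (col1, col5) pair is its own group; keep at most n rows per pair.
kept = df.groupby(["col1", "col5"]).head(n)
print(kept)
```

Only the third ("x", 1) row is removed, since that is the only pair occurring more than n times.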

Regarding the question in your comment:

It's a bit hard to implement: if you want to delete, say, 3 duplicates, a group must contain at least 3 duplicates. Otherwise, if only 2 duplicates occur, both are deleted and no row is kept.

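import pandas as pd

n = 3
# Size of each row's group, aligned with the original index
df['dup_count'] = df.groupby('drop_dup_col')['drop_dup_col'].transform('size')
# Rows belonging to groups that contain at least n rows
df2 = df.loc[df['dup_count'] >= n]
# Those rows now appear twice in the concat, so keep=False removes them
df3 = pd.concat([df, df2]).drop_duplicates(keep=False)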
Stijn

I believe a combination of groupby and tail(N) should work for this. For example, if you want to keep 4 duplicates in df['myColumnDuplicates']:

df.groupby('myColumnDuplicates').tail(4)

To be more precise, and to complement @Stijn's answer: tail(n) keeps the last n duplicated values found, while head(n) keeps the first n.
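To make the difference concrete, here is a small sketch on made-up data (the 'order' column and the values are hypothetical), using n = 2 so the contrast is easy to see:

```python
import pandas as pd

# Hypothetical sample data; 'order' just records the original row position.
df = pd.DataFrame({
    "myColumnDuplicates": ["a", "b", "a", "a", "b", "a"],
    "order": [0, 1, 2, 3, 4, 5],
})

# First two rows of each group, counted from the top of the DataFrame.
first_two = df.groupby("myColumnDuplicates").head(2)
# Last two rows of each group, counted from the bottom of the DataFrame.
last_two = df.groupby("myColumnDuplicates").tail(2)

print(first_two["order"].tolist())
print(last_two["order"].tolist())
```

The "b" group has only two rows, so both calls keep it whole; only the "a" group differs between the two results.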

Daniel Weigel