
I have a dataframe with 3 columns in Python:

Name1 Name2 Value
Juan  Ale   1
Ale   Juan  1

and I would like to eliminate duplicates based on the combination of the Name1 and Name2 columns, regardless of order.

In my example the two rows are equal (the names just appear in a different order), and I would like to delete the second row and keep only the first, so the end result should be:

Name1 Name2 Value
Juan  Ale   1

Any ideas will be really appreciated!
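
For anyone who wants to follow along, a minimal sketch that builds the example frame (assuming the usual pandas/numpy imports, which the answers below rely on):

import numpy as np   # used by the sort-based answer below
import pandas as pd

# the example frame from the question
df = pd.DataFrame({'Name1': ['Juan', 'Ale'],
                   'Name2': ['Ale', 'Juan'],
                   'Value': [1, 1]})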

3 Answers


By using np.sort with duplicated: sorting each pair along axis 1 puts the names into a canonical order, so duplicated catches the pairs no matter which name comes first. Negating the mask keeps the first occurrence, and .values is needed because the sorted frame gets a fresh RangeIndex that may not align with df's index:

df[~pd.DataFrame(np.sort(df[['Name1','Name2']].values, axis=1)).duplicated().values]
Out[614]: 
  Name1 Name2  Value
0  Juan   Ale      1

Performance

df=pd.concat([df]*100000)

%timeit df[~pd.DataFrame(np.sort(df[['Name1','Name2']].values, axis=1)).duplicated().values]
10 loops, best of 3: 69.3 ms per loop
%timeit df[~df[['Name1', 'Name2']].apply(frozenset, axis=1).duplicated()]
1 loop, best of 3: 3.72 s per loop
BENY
  • This is 100 times better than the frozenset method. – cs95 Jul 05 '18 at 01:57
  • @coldspeed Yep, even with the sort it is much faster. I'll keep your suggestion in mind, `apply` is less efficient. :-) – BENY Jul 05 '18 at 02:01
  • Thanks! It's super quick! I had to make a small change, since it was raising: `'IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)'` but this did the trick: `df[pd.DataFrame(np.sort(df[['Name1','Name2']].values,1)).duplicated().values]` – Daniel Sharp Jan 16 '20 at 09:00
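
To make the alignment pitfall from the last comment concrete, here is a small sketch (the name big is hypothetical): after a concat the index contains duplicates, so a boolean Series built on a fresh RangeIndex cannot be aligned back to the frame, while a plain ndarray bypasses index alignment entirely.

big = pd.concat([df] * 3)  # index is now 0, 1, 0, 1, 0, 1
mask = pd.DataFrame(np.sort(big[['Name1', 'Name2']].values, axis=1)).duplicated()
big[~mask.values]          # works: ndarray indexer, no alignment needed
# big[~mask]               # raises IndexingError: unalignable boolean Series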

You can convert to frozenset and use pd.DataFrame.duplicated.

res = df[~df[['Name1', 'Name2']].apply(frozenset, axis=1).duplicated()]

print(res)

  Name1 Name2  Value
0  Juan   Ale      1

frozenset is necessary instead of set because duplicated uses hashing to check for duplicates, and a plain set is mutable and therefore unhashable. A frozenset is hashable and, like a set, order-insensitive, which is exactly what an unordered pair needs.
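
A quick plain-Python illustration of that point:

# the two orderings collapse to the same hashable key
print(hash(frozenset(['Juan', 'Ale'])) == hash(frozenset(['Ale', 'Juan'])))  # True

# a plain set is unhashable, so duplicated could not hash it
try:
    hash({'Juan', 'Ale'})
except TypeError as err:
    print(err)  # unhashable type: 'set'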

This scales better with columns than with rows. For a large number of rows, use @Wen's sort-based algorithm (the first answer above).

jpp

I know I'm kinda late to this question, but here is my contribution anyway :)

You can also use get_dummies and add as a good way of building order-insensitive, hashable rows:

df[~(pd.get_dummies(df.Name1).add(pd.get_dummies(df.Name2), fill_value=0)).duplicated()]
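
To see why this works, a sketch of the intermediate frame on the question's example (passing dtype=int explicitly, since newer pandas versions default get_dummies to bool):

# each row becomes a count of names in the pair, so ('Juan', 'Ale') and
# ('Ale', 'Juan') produce identical rows that duplicated() can catch
counts = pd.get_dummies(df.Name1, dtype=int).add(
    pd.get_dummies(df.Name2, dtype=int), fill_value=0)
print(counts)
#    Ale  Juan
# 0    1     1
# 1    1     1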

Times are not as good as @Wen's answer, but it is still way faster than apply + frozenset:

df=pd.concat([df]*1000000)
%timeit df[~(pd.get_dummies(df.Name1).add(pd.get_dummies(df.Name2), fill_value=0)).duplicated()]
1.8 s ± 85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df[~pd.DataFrame(np.sort(df[['Name1','Name2']].values, axis=1)).duplicated().values]
1.26 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df[~df[['Name1', 'Name2']].apply(frozenset, axis=1).duplicated()]
1min 9s ± 684 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
rafaelc