Looking for an analogue to pd.DataFrame.drop_duplicates() where order does not matter

Question

I am looking for something like drop_duplicates() for a DataFrame, except that the order of the columns should not matter. That is, the function should treat a row with the entries 'a', 'b' as identical to a row with the entries 'b', 'a'. For example, given

df = pd.DataFrame([['a', 'b'], ['c', 'd'], ['a', 'b'], ['b', 'a']])

   0  1
0  a  b
1  c  d
2  a  b
3  b  a

I would like to obtain:

   0  1
0  a  b
1  c  d

Efficiency is the priority, as I run this on a huge dataset within a groupby operation.

score 1 · Accepted Answer · answered Jan 28 '18 at 11:37

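Sort the values within each row with np.sort first, and then drop the duplicates.

# Sort the values within each row, so that ['b', 'a'] becomes ['a', 'b'] (this overwrites df)
df[:] = np.sort(df.values, axis=1)
# Rows that differed only in order are now identical and are dropped
df.drop_duplicates()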

   0  1
0  a  b
1  c  d
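
Note that the assignment above overwrites df with the sorted values. If the original rows need to stay untouched, one possible variant (a minimal sketch, not benchmarked) builds the sorted copy only as a comparison key and uses its duplicated() mask to filter the original frame. The names `key` and `result` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['a', 'b'], ['c', 'd'], ['a', 'b'], ['b', 'a']])

# Row-wise sorted copy, used only as a comparison key (df itself is not modified).
key = pd.DataFrame(np.sort(df.values, axis=1), index=df.index)

# duplicated() flags every repeat of an earlier key; keep the rows that are not flagged.
result = df[~key.duplicated()]
print(result)
```

For the example data this keeps rows 0 and 1, and df still contains the original ['b', 'a'] row afterwards.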

cs95

Just wondering, why do you use the `:` in the first line? Why not just `df = np.sort(df.values, axis=1)`? – splinter Jan 28 '18 at 11:44

@splinter It's a little technique I use for updating the DataFrame in place without creating a new one. `np.sort` returns an array, so a plain `df = ...` would rebind `df` to that array. I assign it back into `df` using `df[:] = ...` instead. – cs95 Jan 28 '18 at 11:44
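
The difference discussed in these comments can be checked with a small sketch. The names `same_object` and `arr` are only there for illustration, and the behaviour shown is what I would expect from current pandas and NumPy.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['b', 'a'], ['c', 'd']])
same_object = df  # second reference to the same DataFrame

# Slice assignment writes the sorted values into the existing DataFrame.
df[:] = np.sort(df.values, axis=1)
still_frame = isinstance(df, pd.DataFrame) and df is same_object

# Plain assignment of the np.sort result would give a NumPy array instead.
arr = np.sort(df.values, axis=1)

print(type(df).__name__, type(arr).__name__)
```

With `df[:] = ...` the index, the column labels and any other references to the frame are kept, whereas `df = np.sort(...)` leaves `df` pointing at a bare array with no drop_duplicates() method.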
