I would like to do something similar to dropping duplicates from a DataFrame, except that the order of the columns should not matter. That is, the function should treat a row with the entries 'a', 'b' as identical to a row with the entries 'b', 'a'. For example, given

import pandas as pd
df = pd.DataFrame([['a', 'b'], ['c', 'd'], ['a', 'b'], ['b', 'a']])

   0  1
0  a  b
1  c  d
2  a  b
3  b  a

I would like to obtain:

   0  1
0  a  b
1  c  d

Efficiency is the priority, as I run this on a huge dataset within a groupby operation.

splinter

1 Answer

Call np.sort along axis=1 first, so that the values within each row are in a consistent order, and then drop duplicates.

import numpy as np
# Sort the values within each row, so that ('b', 'a') becomes ('a', 'b').
df[:] = np.sort(df.values, axis=1)
df.drop_duplicates()

   0  1
0  a  b
1  c  d
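
Note that the snippet above overwrites df with the row-sorted values. If the original rows need to stay untouched, a minimal sketch of a non-mutating variant follows. The names keys and result are only illustrative, and it assumes the values within a row can be compared with each other (all strings here).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['a', 'b'], ['c', 'd'], ['a', 'b'], ['b', 'a']])

# Build a row-sorted copy to use as the comparison key; df itself is not modified.
keys = pd.DataFrame(np.sort(df.values, axis=1), index=df.index)

# Mark rows whose sorted key has already been seen, then keep the rest
# of the rows from the original DataFrame.
result = df[~keys.duplicated()]
print(result)
```

This keeps the first occurrence of each unordered pair with its original column order.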
cs95
  • Just wondering, why do you use the `:` in the first row? Why not just `df = np.sort(df.values, axis=1)`? – splinter Jan 28 '18 at 11:44
  • @splinter It's a little technique I use for updating the dataframe in place without creating a new one. `np.sort` returns an array. I assign it back to `df` using `df[:] = ...` – cs95 Jan 28 '18 at 11:44
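
To illustrate the difference discussed in these comments, here is a small sketch. The index labels 'x', 'y' and the column names 'left', 'right' are made up for the example. Plain assignment would rebind the name to a bare NumPy array, while slice assignment writes the values into the existing DataFrame and keeps its labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['b', 'a'], ['d', 'c']],
                  index=['x', 'y'], columns=['left', 'right'])

# np.sort returns a plain NumPy array, with no index or column labels.
arr = np.sort(df.values, axis=1)
print(type(arr))

# Slice assignment stores the sorted values in the existing DataFrame,
# so df is still a DataFrame with the same index and columns.
df[:] = arr
print(df)
```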