
I couldn't find anything addressing this issue; this is the closest I guess, but I can't figure out how to implement the ideas here.

Somehow I found myself looking at a dataframe like this:

import pandas as pd

data = [['apple', 'banana', 'pear', 'mango'],
        ['pasta', 'pasta', 'pasta', 'pasta'],
        ['onion', 'tomato', 'celery', 'potato'],
        ['dog', 'dog', 'dog', 'dog']]
df = pd.DataFrame(data)
df

Which outputs:

        0   1         2     3
0   apple   banana  pear    mango
1   pasta   pasta   pasta   pasta
2   onion   tomato  celery  potato
3   dog     dog     dog     dog

The 2nd and 4th rows (index 1 and 3) have identical values across all 4 columns, and I would like to just get rid of them, so the final df looks like this:

        0   1         2     3
0   apple   banana  pear    mango
1   onion   tomato  celery  potato

Using drop_duplicates() doesn't do anything since there are no duplicate rows. Same with duplicated().
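For the record, a quick sanity check (a minimal sketch rebuilding the frame above) confirms that `drop_duplicates()` leaves it untouched, since no two rows equal each other:

```python
import pandas as pd

# rebuild the example frame from the question
data = [['apple', 'banana', 'pear', 'mango'],
        ['pasta', 'pasta', 'pasta', 'pasta'],
        ['onion', 'tomato', 'celery', 'potato'],
        ['dog', 'dog', 'dog', 'dog']]
df = pd.DataFrame(data)

# drop_duplicates() compares whole rows against each other;
# here every row is distinct, so nothing is removed
print(df.drop_duplicates().shape)  # still (4, 4)
```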

The following is the only idea (if you can call it that) that I could think of. If I run

df.transpose()

I get

        0   1       2        3
0   apple   pasta   onion   dog
1   banana  pasta   tomato  dog
2   pear    pasta   celery  dog
3   mango   pasta   potato  dog

Now if I run duplicated() on, say, the 4th column of the transposed frame:

df.transpose().duplicated(3)

I get

0    False
1     True
2     True
3     True
dtype: bool

So maybe I can come up with a function that would transpose the df, run duplicated() on each column, drop the column if all values, except for the first, come back as True and then transpose the df back to its original shape.

But I don't know how to do that; also, I would be interested to know whether there is a more elegant way of getting to the same place.
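For what it's worth, that transpose idea can be sketched directly (a rough sketch, assuming the example frame from the top of the question): transpose, flag each column whose values after the first are all duplicates of the first, and keep only the surviving rows:

```python
import pandas as pd

data = [['apple', 'banana', 'pear', 'mango'],
        ['pasta', 'pasta', 'pasta', 'pasta'],
        ['onion', 'tomato', 'celery', 'potato'],
        ['dog', 'dog', 'dog', 'dog']]
df = pd.DataFrame(data)

t = df.transpose()
# a row of the original frame is "constant" when, in the transposed frame,
# every value of the corresponding column after the first duplicates the first
keep = [col for col in t.columns
        if not t.duplicated(subset=col).iloc[1:].all()]
print(df.loc[keep])  # rows 0 and 2 survive
```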

rafaelc
Jack Fleeting
  • you can use : `df[df.nunique(1)>1]` – anky Sep 12 '19 at 15:14
  • @anky_91 - Wow, that was both fast and correct! Can you please explain, for those of us still struggling with pandas (preferably in an answer), how that accomplishes the task? – Jack Fleeting Sep 12 '19 at 15:17
  • 1
    @rafaelc - You are right on both counts. `df.nunique` is more elegant, but in other situations the ability to test for `duplicated()` across columns may come handy! – Jack Fleeting Sep 12 '19 at 15:20
  • As requested, I posted an answer with a detailed explanation. Please let me know if something is unclear – anky Sep 12 '19 at 15:26

1 Answer


You can use df.nunique() along axis=1 and keep the rows that have more than 1 unique value across their columns:

Per docs: nunique()

Count distinct observations over requested axis.

Hence if we test:

df.nunique(1)

This outputs:

0    4
1    1
2    4
3    1
dtype: int64

Naturally

df.nunique(1)>1

Would return:

0     True
1    False
2     True
3    False
dtype: bool

So with the help of boolean indexing we can just do:

df[df.nunique(1)>1]

Which returns the desired output:

       0       1       2       3
0  apple  banana    pear   mango
2  onion  tomato  celery  potato
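Putting it together as a self-contained snippet, with an optional `reset_index(drop=True)` (my addition, not part of the one-liner above) in case you want a clean 0..n index afterwards:

```python
import pandas as pd

data = [['apple', 'banana', 'pear', 'mango'],
        ['pasta', 'pasta', 'pasta', 'pasta'],
        ['onion', 'tomato', 'celery', 'potato'],
        ['dog', 'dog', 'dog', 'dog']]
df = pd.DataFrame(data)

# keep rows with more than one distinct value, then renumber the index
out = df[df.nunique(axis=1) > 1].reset_index(drop=True)
print(out)
```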
anky