I couldn't find anything addressing this issue; this is the closest I could find, but I can't figure out how to apply the ideas there.
Somehow I found myself looking at a dataframe like this:
import pandas as pd
data = [['apple', 'banana', 'pear', 'mango'], ['pasta', 'pasta', 'pasta', 'pasta'], ['onion', 'tomato', 'celery', 'potato'], ['dog', 'dog', 'dog', 'dog']]
df = pd.DataFrame(data)
df
Which outputs:
       0       1       2       3
0  apple  banana    pear   mango
1  pasta   pasta   pasta   pasta
2  onion  tomato  celery  potato
3    dog     dog     dog     dog
The 2nd and 4th rows have identical values across all 4 columns and I would like to just get rid of them, so the final df looks like this:
       0       1       2       3
0  apple  banana    pear   mango
1  onion  tomato  celery  potato
Using drop_duplicates() doesn't do anything since there are no duplicate rows. Same with duplicated().
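For what it's worth, a quick check along these lines (rebuilding the same df so the snippet runs on its own) shows both calls leaving the frame untouched:

```python
import pandas as pd

data = [
    ["apple", "banana", "pear", "mango"],
    ["pasta", "pasta", "pasta", "pasta"],
    ["onion", "tomato", "celery", "potato"],
    ["dog", "dog", "dog", "dog"],
]
df = pd.DataFrame(data)

# duplicated() compares whole rows against earlier rows, not values within a row,
# so no row is flagged here.
print(df.duplicated().any())

# drop_duplicates() therefore returns all four rows.
print(df.drop_duplicates().shape)
```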
The following is the only idea (if you can call it that) that I could think of. If I run
df.transpose()
I get
        0      1       2    3
0   apple  pasta   onion  dog
1  banana  pasta  tomato  dog
2    pear  pasta  celery  dog
3   mango  pasta  potato  dog
Now if I run duplicated() on, say, the 4th column of the transposed df:
df.transpose().duplicated(3)
I get
0    False
1     True
2     True
3     True
dtype: bool
So maybe I can come up with a function that would transpose the df, run duplicated() on each column, drop the column if all values, except for the first, come back as True, and then transpose the df back to its original shape.
But I don't know how to do that; also, I would be interested to know if there is a more elegant way of getting to the same place.
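To make the idea concrete, here is a minimal sketch of what I have in mind, plus one possible shortcut. The helper names drop_constant_rows_via_transpose and drop_constant_rows are made up for illustration. The second helper assumes that counting distinct values per row with nunique(axis=1) is an acceptable stand-in for the transpose approach; note that nunique ignores NaN by default, so rows containing missing values may behave differently.

```python
import pandas as pd

data = [
    ["apple", "banana", "pear", "mango"],
    ["pasta", "pasta", "pasta", "pasta"],
    ["onion", "tomato", "celery", "potato"],
    ["dog", "dog", "dog", "dog"],
]
df = pd.DataFrame(data)


def drop_constant_rows_via_transpose(frame):
    """Hypothetical helper: the transpose / duplicated() idea from the question."""
    t = frame.transpose()
    # A transposed column is constant when every value after the first
    # is flagged as a duplicate of an earlier value.
    constant = [col for col in t.columns if t[col].duplicated().iloc[1:].all()]
    # Drop those columns, transpose back, and renumber the remaining rows.
    return t.drop(columns=constant).transpose().reset_index(drop=True)


def drop_constant_rows(frame):
    """Hypothetical helper: keep rows with more than one distinct value."""
    # nunique(axis=1) counts the distinct (non-NaN) values in each row.
    return frame[frame.nunique(axis=1) > 1].reset_index(drop=True)


result = drop_constant_rows(df)
print(result)
```

Both helpers should leave only the apple and onion rows for this df; the second avoids the two transposes, which seems closer to the "more elegant" route I am after.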