Remove duplicate rows based on values in every column using pandas

Question

I have a pandas DataFrame of different permutations of values (toy version below; my actual df contains more columns and rows).

My goal is to remove rows that share a value with another row in the same column, and critically this must be checked for every column.

import itertools
import pandas as pd

check = list(itertools.permutations([1, 2, 3]))
test = pd.DataFrame(check, columns=['A', 'B', 'C'])

index   A   B   C
0       1   2   3
1       1   3   2
2       2   1   3
3       2   3   1
4       3   1   2
5       3   2   1

Desired output:

index   A   B   C
0       1   2   3
3       2   3   1
4       3   1   2

For example, I want to drop row 1 because both it and row 0 contain a 1 in the A column. I also want to drop row 2 because it and row 0 contain a 3 in the C column. And I want to drop row 5 because it and row 4 contain a 3 in the A column and because it and row 0 contain a 2 in the B column.

In other words, I am trying to generate a dataframe that contains unique combinations, not permutations.

The way you're describing, sounds like pandas DataFrame.drop_duplicates() solves your problem. Doesn't it? https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html — hknjj, Nov 11 '22 at 06:30
hmm thanks, I've tried `drop_duplicates` on this toy version, but the problem is that each row is in fact unique. `drop_duplicates` doesn't appear to check whether values are repeating _across_ all the rows. If I use e.g., `subset='A'`, again it doesn't check that there are duplicates across the other columns. Unless I'm misunderstanding how to use the function? — psychcoder, Nov 11 '22 at 06:37
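The point made in the comment above can be checked directly on the toy data. This is a minimal sketch that rebuilds `test` from the question; the names `deduped` and `by_a` are introduced here only for illustration.

```python
import itertools
import pandas as pd

test = pd.DataFrame(list(itertools.permutations([1, 2, 3])), columns=["A", "B", "C"])

# Every row is a distinct permutation, so drop_duplicates() removes nothing.
deduped = test.drop_duplicates()
print(len(test), len(deduped))

# subset="A" keeps the first row for each value of A and ignores B and C,
# so values can still repeat within those columns.
by_a = test.drop_duplicates(subset="A")
print(by_a)
print(by_a["B"].is_unique, by_a["C"].is_unique)
```

With `subset="A"`, column A ends up unique but B and C still contain repeats, which is why `drop_duplicates` alone does not give the desired output.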

score 0 · Answer 1 · answered Nov 11 '22 at 07:20

Sorry, I couldn't find a way to do this without a loop:

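# Compare each row with the rows above it that are still kept; drop it on any same-column match.
for i in test.index:
    # test.loc[:i-1] is label-based: rows up to label i-1 (empty for the first row)
    if test.loc[:i-1].eq(test.loc[i]).any().any():
        test.drop(i, inplace=True)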

output(test):

    A   B   C
0   1   2   3
3   2   3   1
4   3   1   2

Panda Kim


Timus · Answer 2 · 2022-11-14T10:20:17.363

Here's an alternative, which also doesn't look nice, but is a lot faster than the other solution (at least in the tests I've done here):

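def check(row, col_items):
    # Keep the row only if none of its values was already seen in the same column.
    if any(item in items for item, items in zip(row, col_items)):
        return False
    # Record the kept row's values so later rows are compared against them.
    for item, items in zip(row, col_items):
        items.add(item)
    return True

# One set of already-seen values per column
col_items = [set() for _ in test.columns]
test = test[test.apply(check, axis=1, args=(col_items,))]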

You could make it a little bit faster by replacing .apply with a loop over the rows, but the gain is small.
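A loop-based variant of the same idea might look like the sketch below. The function name `first_non_clashing_rows` and the use of `itertuples` are assumptions made for illustration, not code from the answer, and its speed relative to `.apply` was not measured here. It assumes the values are hashable.

```python
import itertools
import pandas as pd

def first_non_clashing_rows(df):
    """Keep each row whose values were not yet seen in the same column of a kept row."""
    seen = [set() for _ in df.columns]  # one set of seen values per column
    keep = []
    for row in df.itertuples(index=False, name=None):
        clash = any(value in col_seen for value, col_seen in zip(row, seen))
        keep.append(not clash)
        if not clash:
            # Only kept rows contribute values that later rows are checked against.
            for value, col_seen in zip(row, seen):
                col_seen.add(value)
    return df.loc[keep]

test = pd.DataFrame(list(itertools.permutations([1, 2, 3])), columns=["A", "B", "C"])
result = first_non_clashing_rows(test)
print(result)
```

Unlike reassigning `test` in place, this returns a filtered frame and leaves the input untouched, which makes it easier to rerun on the same data.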
