Compare 2 dataframe columns and delete rows

Question

I have 2 dataframes with different lengths. I would like to compare and delete unavailable values(rows) from df1.

Here is an example:

df1 = pd.DataFrame({'Filename':['image1','image1','image2','image3'], 
                    'Name':['Dog','Cat','Cat', 'Cat'],
                     'values':['2','3','4','5']  })

df2 = pd.DataFrame({'Filename':['image1','image2','image3'], 
                    'Name':['Dog','Cat', 'Cat'],
                     'values':['5','6','7']  })

df1

Filename    Name    values

0   image1  Dog 2

1   image1  Cat 3

2   image2  Cat 4

3   image3  Cat 5

df2

Filename    Name    values

0   image1  Dog 5

1   image2  Cat 6

2   image3  Cat 7

I'm expecting 2 dataframes(df1 and df2) with same length and with same Filename and Name as below. My goal is to compare the values column of df1 and df2 with same Filename and Name.

df1

Filename    Name    values

0   image1  Dog 2

2   image2  Cat 4

3   image3  Cat 5

df2

Filename    Name    values

0   image1  Dog 5

1   image2  Cat 6

2   image3  Cat 7

I have tried comparing each row with corresponding df and delete if not available. (This is clearly not the way to do)

for i, j in df1.iterrows():
    for m, n in df1.iterrows():
        if m['Filename'] == i['Filename']:
            if m['LabelName'] == i['LabelName']:
                pass
            else:
                print('delete')
                df2=df2.drop(i)
                df1=df1.sort_values('Filename')
                df2=df2.sort_values('Filename')
                
            break

I also tried to implement groupby and compare with rows and i encountered ValueError: Can only compare identically-labeled Series objects since indexes won't be same.

Can someone please help me with this? I tried searching for similar problems but did not come across any.

you should use [pd.merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html). Something like `df1.merge(df2, on=['Filename', 'Name'], how='inner')` should work. However, your question is surely a duplicate, you could have found the answer elsewhere :) — Mrml91, Sep 06 '20 at 09:19

IoaTzimas · Answer 1 · 2020-09-06T09:34:31.893

A solution not so pythonic, however it does the work:

l1=[(df1.Filename.iloc[i],df1.Name.iloc[i]) for i in range(len(df1))]
l2=[(df2.Filename.iloc[i],df2.Name.iloc[i]) for i in range(len(df2))]
lfin=[i for i in l1 if i in l2]
   
for i in df1.index:
    if (df1.Filename.loc[i], df1.Name.loc[i]) not in lfin:
        df1.drop(i, inplace=True)

for i in df2.index:
    if (df2.Filename.loc[i], df2.Name.loc[i]) not in lfin:
        df2.drop(i, inplace=True)

Amine Soudri · Answer 2 · 2020-09-06T09:26:55.217

0

hey i think this work pretty well

df3 = df2.set_index('Filename')
df1[df1.apply(lambda x : df3.loc[x.Filename]['Name']== x.Name , axis =1 ) ]

and if you want to lose the index and reset it you can add

df3 = df2.set_index('Filename')
df1[df1.apply(lambda x : df3.loc[x.Filename]['Name']== x.Name , axis =1 )  ].reset_index().drop('index' , axis=1)

edited Sep 06 '20 at 09:26

answered Sep 06 '20 at 09:23

Amine Soudri

21
2

Thanks Amine. I guess i found my answer in comments section – Praveen Sep 06 '20 at 09:24

Compare 2 dataframe columns and delete rows

2 Answers2