-1

I have 2 dataframes with different lengths. I would like to compare and delete unavailable values(rows) from df1.

Here is an example:

df1 = pd.DataFrame({'Filename':['image1','image1','image2','image3'], 
                    'Name':['Dog','Cat','Cat', 'Cat'],
                     'values':['2','3','4','5']  })

df2 = pd.DataFrame({'Filename':['image1','image2','image3'], 
                    'Name':['Dog','Cat', 'Cat'],
                     'values':['5','6','7']  })

df1

Filename    Name    values

0   image1  Dog 2

1   image1  Cat 3

2   image2  Cat 4

3   image3  Cat 5

df2

Filename    Name    values

0   image1  Dog 5

1   image2  Cat 6

2   image3  Cat 7

I'm expecting 2 dataframes(df1 and df2) with same length and with same Filename and Name as below. My goal is to compare the values column of df1 and df2 with same Filename and Name.

df1

Filename    Name    values

0   image1  Dog 2

2   image2  Cat 4

3   image3  Cat 5

df2

Filename    Name    values

0   image1  Dog 5

1   image2  Cat 6

2   image3  Cat 7

I have tried comparing each row with corresponding df and delete if not available. (This is clearly not the way to do)

for i, j in df1.iterrows():
    for m, n in df1.iterrows():
        if m['Filename'] == i['Filename']:
            if m['LabelName'] == i['LabelName']:
                pass
            else:
                print('delete')
                df2=df2.drop(i)
                df1=df1.sort_values('Filename')
                df2=df2.sort_values('Filename')
                
            break

I also tried to implement groupby and compare with rows and i encountered ValueError: Can only compare identically-labeled Series objects since indexes won't be same.

Can someone please help me with this? I tried searching for similar problems but did not come across any.

Praveen
  • 267
  • 1
  • 5
  • 19

2 Answers2

1

A solution not so pythonic, however it does the work:

l1=[(df1.Filename.iloc[i],df1.Name.iloc[i]) for i in range(len(df1))]
l2=[(df2.Filename.iloc[i],df2.Name.iloc[i]) for i in range(len(df2))]
lfin=[i for i in l1 if i in l2]
   
for i in df1.index:
    if (df1.Filename.loc[i], df1.Name.loc[i]) not in lfin:
        df1.drop(i, inplace=True)

for i in df2.index:
    if (df2.Filename.loc[i], df2.Name.loc[i]) not in lfin:
        df2.drop(i, inplace=True)
IoaTzimas
  • 10,538
  • 2
  • 13
  • 30
0

hey i think this work pretty well

df3 = df2.set_index('Filename')
df1[df1.apply(lambda x : df3.loc[x.Filename]['Name']== x.Name , axis =1 ) ]

and if you want to lose the index and reset it you can add

df3 = df2.set_index('Filename')
df1[df1.apply(lambda x : df3.loc[x.Filename]['Name']== x.Name , axis =1 )  ].reset_index().drop('index' , axis=1)