I sorted two dataframes by id that contain the same values but am getting that they are not equal

Question

I have two dataframes:

df1

 ID        Name
 15        Max
 7         Stacy
 3         Frank
 2         Joe

df2

 ID        Name
 2         Abigail
 3         Josh
 15        Jake 
 7         Brian

I sorteded them by doing

df1 = df1.sort_values(by=['ID'])
df2 = df2.sort_values(by=['ID'])

to get

df1

 ID        Name
 2         Joe  
 3         Frank
 7         Stacy
 15        Max

df2

 ID        Name
 2         Abigail
 3         Josh
 7         Brian
 15        Jake

However when I check that the 'ID' column is the same across both dataframes by doing

print(df1['ID'].equals(df2['ID']))

it returns False, why is this so? Is there another method I can use to return that the two columns are equal?

you could `print(df1['ID'] == df2['ID'])` to figure out what elements are different. — Grisha, Feb 17 '23 at 16:11

score 0 · Answer 1 · answered Feb 17 '23 at 16:11

They're still being compared on the original indices:

import io

df1 = pd.read_csv(io.StringIO('''ID        Name
15        Max
7         Stacy
3         Frank
2         Joe''', sep='\s+'))
df2 = pd.read_csv(io.StringIO('''ID        Name
2         Abigail
3         Josh
15        Jake 
7         Brian'''), sep='\s+')

df1 = df1.sort_values(by=['ID'])
df2 = df2.sort_values(by=['ID'])

What is basically happening is that it is checking whether ID and ID_other in the following data frame are equal; they are not.

>>> df1.join(df2, rsuffix='_other')
   ID   Name  ID_other Name_other
3   2    Joe         7      Brian
2   3  Frank        15       Jake
1   7  Stacy         3       Josh
0  15    Max         2    Abigail

If you want to check equality without regard to index, consider:

df1['ID'].values == df2['ID'].values

Or reset indices on both sides then use eq.

score 0 · Answer 2 · answered Feb 17 '23 at 16:12

The frames have most probably different indices. You should do:

df1 = df1.sort_values(by=['ID']).reset_index(drop=True)
df2 = df2.sort_values(by=['ID']).reset_index(drop=True)
print(df1['ID'].equals(df2['ID']))  # this returns True

Alternative:


import pandas as pd

df1 = pd.DataFrame({'ID': [15, 7, 3, 2], 'Name': ['Max', 'Stacy', 'Frank', 'Joe']})
df2 = pd.DataFrame({'ID': [2, 3, 15, 7], 'Name': ['Abigail', 'Josh', 'Jake', 'Brian']})

df1 = df1.sort_values(by=['ID']).reset_index(drop=True)
df2 = df2.sort_values(by=['ID']).reset_index(drop=True)

print(df1['ID'].equals(df2['ID']))  # should return True

I'mahdi · Answer 3 · 2023-02-17T16:20:25.310

You don't need to sort. You can use pandas.DataFrame.set_index the use pandas.DataFrame.eq.

df1.set_index('ID').eq(df2.set_index('ID'))

For example if df1 and df2 like:

>>> print(df1)
#    ID   Name
# 0  15    Max
# 1   7  Stacy
# 2   3  Frank
# 3   2    Joe

>>> print(df2)
#    ID     Name
# 0   2  Abigail
# 1   3     Josh
# 2  15      Max
# 3   7    Brian

>>> df1.set_index('ID').eq(df2.set_index('ID'))
     Name
ID       
2   False
3   False
7   False
15   True

I sorted two dataframes by id that contain the same values but am getting that they are not equal

3 Answers3