I am trying to compare two csv files (not sorted) and would like to get a report like SAS Proc compare. I am using datacompy and sorting the dataframe before comparing but datacompy reports are showing that 'no rows are in common'.
Please let me know what I am missing in the below snippet.
I have tried sorting, reindexing and also instead of using join_columns, I have tried on_index=True as well.
from io import StringIO
import pandas as pd
import datacompy
data1 = """name,age,loc
ABC,123,LON
EFG,456,MAA
"""
data2 = """name,age,loc
EFG,457,MAA
ABC,124,LON
"""
df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))
df1.sort_values(by=['name','age','loc']).reindex
df2.sort_values(by=['name','age','loc']).reindex
compare = datacompy.Compare(
df1,
df2,
join_columns=['name','age','loc'], #You can also specify a list of columns
abs_tol=0.0001,
rel_tol=0,
df1_name='original',
df2_name='new')
compare.matches()
print(compare.report())
Expected result is
data1
name,age,loc
ABC,123,LON
EFG,456,MAA
data2
name,age,loc
ABC,123,LON
EFG,457,MAA
And the report should be like for the age column the max diff is 1 and all others are matching fine.