Sorting csv / dataframes using Python and DataCompy reports

Question

I am trying to compare two csv files (not sorted) and would like to get a report like SAS Proc compare. I am using datacompy and sorting the dataframe before comparing but datacompy reports are showing that 'no rows are in common'.

Please let me know what I am missing in the below snippet.

I have tried sorting, reindexing and also instead of using join_columns, I have tried on_index=True as well.

from io import StringIO
import pandas as pd
import datacompy

data1 = """name,age,loc
ABC,123,LON
EFG,456,MAA
"""

data2 = """name,age,loc
EFG,457,MAA
ABC,124,LON
"""

df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))

df1.sort_values(by=['name','age','loc']).reindex
df2.sort_values(by=['name','age','loc']).reindex

compare = datacompy.Compare(
    df1,
    df2,
    join_columns=['name','age','loc'],  #You can also specify a list of columns
    abs_tol=0.0001,
    rel_tol=0,
    df1_name='original',
    df2_name='new')
compare.matches()

print(compare.report())

Expected result is

data1

name,age,loc

ABC,123,LON

EFG,456,MAA

data2

name,age,loc

ABC,123,LON

EFG,457,MAA

And the report should be like for the age column the max diff is 1 and all others are matching fine.

score 3 · Answer 1 · answered Aug 21 '19 at 15:11

You're joining on all three columns, and should only join on name. In your join change to the following:

compare = datacompy.Compare(
    df1,
    df2,
    join_columns=['name'],  #You can also specify a list of columns
    abs_tol=0.0001,
    rel_tol=0,
    df1_name='original',
    df2_name='new')
compare.matches()

print(compare.report())

Which will produce the following output:

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0  original        3     2
1       new        3     2

Column Summary
--------------

Number of columns in common: 3
Number of columns in original but not in new: 0
Number of columns in new but not in original: 0

Row Summary
-----------

Matched on: name
Any duplicates on match values: No
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 2
Number of rows in original but not in new: 0
Number of rows in new but not in original: 0

Number of rows with some compared columns unequal: 2
Number of rows with all compared columns equal: 0

Column Comparison
-----------------

Number of columns compared with some values unequal: 1
Number of columns compared with all values equal: 2
Total number of values which compare unequal: 2

Columns with Unequal Values or Types
------------------------------------

  Column original dtype new dtype  # Unequal  Max Diff  # Null Diff
0    age          int64     int64          2       1.0            0

Sample Rows with Unequal Values
-------------------------------

  name  age (original)  age (new)
1  EFG             456        457
0  ABC             123        124

Thanks moe_95, works well for join on single column. is there anyway I can join on multiple columns ? — AKS, Aug 27 '19 at 11:59
No problem, you can speficy multiple columns in the `join_columns`, i.e. `join_columns=['name','age']` would work. What you can't do is join on all columns in each df. Also it would be great if you can accept this as the correct answer if it works for you! — moe_95, Aug 28 '19 at 10:51

Sorting csv / dataframes using Python and DataCompy reports

1 Answers1