-1

I have 2 xlsx files and need to print the differences in each cell in the file. The code that I am using now is working but I need to ignore the first column in each of the xlsx files and I am not sure how to add that exception to the code I am currently using.

ds1 = xlrd.open_workbook("PATH1")
ds2 = xlrd.open_workbook("PATH2")
SHEET1 = ds1.sheet_by_index(0)
SHEET1 = ds2.sheet_by_index(0)

for rownum in range(max(POB_ds1.nrows, POB_ds2.nrows)):
if rownum < SHEET1_ds1.nrows:
    row_rb1 = SHEET1_ds1.row_values(rownum)
    row_rb2 = SHEET1_ds2.row_values(rownum)

    for colnum, (c1, c2) in enumerate(zip_longest(row_rb1, row_rb2)):
        if c1 != c2:
            print ("Row {} Col {} - {} != {}".format(rownum+1, colnum+1, c1, c2))
    else:
    print ("Row {} missing".format(rownum+1))
  • 1
    You have `colnum`, did you try to use it in an `if` condition? hint: it will be `0` for the first column – DeepSpace Jun 08 '22 at 15:19
  • I'm not sure where to add an if condition if the column equals "name of column" Or would there be any way to ignore the first column of the table when importing the xlsx file? @DeepSpace – Ashwin Shaker Jun 08 '22 at 16:14
  • try to delete the first column of each dataframe, and if you need it later, try to store them in a variable before deleting and concatenating later – Rafael MR Jun 08 '22 at 16:23

1 Answers1

0

How about this approach?

import pandas as pd
import numpy as np

# Next, read in both of our excel files into dataframes
# Showing examples of several parameters...in your case maybe not all parameters are necessary
df1 = pd.read_excel('C:\\Users\\Excel\\Desktop\\Coding\\Python\\Excel\\Compare Two Excel Files\\Book1.xlsx', 'Sheet1', na_values=['NA'], header=0,  skiprows=0, nrows=1000, usecols="B:Z")
df2 = pd.read_excel('C:\\Users\\Excel\\Desktop\\Coding\\Python\\Excel\\Compare Two Excel Files\\Book2.xlsx', 'Sheet1', na_values=['NA'], header=0,  skiprows=0, nrows=1000, usecols="B:Z")

# Order by account number and reindex so that it stays this way.


df1.sort_index(by=["H1"])
df1=df1.reindex()
df2.sort_index(by=["H1"])
df2=df2.reindex()

# Create a diff function to show what the changes are.

def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)

# Merge the two datasets together in a Panel . I will admit that I haven’t fully grokked the panel concept yet but the only way to learn is to keep pressing on!

diff_panel = pd.Panel(dict(df1=df1,df2=df2))

# Once the data is in a panel, we use the report_diff function to highlight all the changes. I think this is a very intuitive way (for this data set) to show changes. It is relatively simple to see what the old value is and the new one. For example, someone could easily check and see why that postal code changed for account number 880043.

diff_output = diff_panel.apply(report_diff, axis=0)
diff_output.tail()


# One of the things we want to do is flag rows that have changes so it is easier to see the changes. We will create a has_change function and use apply to run the function against each row.

def has_change(row):
    if "--->" in row.to_string():
        return "Y"
    else:
        return "N"


diff_output['has_change'] = diff_output.apply(has_change, axis=1)
diff_output.tail()

# It is simple to show all the columns with a change:

diff_output[(diff_output.has_change == 'Y')]


# Finally, let’s write it out to an Excel file:

diff_output[(diff_output.has_change == 'Y')].to_excel('C:\\Users\\Excel\\Desktop\\Coding\\Python\\Excel\\Compare Two Excel Files\\diff.xlsx')

See the link below for all details.

https://pbpython.com/excel-diff-pandas.html

ASH
  • 20,759
  • 19
  • 87
  • 200
  • this works but i need to print out the differences like the code I showed instead of having it display on an excel sheet – Ashwin Shaker Jun 08 '22 at 22:51