[Using Python 3] I want to compare the contents of two CSV files and have the script print whether they are the same. In other words, it should tell me whether all rows match and, if not, how many rows are mismatched.
I would also like the flexibility to change the code later so that it writes all unmatched rows to another file.
Furthermore, although the two files should technically contain exactly the same rows, the rows may not be in the same order (except for the first row, which contains the headers).
The input files look something like this:
field1 field2 field3 field4 ...
string float float string ...
string float float string ...
string float float string ...
string float float string ...
string float float string ...
... ... ... ... ...
The code I am currently running is shown below, but to be honest I am not sure it is the best (most Pythonic) way. I am also not sure what the try: while 1: ... block is doing. This code is the result of my scouring the forum and the Python docs. So far the code takes a very long time to run.
As I am very new to this, I am keen to receive any feedback on the code, and would appreciate an explanation of any recommendations you make.
Code:
import csv
import difflib

'''
Checks the content of two csv files and returns a message.
If there is a mismatch, it will output the number of mismatches.
'''
def compare(f1, f2):
    file1 = open(f1).readlines()
    file2 = open(f2).readlines()
    diff = difflib.ndiff(file1, file2)
    count = 0
    try:
        while 1:
            count += 1
            next(diff)
    except:
        pass
    return 'Checked {} rows and found {} mismatches'.format(len(file1), count)

print (compare('outfile.csv', 'test2.csv'))
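On the try: while 1: loop above, a small sketch may help show what it does. It keeps calling next() on the ndiff generator until the generator is exhausted, and the bare except swallows the resulting StopIteration. So the count is the number of lines ndiff yields (unchanged lines included) plus one, not the number of mismatches. The lists `a` and `b` and the helper `count_like_original` are made up for illustration. The slow runtime is plausibly related to difflib's matching, which the docs describe as quadratic time in the worst case, though that is not confirmed for these files.

```python
import difflib

a = ["x\n", "y\n", "z\n"]
b = ["x\n", "z\n", "y\n"]   # same lines, different order

def count_like_original(diff):
    """Mirror of the loop in the question: increment, then advance."""
    count = 0
    try:
        while True:
            count += 1
            next(diff)          # raises StopIteration once the generator is empty
    except StopIteration:       # the bare except in the question also catches this
        pass
    return count

# Every line ndiff yields, including unchanged ones (prefix "  ").
total_lines = sum(1 for _ in difflib.ndiff(a, b))

# Only the lines flagged as removed ("- ") or added ("+ ").
changed = sum(
    1 for line in difflib.ndiff(a, b) if line.startswith(("- ", "+ "))
)

original_count = count_like_original(difflib.ndiff(a, b))
print(total_lines, changed, original_count)
```

Note that ndiff is order-sensitive: the two lists hold the same lines, yet the reordering is still reported as a difference.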
Edit: The files can contain duplicate rows, so storing the rows in a set will not work (because it would remove all duplicates, right?).
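One possible way to handle both the duplicates and the arbitrary row order is to count rows with collections.Counter, which behaves like a multiset, instead of using a set. The sketch below is only an illustration under some assumptions: the files are comma-delimited (pass a different `delimiter` to csv.reader otherwise), values are compared as strings (so 1.0 and 1.00 would count as different), and the function names `read_rows`, `compare` and `write_rows` are hypothetical.

```python
import csv
from collections import Counter

def read_rows(path):
    """Return (header, Counter of data rows) for one CSV file."""
    with open(path, newline="") as f:
        reader = csv.reader(f)           # assumes comma-delimited input
        header = next(reader, None)      # first row holds the field names
        # Rows become tuples so they are hashable; Counter keeps duplicates.
        return header, Counter(tuple(row) for row in reader)

def compare(path1, path2):
    """Return (headers_equal, rows only in file 1, rows only in file 2)."""
    header1, rows1 = read_rows(path1)
    header2, rows2 = read_rows(path2)
    # Counter subtraction keeps only positive counts, so a row that appears
    # twice in one file and once in the other is reported once.
    return header1 == header2, rows1 - rows2, rows2 - rows1

def write_rows(path, header, counter):
    """Write the rows held in a Counter (with repeats) to a CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        if header:
            writer.writerow(header)
        writer.writerows(counter.elements())

# Example usage (file names taken from the question):
# same_header, only1, only2 = compare('outfile.csv', 'test2.csv')
# mismatches = sum(only1.values()) + sum(only2.values())
# print('Found {} mismatched rows'.format(mismatches))
```

Each file is read once and rows are looked up by hash, so this should scale roughly linearly with file size. The two returned Counters can be passed to `write_rows` to dump the unmatched rows to another file.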