Compare two multiple-column csv files

Question

[Using Python 3] I want to compare the contents of two CSV files and have the script print whether they are the same. In other words, it should tell me whether all rows match and, if not, how many rows are mismatched.

Also I would like the flexibility to change the code later to write all rows that are not matched to another file.

Furthermore, although the two files should technically contain exactly the same rows, the rows may not be in the same order (except for the first row, which contains the headers).

The input files look something like this:

field1  field2  field3  field4  ...
string  float   float   string  ...
string  float   float   string  ...
string  float   float   string  ...
string  float   float   string  ...
string  float   float   string  ...
...     ...     ...     ...     ...

The code I am currently running is below, but to be honest I am not sure whether this is the best (most Pythonic) way. I am also not sure what the try: while 1: ... code is doing. This code is the result of my scouring the forum and the Python docs. So far the code takes a very long time to run.

As I am very new to this, I would welcome any feedback on the code, and would appreciate an explanation of any recommendations.

Code:

import csv
import difflib

'''
Checks the content of two csv files and returns a message.
If there is a mismatch, it will output the number of mismatches.
'''

def compare(f1, f2):

    file1 = open(f1).readlines()
    file2 = open(f2).readlines()

    diff = difflib.ndiff(file1, file2)

    count = 0

    try:
        while 1:
            count += 1
            next(diff)
    except:
        pass

    return 'Checked {} rows and found {} mismatches'.format(len(file1), count)

print (compare('outfile.csv', 'test2.csv'))

Edit: The files can contain duplicate rows, so storing the rows in a set will not work (because it would remove all duplicates, right?).
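One way to keep duplicates while ignoring row order is a multiset. The following is only a minimal sketch, not a tested solution for these particular files: it assumes comma-delimited input that fits in memory, and the helper names count_rows and compare_unordered are made up for illustration. Fields are compared as strings, so 1.0 and 1.00 would count as different.

```python
import csv
from collections import Counter


def count_rows(path):
    """Return (header, Counter of data rows) for one CSV file.

    Assumption: the file is comma-delimited and the first row is the header.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        # Rows become tuples so they are hashable; the Counter records
        # how many times each distinct row occurs, so duplicates survive.
        return header, Counter(tuple(row) for row in reader)


def compare_unordered(path1, path2):
    """Return (headers_match, rows only in file 1, rows only in file 2)."""
    header1, rows1 = count_rows(path1)
    header2, rows2 = count_rows(path2)
    # Counter subtraction keeps only positive counts, i.e. the surplus
    # occurrences of each row on one side.
    return header1 == header2, rows1 - rows2, rows2 - rows1
```

The number of mismatched rows would then be sum(only_in_1.values()) + sum(only_in_2.values()), and the keys of the two counters are the rows that could later be written to another file.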

You mention "the rows may not be ordered the same". Can you sort them before comparison, or is the different ordering a difference you are looking for? — Janne Karila, Jun 19 '13 at 05:55

Janne Karila · Answer 1 · 2013-06-18T12:55:16.477

2

The try-while block simply iterates over diff; you should use a for loop instead:

count = 0
for delta in diff:
    count += 1

or an even more Pythonic generator expression:

count = sum(1 for delta in diff)

(The original code increments count before each iteration and thus gives a count higher by one. I wonder if that is correct in your case.)
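To make the counting concrete, here is a small sketch on in-memory lines instead of files. The sample data and the function names count_deltas and count_changed are made up for illustration. Note that difflib.ndiff also yields unchanged lines (prefixed with two spaces) and hint lines (prefixed with "? "), so counting everything is not the same as counting mismatches.

```python
import difflib

# Hypothetical sample data standing in for the two files' lines.
old_lines = ["a,1\n", "b,2\n", "c,3\n"]
new_lines = ["a,1\n", "b,9\n", "c,3\n"]


def count_deltas(lines1, lines2):
    """Count every line ndiff yields, unchanged lines included."""
    return sum(1 for _ in difflib.ndiff(lines1, lines2))


def count_changed(lines1, lines2):
    """Count only lines ndiff marks as removed ('- ') or added ('+ ')."""
    return sum(
        1
        for delta in difflib.ndiff(lines1, lines2)
        if delta.startswith(("- ", "+ "))
    )


if __name__ == "__main__":
    print(count_deltas(old_lines, new_lines))
    print(count_changed(old_lines, new_lines))
```

A changed row shows up as one removed line plus one added line, so count_changed reports two lines for a single edited row.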

edited Jun 18 '13 at 12:55

answered Jun 18 '13 at 12:48

Janne Karila

Thanks for this Janne, I'm currently running the changed code but it's taking ages to complete - which is weird? – Matthijs Jun 18 '13 at 13:18

score 0 · Answer 2 · answered Jun 18 '13 at 13:11

To answer your question about while 1:

Please read more about Generators and iterators.

difflib.ndiff() returns a generator, which is a kind of iterator. The loop walks that iterator by calling next() until it is exhausted, incrementing count on each step. Note that ndiff also yields the lines common to both files (prefixed with two spaces), so the count is the total number of delta lines produced, not only the rows that differ.
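As a rough illustration of what the try/while construct does, the sketch below spells out the iterator protocol by hand and puts the equivalent for loop next to it. The function names are made up for this example. Unlike the original code, it catches only StopIteration; a bare except would also swallow unrelated errors and KeyboardInterrupt.

```python
import difflib


def count_with_next(iterator):
    """Count items by calling next() until the iterator is exhausted."""
    count = 0
    while True:
        try:
            next(iterator)
        except StopIteration:
            # Raised by next() when nothing is left; this ends the loop.
            break
        # Increment only after a successful next(), so there is no off-by-one.
        count += 1
    return count


def count_with_for(iterable):
    """Same result; the for loop handles StopIteration internally."""
    count = 0
    for _ in iterable:
        count += 1
    return count


if __name__ == "__main__":
    a = ["x\n", "y\n"]
    b = ["x\n", "z\n"]
    print(count_with_next(difflib.ndiff(a, b)))
    print(count_with_for(difflib.ndiff(a, b)))
```

Both functions consume the generator, so each call needs a fresh difflib.ndiff(a, b).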

Mukul Joshi

Hi Mukul, I did get that part on generators and iterators, but I reckon that I definitely need more knowledge on that area since I'm very new to (Python) programming. Thanks for the input! – Matthijs Jun 18 '13 at 13:19
