I want to compare two text files in Python, and return the lines that are different. My attempt uses difflib, but I'm open to other suggestions. I need to get the lines that are different, as well as the lines that appear in one file but not the other. Order is somewhat important, but if a good solution exists that doesn't take order into consideration, I can let go of that.

The problem is that in some cases one file has lines with multiple trailing \t and \n characters while the other doesn't; I don't want to count that as a difference. In other cases, the first file has only \n and the other file has \t characters at the end. The lines contain elements that are separated by tabs or spaces, so those separators are important; I just don't care about the trailing \t and \n characters.

My solution:

from difflib import Differ

with open(file_path) as actual:
    with open(test_file_path) as test:
        differ = Differ()

        for line in differ.compare(actual.readlines(), test.readlines()):
            if line.startswith('-'):
                log.error('EXPECTED:  {}'.format(line[2:]))
            if line.startswith('+'):
                log.error('TEST FILE: {}'.format(line[2:]))

I expect the output to show EXPECTED and TEST FILE lines when there's a difference, and just EXPECTED or just TEST FILE when one contains a line the other doesn't. Right now, I'm seeing a lot of the following types of errors:

00:02:40: ERROR EXPECTED:  Issuer   Type    OBal    Net WAC OTerm   WAM Age GrossCpn    HighRemTerm Grp                                     

00:02:40: ERROR TEST FILE: Issuer   Type    OBal    Net WAC OTerm   WAM Age GrossCpn    HighRemTerm Grp

As you can see (if you highlight it), the first line contains a number of spaces after 'Grp' and the other line doesn't. I want to consider these two lines the same.

I've tried to explicitly specify the tabs and line breaks:

actual_file = actual.readlines()
expected_file = []
for line in actual_file:
    if line[-1] == '\n':
        expected_file.append(line.rstrip('\n').rstrip('\t') + '\n')
    else:
        expected_file.append(line.rstrip('\t'))

However, it (a) slows the process down quite a bit, and (b) is required for every file type in a different way, since some files have trailing tabs followed by line breaks, some have just line breaks, and some have nothing at all. If there's no better way, I can strip every line of every trailing tab and linebreak, but it seems like a lot of processing power (I have to run a lot of files) for something that seems fairly easy to resolve.

user2524282
  • Consider removing trailing spaces in your line. – lamandy Dec 07 '17 at 05:32
  • @lamandy I may be confusing everything, but if I remove the trailing spaces in the line after the `for line in differ....` statement, it won't do anything since differ already considers those lines different due to the trailing space. Is that incorrect? I suppose I can recreate the original file line by line, removing the trailing spaces, but that seems inefficient. – user2524282 Dec 07 '17 at 05:37
  • Instead of directly passing the line to differ, process both of them to remove trailing spaces before you pass to differ – lamandy Dec 07 '17 at 05:40
  • Right... I've been working on that after your initial comment, and ran into two issues: (1) It slowed the whole thing down quite a bit. Manageable, but noticeable. (2) As I'm going through the rest of the files, I noticed that in some cases, the first file has something like '\t\t\t\t\t\t\n' while the second has '\n', and in other cases the first file just has '\n' while the second file has '\t\t\t\n' and so on. Ideally, I'd like to pass trailing characters to ignore, i.e. \t and \n. I will modify the question to reflect this. – user2524282 Dec 07 '17 at 05:52
  • you are probably trying to solve a trivial problem, which has been overcomplicated. – user1767754 Dec 07 '17 at 07:46

1 Answer

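Take a look at string.rstrip() here: https://docs.python.org/2/library/string.html#string.rstrip (that is the Python 2 module function; in Python 3, use the str.rstrip() method instead, i.e. s.rstrip()).

Called with no arguments, rstrip() strips all whitespace (spaces, tabs, and newlines) off the end of a string while leaving the \t and \n characters in the middle of the line alone, which should be exactly what you need. To strip only tabs and newlines and keep trailing spaces, pass the characters explicitly: rstrip('\t\n').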

Check it out:

>>> s = "This \t is \t a \t line \t\t\t\n\n\n"
>>> print(s)
This     is      a   line



>>>
>>> s = s.rstrip()
>>> s
'This \t is \t a \t line'
>>> print(s)
This     is      a   line
>>>
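
Applied to the comparison in the question, the idea is to strip both lists of lines before handing them to Differ, so that lines differing only in trailing tabs, spaces, or newlines compare equal. Here is a minimal sketch of that, not a definitive implementation: the helper names strip_trailing and diff_ignoring_trailing and the sample lines are made up for illustration, and it returns (label, text) pairs instead of calling the question's log object.

```python
from difflib import Differ


def strip_trailing(lines, chars=None):
    # Hypothetical helper. chars=None strips all trailing whitespace;
    # chars='\t\n' strips only trailing tabs and newlines.
    return [line.rstrip(chars) for line in lines]


def diff_ignoring_trailing(expected_lines, test_lines, chars=None):
    # Hypothetical helper: yield (label, text) for each line that still
    # differs once the trailing characters have been stripped.
    expected = strip_trailing(expected_lines, chars)
    test = strip_trailing(test_lines, chars)
    for line in Differ().compare(expected, test):
        if line.startswith('- '):
            yield ('EXPECTED', line[2:])
        elif line.startswith('+ '):
            yield ('TEST FILE', line[2:])


# Made-up sample data; in practice these would come from readlines().
expected_lines = ["Issuer\tType\tGrp\t\t\t\n", "A\t1\tX\n", "only in expected\n"]
test_lines = ["Issuer\tType\tGrp\n", "A\t2\tX\t\t\n"]

results = list(diff_ignoring_trailing(expected_lines, test_lines))
for label, text in results:
    print('{}: {}'.format(label, text))
```

The header line is no longer reported, because both versions are identical after stripping. The stripping is one linear pass over each file, which is likely to be small next to the cost of the diff itself, but that is worth measuring on your own files.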

Hope this helps!

Chad Lewis