I need to compare two .csv files (each over 65,000 lines) and find the lines from the first file that are not in the second. I am using difflib.ndiff:

for line in difflib.ndiff(text1, text2):
    print(line,)

But I get unexpected results. The function finds two identical strings and marks them as different:

+ Gr4,DQ_3Gb_1m_DR_926_23489,100,,,70,,
- Gr4,DQ_3Gb_1m_DR_926_23489,100,,,70,,
  1. What could be the problem?
  2. What might be a suitable way to find the differences?

2.

from itertools import izip_longest
l1 = map(lambda x: x.strip(), list(open('test1.txt')))
l2 = map(lambda x: x.strip(), list(open('test2.txt')))
diff_list = izip_longest(l1, l2)
for diff in diff_list:
    print '%s %s %s' % (
        diff[0] or '', 
        '==' if diff[0] == diff[1] else '!=',
        diff[1] or '',
    )

I also tried the code above to compare the files, but I got the same unexpected result. Why is that?

Gr4,DQ_1Gb_1m_DR_926_23486,100,,,70,,!=Gr4,DQ_3Gb_1m_DR_926_23489,100,,,70,,
Gr4,DQ_3Gb_1m_DR_926_23489,100,,,70,,!=Gr4,DQ_1Gb_1m_DR_926_23486,100,,,70,,
stammer
  • have you tried pandas? – Vishal Upadhyay Aug 06 '20 at 09:39
  • if you're using linux you should use `diff` or `rdiff`. 65000 lines is relatively small and can be done programmatically; however, if you start going into the millions, python has a very hard time with malloc and comparisons: pandas is usually the best bet if you do need to use python – benjessop Aug 06 '20 at 09:43
  • I have a python script ready already. The only problem is that difflib does not work correctly. I need to compare each line of a file (there may be differences in any field of the line) and output the lines not found – stammer Aug 06 '20 at 11:59
  • on your last code, cast your diff items to string. For example, `str(diff[0])` – anlgrses Aug 11 '20 at 04:51

1 Answer

This is easy with pandas. Since you haven't provided the dataset, I'll use my own.

Assume I have two CSV files.

enter image description here

Data looks like this :

enter image description here

Now print the lines that are not present in the second file (the benz model is not present in the second file):

enter image description here
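
Since the screenshots are not reproduced here, below is a minimal sketch of one way to do this with pandas; it is not necessarily the exact code from the images. The inline data and the column names `model` and `price` are assumptions made up for illustration; with real files you would pass the file paths to `pd.read_csv` instead. The approach does not depend on line order, which may matter in your case: both `difflib.ndiff` and `izip_longest` compare by position, so a line that merely sits at a different position in the second file is reported as different.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the two CSV files (assumed data, not the
# data from the screenshots). Note the rows are in a different order.
csv1 = io.StringIO("model,price\naudi,100\nbenz,200\nbmw,300\n")
csv2 = io.StringIO("model,price\nbmw,300\naudi,100\n")

# dtype=str and keep_default_na=False keep every field as plain text,
# so values are compared exactly as they are written in the files.
df1 = pd.read_csv(csv1, dtype=str, keep_default_na=False)
df2 = pd.read_csv(csv2, dtype=str, keep_default_na=False)

# Left merge on all shared columns. indicator=True adds a "_merge" column
# that says whether each row was found in both frames or only in the left one.
# drop_duplicates() on the right side avoids multiplying rows of df1.
merged = df1.merge(df2.drop_duplicates(), how="left", indicator=True)

# Keep the rows of the first file that have no exact match in the second.
only_in_first = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_first)
```

A row counts as "present" only if every field matches, so a difference in any single field makes the whole row show up in the result.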

CodeRed