I'm comparing a bunch of fastq.gz files. Each file is ~4G:

if filecmp.cmp(f1,f2,shallow=False)

It returns False, meaning f1 and f2 are different. But when I unzip the files and compare them using diff/comm, I get no output. I tried both shallow=True and shallow=False. I'm trying to print out the differences, but it's running out of memory.

diff=difflib.ndiff((gzip.open(f1)).readlines(),(gzip.open(f2)).readlines())
print [i for i in diff if i.startswith('+')]

Is it because the files are gzipped? Any ideas on how to compare them without unzipping them? (Each file is 200M lines.)

Thank you!

Nathan Villaescusa
FairyDuster

1 Answer

In general you would need to compare the uncompressed output. That is the only way to definitively determine if the two gzip files have the same uncompressed contents. They could have been compressed with different compression levels or different gzip software, giving different compressed results. The only guarantee is that when you compress and then decompress, you get the original input. There is no guarantee whatsoever that when you decompress and then recompress, you get the original compressed file back.
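As a small illustration of that point, here is a sketch that compresses the same bytes at two compression levels with Python's zlib module. The helper name `gzip_bytes` and the sample data are made up for this example; `wbits=31` asks zlib to emit the gzip container.

```python
import zlib

def gzip_bytes(data, level):
    """Compress data into gzip format (wbits=31) at the given level."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, 31)
    return compressor.compress(data) + compressor.flush()

# Hypothetical sample input that loosely resembles FASTQ records.
data = b"@read\nACGTACGTACGTTTGA\n+\nIIIIIIIIIIIIIIII\n" * 2000

fast = gzip_bytes(data, 1)
best = gzip_bytes(data, 9)

# The compressed bytes differ, but both decompress to the same content.
print(fast == best)
print(zlib.decompress(fast, 31) == zlib.decompress(best, 31))
```

This is why a byte-level tool such as filecmp can report a difference between two .gz files even when diff on the unzipped files shows nothing.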

If you are in control of the gzip process, using the same code and the same compression levels and other options, you can still get different output due to the header contents. The headers may have different time stamps, different file names, or other variations. In that case you can skip the headers for each (using RFC 1952 as your guide to when the headers end), and then compare the remainder of each. Given the stated conditions, the remainders of the two files will then be identical.
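A rough sketch of the header-skipping idea, following the header layout in RFC 1952. The function names and the 64 KiB header probe size are assumptions made for this example, and it assumes each file holds a single gzip member produced with identical compression settings.

```python
import struct

# Header flag bits defined in RFC 1952.
FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10

def gzip_header_length(buf):
    """Return the length in bytes of the gzip header at the start of buf."""
    if len(buf) < 10 or buf[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    if buf[2] != 8:
        raise ValueError("unknown compression method")
    flags = buf[3]
    pos = 10  # fixed part: magic, CM, FLG, MTIME, XFL, OS
    if flags & FEXTRA:
        (xlen,) = struct.unpack_from("<H", buf, pos)
        pos += 2 + xlen
    if flags & FNAME:
        pos = buf.index(b"\x00", pos) + 1  # zero-terminated file name
    if flags & FCOMMENT:
        pos = buf.index(b"\x00", pos) + 1  # zero-terminated comment
    if flags & FHCRC:
        pos += 2  # CRC16 of the header
    if pos > len(buf):
        raise ValueError("truncated gzip header")
    return pos

def compressed_payloads_equal(path_a, path_b, chunk_size=1 << 20,
                              header_probe=1 << 16):
    """Compare two gzip files byte-for-byte after skipping their headers."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        # Assumption: each header fits within the first header_probe bytes.
        a.seek(gzip_header_length(a.read(header_probe)))
        b.seek(gzip_header_length(b.read(header_probe)))
        while True:
            block_a = a.read(chunk_size)
            block_b = b.read(chunk_size)
            if block_a != block_b:
                return False
            if not block_a:  # both files ended at the same point
                return True
```

A False result here only means the compressed bytes differ; if the files were not produced with identical settings, the uncompressed contents could still be the same.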

Another thing that you can do, again if you are in control of the compression and you know that each gzip file consists of a single gzip member, is that you can check the last eight bytes of each file. If those are not identical, then the uncompressed contents are different. If they are the same, then the contents may be identical, so you would then need to decompress and compare, or use the method above. This can save a lot of time, since you almost never have to fully compare gzip files that have different uncompressed content. Those last eight bytes are the four-byte CRC of the uncompressed data, and the length of the uncompressed data modulo 2^32.
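The trailer check could look something like this in Python. The helper names are hypothetical, and the sketch assumes each file is a single gzip member, so the last eight bytes of the file are that member's CRC-32 and length fields (both little-endian, per RFC 1952).

```python
import os
import struct

TRAILER_SIZE = 8  # 4-byte CRC-32 followed by 4-byte length modulo 2^32

def read_gzip_trailer(path):
    """Return (crc32, isize) read from the last eight bytes of a gzip file."""
    if os.path.getsize(path) < TRAILER_SIZE:
        raise ValueError("file too short to be a gzip file: %s" % path)
    with open(path, "rb") as fh:
        fh.seek(-TRAILER_SIZE, os.SEEK_END)
        trailer = fh.read(TRAILER_SIZE)
    return struct.unpack("<II", trailer)

def trailers_match(path_a, path_b):
    """False means the contents differ; True means a full compare is still needed."""
    return read_gzip_trailer(path_a) == read_gzip_trailer(path_b)
```

Since this reads only eight bytes per file, it is a cheap pre-filter before any full decompression, even for files of several gigabytes.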

Mark Adler
  • Thank you so much @Mark. I actually don't have any control over the gzipping process. Do you have any suggestions on how to unzip and compare the files? (other than gzip.open each one and loop line by line..) Can I somehow do it using filecmp? – FairyDuster Apr 26 '18 at 20:37
  • If you are only trying to determine if they are the same or different, I would not use diff or reading line-by-line to do that. diff can consume a lot of memory trying to maintain histories to find matching data. Line-by-line can consume memory if you go a long time without a newline, which can easily occur in binary data. You should read and decompress both gzip files and compare the binary results byte-for-byte. Once you see any difference, you stop and abort both decompressions. If you make it to the end, then the two are identical. – Mark Adler Apr 27 '18 at 01:00
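
One way the approach in the last comment might be written in Python, as a minimal sketch rather than a definitive implementation. The function names and the 1 MiB chunk size are assumptions for this example. It decompresses both files in fixed-size chunks and stops at the first mismatch, so memory use stays bounded no matter how large the files are.

```python
import gzip

CHUNK_SIZE = 1 << 20  # 1 MiB of decompressed data per read (arbitrary choice)

def _read_exact(stream, size):
    """Read up to size bytes, looping in case read() returns fewer."""
    parts = []
    remaining = size
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:  # end of stream
            break
        parts.append(chunk)
        remaining -= len(chunk)
    return b"".join(parts)

def gzip_contents_equal(path_a, path_b, chunk_size=CHUNK_SIZE):
    """Return True if both gzip files decompress to identical bytes."""
    with gzip.open(path_a, "rb") as a, gzip.open(path_b, "rb") as b:
        while True:
            block_a = _read_exact(a, chunk_size)
            block_b = _read_exact(b, chunk_size)
            if block_a != block_b:
                return False  # first difference found, stop decompressing
            if not block_a:
                return True  # both streams ended together
```

Usage would be along the lines of `gzip_contents_equal(f1, f2)` in place of the `filecmp.cmp` call from the question.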