3

I have two gzipped files of about 1 GB each. I want to read both files simultaneously and compare every fourth line of one file with the corresponding line of the other. Is there a faster way than doing it like this?

import gzip

file1 = r"path\to\file1.gz"
file2 = r"path\to\file2.gz"


for idx, (line1, line2) in enumerate(zip(gzip.open(file1), gzip.open(file2)), start=1):
    if not idx%4:
        compare(line1, line2)
BioGeek

2 Answers

2

You still have to iterate through both files, but this is cleaner:

import gzip
from itertools import islice, izip

file1 = r"path\to\file1.gz"
file2 = r"path\to\file2.gz"

with gzip.open(file1) as f1, gzip.open(file2) as f2:
    for line1, line2 in islice(izip(f1, f2), 3, None, 4):
        compare(line1, line2)
Pavel Anossov
2

You can use itertools.islice(iterable, 3, None, 4) to iterate over every fourth item in iterable.
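
To see which items that call selects, here is a small sketch on a stand-in sequence; the name `lines` and the use of `range` in place of real file lines are illustrative assumptions only:

```python
from itertools import islice

# Stand-in for line numbers 1..12 (hypothetical data, not a real file).
lines = range(1, 13)

# start=3 skips the first three items, stop=None runs to the end,
# and step=4 then takes every fourth item from there.
every_fourth = list(islice(lines, 3, None, 4))
print(every_fourth)  # [4, 8, 12]
```

The start index is zero-based, so 3 lands on the fourth item, which matches the `idx % 4 == 0` check in the question.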

If you are on Python 2.x, use itertools.izip instead of zip to avoid reading everything into memory: there, zip builds the full list of line pairs before the loop starts.
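
On Python 3 the built-in zip is already lazy, so izip is not needed. A minimal sketch of the same idea, assuming Python 3 and text-mode reading; the function name `compare_every_fourth` and the `compare` callback parameter are hypothetical names chosen for illustration:

```python
import gzip
from itertools import islice


def compare_every_fourth(path1, path2, compare):
    """Call compare(line1, line2) on lines 4, 8, 12, ... of two gzip files."""
    # "rt" opens both files in text mode. zip() is lazy on Python 3,
    # so only one pair of lines is held in memory at a time.
    with gzip.open(path1, "rt") as f1, gzip.open(path2, "rt") as f2:
        for line1, line2 in islice(zip(f1, f2), 3, None, 4):
            compare(line1, line2)
```

Both files still have to be decompressed in full, so this is mainly tidier rather than faster; zip also stops at the end of the shorter file.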

Janne Karila