2

So, this one has been giving me a hard time!
I am working with HUGE text files, and by huge I mean 100 GB+. Specifically, they are in the FASTQ format. This format is used for DNA sequencing data and consists of records of four lines, something like this:

@REC1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT  
+  
!''*((((***+))%%%++)(%%%%).1***-+*''))*55CCF>>>>>>CCCCCCC65  
@REC2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT  
+  
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65  
.  
.  
.

For the sake of this question, just focus on the header lines, starting with a '@'.

So, for QA purposes, I need to compare two such files. The files should have matching headers: the first record in each file should have the header '@REC1', the next '@REC2', and so on. I want to make sure this is the case before I proceed to heavy downstream analyses.
Since the files are so large, a naive iteration with string comparisons would take very long, and this QA step will be run numerous times, so I can't afford to wait that long. I thought a better way would be to sample records at a few points in the files, for example every 10% of the records. If the order of the records is messed up, I'd be very likely to detect it.
So far, I have been able to handle such files by estimating the file size and then using Python's file.seek() to access a record in the middle of the file. For example, to access a line approximately in the middle, I'd do:

import os

file_size = os.stat(fastq_file).st_size
start_point = int(file_size / 2)
with open(fastq_file) as f:
    f.seek(start_point)
    # look for the next beginning of a record, never mind how
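
One rough way to do that resync step (a sketch only; it leans on the '@REC' naming from the sample above, and opens the file in binary mode so that byte offsets behave predictably - a fully general FASTQ parser would need more care, since quality lines can also start with '@'):

def next_header_offset(path, offset):
    """Return the byte offset of the first '@REC' header line at or after offset."""
    with open(path, 'rb') as f:
        f.seek(offset)
        f.readline()                       # discard the (probably partial) current line
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                return None                # reached end of file without finding a header
            if line.startswith(b'@REC'):   # header convention from the sample records
                return pos

For example, next_header_offset(fastq_file, file_size // 2) would give the start of the first record after the midpoint.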

But now the problem is more complex, since I don't know how to coordinate between the two files: the byte offset is not an indicator of the line index in the file. In other words, how can I access the 10,567,311th line in both files to make sure they are the same, without going over the whole files?

Would appreciate any ideas/hints. Maybe iterating in parallel? But how exactly?
Thanks!

soungalo
    I indented your file sample to prevent SO from formatting it bold/italic etc. - I hope the result is correct. Please check if I messed up something. – Tim Pietzcker Nov 18 '15 at 07:54
  • Request for clarification: You'll consider two files concordant if corresponding `@REC123` lines occur at the same line number in both files. No other criteria? – Tim Pietzcker Nov 18 '15 at 07:57
  • @TimPietzcker - Thanks for the edit, and yes, this is the only criterion. Pretty straightforward... – soungalo Nov 18 '15 at 08:30
  • By the way, for quality control to succeed, not a single discrepancy should be allowed, right? In other words, we want to abort at the first error? – Tim Pietzcker Nov 18 '15 at 09:59
  • You are right again. – soungalo Nov 18 '15 at 10:31

5 Answers

3

Sampling is one approach, but you're relying on luck. Also, Python is the wrong tool for this job. You can do things differently and calculate an exact answer in a still reasonably efficient way, using standard Unix command-line tools:

  1. Linearize your FASTQ records: replace the newlines in the first three lines with tabs.
  2. Run diff on a pair of linearized files. If there is a difference, diff will report it.

To linearize, you can run your FASTQ file through awk:

$ awk '
    BEGIN {
      n = 0;
    }
    {
      a[n % 4] = $0;
      if ((n+1) % 4 == 0) {
        print a[0]"\t"a[1]"\t"a[2]"\t"a[3];
      }
      n++;
    }' example.fq > example.fq.linear

To compare a pair of files:

$ diff example_1.fq.linear example_2.fq.linear

If there's any difference, diff will find it and tell you which FASTQ record is different.

You could just run diff on the two files directly, without doing the extra work of linearizing, but it is easier to see which read is problematic if you first linearize.

So these are large files. Writing new files is expensive in time and disk space. There's a way to improve on this, using streams.

If you put the awk script into a file (e.g., linearize_fq.awk), you can run it like so:

$ awk -f linearize_fq.awk example.fq > example.fq.linear

This could be useful with your 100+ GB files, in that you can now set up two Unix file streams via bash process substitutions, and run diff on those streams directly:

$ diff <(awk -f linearize_fq.awk example_1.fq) <(awk -f linearize_fq.awk example_2.fq)

Or you can use named pipes:

$ mkfifo example_1.fq.linear
$ mkfifo example_2.fq.linear
$ awk -f linearize_fq.awk example_1.fq > example_1.fq.linear &
$ awk -f linearize_fq.awk example_2.fq > example_2.fq.linear &
$ diff example_1.fq.linear example_2.fq.linear
$ rm example_1.fq.linear example_2.fq.linear

Both named pipes and process substitutions avoid the step of creating extra (regular) files, which could be an issue for your kind of input. Writing linearized copies of 100+ GB files to disk could take a while, and those copies could also use disk space you may not have much of.

Using streams gets around those two problems, which makes them very useful for handling bioinformatics datasets in an efficient way.

You could reproduce these approaches with Python, but it will almost certainly run much slower, as Python is very slow at I/O-heavy tasks like these.
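
For illustration, a minimal Python sketch of the same linearize-and-compare idea might look like this (the file names are placeholders, and, as noted, it will likely run much slower than the awk/diff pipeline):

import itertools

def linearized(path):
    """Yield each 4-line FASTQ record as a single tab-joined string."""
    with open(path) as f:
        while True:
            record = [f.readline().rstrip('\n') for _ in range(4)]
            if not record[0]:              # readline() returns '' at end of file
                return
            yield '\t'.join(record)

pairs = itertools.zip_longest(linearized('example_1.fq'), linearized('example_2.fq'))
for i, (r1, r2) in enumerate(pairs):
    if r1 != r2:
        print('First difference at record', i)
        break
else:
    print('No differences')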

Alex Reynolds
2

Iterating in parallel might be the best way to do this in Python. I have no idea how fast this will run (a fast SSD will probably be the best way to speed this up), but since you'll have to count newlines in both files anyway, I don't see a way around this:

with open(file1) as f1, open(file2) as f2:
    for l1, l2 in zip(f1, f2):
        if l1.startswith("@REC"):
            if l1 != l2:
                print("Difference at record", l1)
                break
    else:
        print("No differences")   # the loop finished without hitting break

This is written for Python 3 where zip returns an iterator; in Python 2, you need to use itertools.izip() instead.

Tim Pietzcker
  • Thanks, but this is the kind of naive solution I am trying to avoid. I'll try the other approaches suggested here and see which works best. – soungalo Nov 18 '15 at 09:44
  • Sure, I understand that. I just suspect that there is no way around this. You'll have to read each line in every file anyway (otherwise there is no way to find out the current line number), and since you can't read them into memory, you have no choice but to iterate in parallel...unless I'm missing something. – Tim Pietzcker Nov 18 '15 at 09:47
  • @soungalo Agree with TimPietzcker; all three answers suggest reading the files row by row. – u354356007 Nov 18 '15 at 09:49
1

Have you looked into using the rdiff command?
The upsides of rdiff are:

  • with the same 4.5 GB files (from the article linked below), rdiff only ate about 66 MB of RAM and scaled very well. It has never crashed to date.
  • it is also MUCH faster than diff.
  • rdiff itself combines both diff and patch capabilities, so you can create deltas and apply them using the same program

The downsides of rdiff are:

  • it's not part of a standard Linux/UNIX distribution – you have to install the librsync package.
  • the delta files rdiff produces are in a slightly different format than diff's.
  • delta files are slightly larger (but not significantly enough to care).
  • a slightly different approach is used when generating a delta with rdiff, which is both good and bad – 2 steps are required. The first step produces a special signature file. In the second step, a delta is created using another rdiff call (a sketch of the two calls follows below). While the 2-step process may seem annoying, it has the benefit of producing deltas faster than diff does.

See: http://beerpla.net/2008/05/12/a-better-diff-or-what-to-do-when-gnu-diff-runs-out-of-memory-diff-memory-exhausted/
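
As a rough sketch of that two-step workflow (driven from Python via subprocess to match the rest of the thread; the file names are placeholders):

import subprocess

# Step 1: build a signature of the first file.
subprocess.run(['rdiff', 'signature', 'example_1.fq', 'example_1.sig'], check=True)

# Step 2: compute a delta of the second file against that signature.
subprocess.run(['rdiff', 'delta', 'example_1.sig', 'example_2.fq', 'example.delta'], check=True)

# A very small example.delta suggests the two files are essentially identical.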

Rolf of Saxony
0
import re

"""Find the records that differ between two huge files. This is expected to
use minimal memory."""

def get_rec_num(fd):
    """Return the next record number found in fd, or -1 if none is found."""
    while True:
        line = fd.readline()
        if len(line) == 0:
            break
        match = re.search(r'^@REC(\d+)', line)
        if match:
            return int(match.group(1))
    return -1

f1 = open('hugefile1', 'r')
f2 = open('hugefile2', 'r')

hf1 = dict()   # record numbers seen so far only in f1
hf2 = dict()   # record numbers seen so far only in f2
while f1 or f2:
    if f1:
        r = get_rec_num(f1)
        if r < 0:
            f1.close()
            f1 = None
        else:
            # if r is already in the f2 hash, the record matches; drop it
            if r not in hf2:
                hf1[r] = 1
            else:
                del hf2[r]
    if f2:
        r = get_rec_num(f2)
        if r < 0:
            f2.close()
            f2 = None
        else:
            # if r is already in the f1 hash, the record matches; drop it
            if r not in hf1:
                hf2[r] = 1
            else:
                del hf1[r]

print('Records found only in f1:')
for r in hf1:
    print('{}, '.format(r))
print('Records found only in f2:')
for r in hf2:
    print('{}, '.format(r))
balabhi
0

Both answers from @AlexReynolds and @TimPietzcker are excellent from my point of view, but I would like to put my two cents in. You also might want to speed up your hardware:

  • Replace your HDD with an SSD.
  • Take n SSDs and create a RAID 0 array. In a perfect world you will get an n-fold speedup for your disk I/O.
  • Adjust the size of the chunks you read from the SSD/HDD. I would expect, for instance, one 16 MB read to execute faster than sixteen 1 MB reads. (This applies to a single SSD; for RAID 0 optimization one has to take a look at the RAID controller's options and capabilities.)

The last option is especially relevant to NOR SSDs. Don't pursue minimal RAM utilization; instead, read as much as is needed to keep your disk reads fast. For instance, reading single rows from two files in parallel can probably slow reading down - imagine an HDD where the two rows from the two files always end up on the same side of the same magnetic disk(s).
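
To illustrate the chunk-size point, here is a minimal sketch (the file names and the 16 MB figure are just example values) that is essentially the parallel-iteration answer above, but with a large read buffer behind each file:

BUF_SIZE = 16 * 1024 * 1024   # example 16 MB buffer; tune for your drive/RAID setup

with open('reads_1.fastq', buffering=BUF_SIZE) as f1, \
     open('reads_2.fastq', buffering=BUF_SIZE) as f2:
    for i, (l1, l2) in enumerate(zip(f1, f2)):
        if i % 4 == 0 and l1 != l2:       # every 4th line is a header
            print('Header mismatch at record', i // 4)
            break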

u354356007