
I have a set of text files, each containing two very large character sequences of identical length. They are DNA sequences, so I'm going to call them seq_1 and seq_2; together they form an alignment. The files look like this:

>HEADER1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>HEADER2
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

The possible characters in sequence 1 under >HEADER1 are ACGTN- and the possible characters in sequence 2 under >HEADER2 are ACGTN-*.

I want to analyze the sequences and return two lists of indices which I will call valid and mismatch.

valid contains all (1-based) indices where positions in both sequences (the "alignment") are in the set ACGT; mismatch contains all (1-based) indices where positions in the alignment are in the set ACGT but do not match one another. Thus mismatch is a subset of valid.

A final condition is that I do NOT increment the index counter at positions where sequence 1 is "-" because these are considered "gaps" in the coordinate system I am using.

This example shows an alignment and my expected output:

seq_1      = 'CT-A-CG'  # 7 characters, but coordinate length = 5 because of the two gaps
seq_2      = 'CNCTA*G'
valid      = [1,3,5]    # all indices where both sequences are in 'ACGT' without counting gaps
mismatch   = [3]        # subset of valid where alignment does not match
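
To make the rules above concrete, here is the same logic spelled out as a plain (slow) reference loop; the function name is just for illustration:

def classify(seq_1, seq_2):
    valid, mismatch = [], []
    pos = 0
    for a, b in zip(seq_1, seq_2):
        if a == '-':                 # gaps in seq_1 do not advance the coordinate
            continue
        pos += 1
        if a in 'ACGT' and b in 'ACGT':
            valid.append(pos)
            if a != b:
                mismatch.append(pos)
    return valid, mismatch

print classify('CT-A-CG', 'CNCTA*G')   # -> ([1, 3, 5], [3])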

I am looking to improve my current code (below), which involves regex extraction of the sequences, zipping and enumeration of the non-gap sites into a generator object, and then (the main time-consuming step) for-looping through this generator to fill the two lists. I feel like there must be an array-based or itertools solution that would be much more efficient than for-looping through the zipped sequences and their indices, and I am looking for suggestions.

Code:

import re
import itertools as it

def seq_divergence(infile):

    re_1 = re.compile('>HEADER1\n([^>]+)', re.MULTILINE)
    re_2 = re.compile('>HEADER2\n([^>]+)', re.MULTILINE)    

    valid     = []
    mismatch  = []

    mycmp     = cmp  # create local cmp for faster calling in the loop

    with open(infile, 'rb') as f:

        alignment = f.read()

    # get sequences and remove header and newlines
    seq_1       = iter(re_1.search(alignment).group(1).replace('\n',''))
    seq_2       = iter(re_2.search(alignment).group(1).replace('\n',''))

    # GENERATOR BLOCK:
    rm_gaps     = ((i, j) for (i, j) in it.izip(seq_1, seq_2) if i != '-')  # remove gaps
    all_sites   = enumerate(rm_gaps, 1)  # enumerate (1-based)
    valid_sites = ((pos, i, j) for (pos, (i, j)) in all_sites if set(i+j).issubset('ACGT'))  # all sites where both sequences are valid

    for (pos, i, j) in valid_sites:

        valid += [pos]
        if mycmp(i,j):
            mismatch += [pos]

    return valid, mismatch

EDIT: by popular demand, here is a link to one of the files for people who would like to test their code: https://www.dropbox.com/s/i904fil7cvv1vco/chr1_homo_maca_100Multiz.fa?dl=0

isosceleswheel
  • Are you aware of [difflib.SequenceMatcher](https://docs.python.org/2/library/difflib.html#difflib.SequenceMatcher)? – roippi Jul 29 '15 at 01:16
  • Would both sequences have same length always? – Anand S Kumar Jul 29 '15 at 01:23
  • In your example ``3`` is both in ``valid`` and ``mismatch``. Is this a mistake or am I misunderstanding the requirements? – pzp Jul 29 '15 at 01:29
  • @pzp double check the paragraph describing the conditions: valid is any site in the alignment where both are in the set `ACGT` whether they match or not; `mismatch` is a **subset** of this where they strictly **don't** match – isosceleswheel Jul 29 '15 at 01:34
  • @AnandSKumar yes they are always identical length after removing the header and '\n' – isosceleswheel Jul 29 '15 at 01:34
  • @roippi I am taking a look now, and I hadn't been aware of it before you mentioned it! – isosceleswheel Jul 29 '15 at 01:36
  • @roippi It looks like a lot of the returns for `difflib` would be messy and more verbose than I am looking for though. Ideally I want two vectors of indices or maybe two vectors of bools for the two conditions, something lightweight considering the big dataset... – isosceleswheel Jul 29 '15 at 01:49

2 Answers


Reading your code, I can tell you're a smart fellow, so I'm going to give you something completely untested, and let you figure out how to make it work and whether or not any of it goes faster than what you already have :-)

(Hey, it's not like you gave a realistic dataset in your question anyway...)

EDIT -- use hex numbers for figuring out the mismatches.

#! /usr/bin/env python2.7

# Text is much faster in 2.7 than 3...

def rm_gaps(seq1, seq2):
    ''' Given the pieces of a first sequence (as produced by
        splitting it on its gap character), drop the characters
        of a second sequence that sit opposite those gaps.
    '''
    startpos = 0
    for substring in seq1:
        length = len(substring)
        yield seq2[startpos:startpos + length]
        startpos += length + 1

def seq_divergence(infile):

    # Build a character translation map with ones
    # in the positions of valid bases, and
    # another one with hex numbers for each base.

    transmap_v = ['0'] * 256
    transmap_m = ['0'] * 256
    for ch, num in zip('ACGT', '1248'):
        transmap_v[ord(ch)] = '1'
        transmap_m[ord(ch)] = num
    transmap_v = ''.join(transmap_v)
    transmap_m = ''.join(transmap_m)


    # No need to do everything inside open -- you are
    # done with the file once you have read it in.
    with open(infile, 'rb') as f:
        alignment = f.read()

    # For this case, using regular string stuff might be faster than re

    h1 = '>HEADER1\n'
    h2 = h1.replace('1', '2')

    h1loc = alignment.find(h1)
    h2loc = alignment.find(h2)

    # This assumes header 1 comes before header 2.  If that is
    # not invariant, you will need more code here.

    seq1 = alignment[h1loc + len(h1):h2loc].replace('\n','')
    seq2 = alignment[h2loc + len(h2):].replace('\n','')

    # Remove the gaps
    seq1 = seq1.split('-')
    seq2 = rm_gaps(seq1, seq2)
    seq1 = ''.join(seq1)
    seq2 = ''.join(seq2)

    assert len(seq1) == len(seq2)

    # Let's use Python's long integer capability to
    # find locations where both sequences are valid.
    # Convert each sequence into a binary number,
    # and AND them together.
    num1 = int(seq1.translate(transmap_v), 2)
    num2 = int(seq2.translate(transmap_v), 2)
    valid = ('{0:0%db}' % len(seq1)).format(num1 & num2)
    assert len(valid) == len(seq1)


    # Now for the mismatch -- use hexadecimal instead
    # of binary here.  The 4 bits per character position
    # nicely matches our 4 possible bases.
    num1 = int(seq1.translate(transmap_m), 16)
    num2 = int(seq2.translate(transmap_m), 16)
    mismatch = ('{0:0%dx}' % len(seq1)).format(num1 & num2)
    assert len(mismatch) == len(seq1)

    # This could possibly use some work.  For example, if
    # you expect very few invalid and/or mismatches, you
    # could split on '0' in both these cases, and then
    # use the length of the strings between the zeros
    # and concatenate ranges for valid, or use them as
    # skip distances for the mismatches.
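    # (A rough sketch of this "split on '0'" idea appears right after this function.)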

    # Build mismatch first, while ``valid`` is still the bit string:
    # invalid characters also translate to '0', so a position only
    # counts as a mismatch if it is valid as well.
    mismatch = [x for x, (v, y) in enumerate(zip(valid, mismatch), 1)
                if v == '1' and y == '0']
    valid = [x for x, y in enumerate(valid, 1) if y == '1']

    return valid, mismatch
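
An untested sketch of the "split on '0'" idea from the closing comment, rebuilding positions from the lengths of the runs between zeros instead of testing every character (`ones_positions` is an illustrative name, Python 2 as above):

def ones_positions(bitstring):
    ''' 1-based positions of the '1' characters in a '0'/'1' string,
        built run-by-run; cheap when the '1' runs are long.
    '''
    positions = []
    pos = 1
    for run in bitstring.split('0'):
        if run:
            positions.extend(xrange(pos, pos + len(run)))
        pos += len(run) + 1
    return positions

# e.g. ones_positions('110111') -> [1, 2, 4, 5, 6]

The complementary trick for the mismatch string would use the '1' runs as skip distances instead.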
Patrick Maupin
  • I haven't gotten the chance to try your solution yet but I **have** added a link to a real data set if you are interested :) – isosceleswheel Jul 29 '15 at 14:31
  • I have also realized that two boolean mask-like arrays would be a more compact, simpler output for the downstream processing I have in mind. In other words, after removing the gaps, `valid = numpy.array([1 if set(i + j).issubset('ACGT') else 0 for (i, j) in zip(seq1, seq2)], dtype='u1')` and `mismatch = numpy.array([1 if cmp(i, j) else 0 for (i, j) in zip(seq1, seq2)], dtype='u1')`. Note that although `mismatch` will capture some "invalid" alignments, `mismatch *= valid` will mask those out. – isosceleswheel Jul 29 '15 at 14:46
  • I am extremely busy and have no time to play with this interesting problem, but I will note that if you are onboard with creating masks as I suggest (I think translate is really fast, and with ctypes you could reinterpret the translated strings as arrays of unsigned 8 bits) and if you are onboard with using numpy as the next answer suggests, then the numpy array cumulative sum (cumsum) method might remove the need for the squeeze step, because you can assign incrementing numbers only to those locations that don't have the '-' in them. Good luck! – Patrick Maupin Jul 31 '15 at 13:52
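
A rough, untested sketch of the mask-plus-cumsum idea discussed in these comments (the function name and the use of `np.in1d` are illustrative choices rather than anything from the answer above):

import numpy as np

def mask_indices(seq_1, seq_2):
    a = np.array(list(seq_1))          # one single-character element per position
    b = np.array(list(seq_2))

    coord = np.cumsum(a != '-')        # 1-based coordinate; gaps in seq_1 do not advance it

    # boolean masks over the full alignment
    valid_mask = np.in1d(a, list('ACGT')) & np.in1d(b, list('ACGT'))
    mismatch_mask = valid_mask & (a != b)

    return coord[valid_mask].tolist(), coord[mismatch_mask].tolist()

print mask_indices('CT-A-CG', 'CNCTA*G')   # -> ([1, 3, 5], [3])

Indexing `coord` with the boolean masks returns 1-based positions directly, so no separate gap-removal pass over the strings is needed.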

Here's my version (but you may start throwing stones at me for obfuscation). Please post some longer test data; I'd like to test it...

seq1='CT-A-CG'
seq2='CNCTA*G'

import numpy as np
def is_not_gap(a): return 0 if a == '-' else 1
def is_valid(a): return 1 if (a == 'A' or a == 'C' or a == 'G' or a == 'T') else 0
def is_mm(t): return 0 if t[0] == t[1] else 1

# find the indexes to be retained -- gaps are defined by seq1 only, per the question
# (Python 2: map() returns a list, which numpy converts directly)
retainx = np.where(map(is_not_gap, seq1))[0].tolist()

# find the valid ones (0-based, in the gap-free coordinate system)
valid0 = np.where(np.multiply(map(is_valid, seq1), map(is_valid, seq2))[retainx])[0].tolist()

# find the mismatches (restricted to the valid positions)
mm = np.array(map(is_mm, zip(seq1, seq2)))
mismatch0 = set(valid0) & set(np.where(mm[retainx])[0])

Results (zero-based indexing):

 valid0
 [0, 2, 4]

 mismatch0
 {2}
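
If the 1-based indices from the question are wanted, shifting these is trivial (a sketch using the names above):

valid = [i + 1 for i in valid0]                 # [1, 3, 5]
mismatch = sorted(i + 1 for i in mismatch0)     # [3]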

(I can post a longer, more verbose version if you want)

wmoco_6725