I have a set of text files which contain two very large sets of characters of identical length. The sets of characters are DNA sequences so I'm going to call them seq_1
and seq_2
and together they are called an alignment
. Files look like this:
>HEADER1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>HEADER2
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
The possible characters in sequence 1 under >HEADER1
are ACGTN-
and the possible characters in sequence 2 under >HEADER2
are ACGTN-*
.
I want to analyze the sequences and return two lists of indices which I will call valid
and mismatch
.
valid
contains all (1-based) indices where positions in both sequences (the "alignment") are in the set ACGT
; mismatch
contains all (1-based) indices where positions in the alignment are in the set ACGT
but do not match one another. Thus mismatch
is a subset of valid
.
A final condition is that I do NOT increment the index counter at positions where sequence 1 is "-"
because these are considered "gaps" in the coordinate system I am using.
This example shows an alignment my expected output:
seq_1 = 'CT-A-CG' # total length = 5 because of the two gaps
seq_2 = 'CNCTA*G'
valid = [1,3,5] # all indices where both sequences are in 'ACGT' without counting gaps
mismatch = [3] # subset of valid where alignment does not match
I am looking to improve my current code (below), which involves regex extraction of the sequences, zipping and enumeration of non-gap sites into a generator object and then --the main time consuming step-- for-looping through this generator and filling the two lists. I feel like there must be an array-based or itertools
solution to this problem that would be much more efficient than a for-loop through the sequences and their index zipped together and I am looking for suggestions.
Code:
def seq_divergence(infile):
re_1 = re.compile('>HEADER1\n([^>]+)', re.MULTILINE)
re_2 = re.compile('>HEADER2\n([^>]+)', re.MULTILINE)
valid = []
mismatch = []
mycmp = cmp # create local cmp for faster calling in the loop
with open(infile, 'rb') as f:
alignment = f.read()
# get sequences and remove header and newlines
seq_1 = iter(re_1.search(alignment).group(1).replace('\n',''))
seq_2 = iter(re_2.search(alignment).group(1).replace('\n',''))
# GENERATOR BLOCK:
rm_gaps = ((i, j) for (i, j) in it.izip(seq_1, seq_2) if i != '-') # remove gaps
all_sites = enumerate(rm_gaps, 1) # enumerate (1-based)
valid_sites = ((pos, i, j) for (pos, (i, j)) in all_sites if set(i+j).issubset('ACGT')) # all sites where both sequences are valid
for (pos, i, j) in valid_sites:
valid += [pos]
if mycmp(i,j):
mismatch += [pos]
return valid, mismatch
EDIT: by popular demand, here is a link to one of the files for people who would like to test their code: https://www.dropbox.com/s/i904fil7cvv1vco/chr1_homo_maca_100Multiz.fa?dl=0