0

I am looking for a way to find the total number of mismatches between two strings in Python. My input is a list that looks like this:

['sequence=AGATGG', 'sequence=AGCTAG', 'sequence=TGCTAG',
 'sequence=AGGTAG', 'sequence=AGCTAG', 'sequence=AGAGAG']

For each string, I want to count how many differences it has from the sequence "sequence=AGATAA". So if the input were element [0] of the list above, the output would read like this:

sequence=AGATGG, 2

I cannot figure out whether to split each of the letters into individual lists or if I should try and compare the whole string somehow. Any help is useful, thanks

Savir
  • What do you mean *"differences"*? Just pairwise character comparison, or e.g. http://en.wikipedia.org/wiki/Levenshtein_distance, or...? – jonrsharpe Nov 24 '14 at 16:32
  • Define a function that loops from 0 to the number of characters in your string. For each index, increment a counter if the character in the reference string differs from the character at the same index in the checked string. When the loop finishes, the counter holds the exact number of differences between the two strings. – Antwane Nov 24 '14 at 16:33
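The loop described in the comment above could be sketched roughly as follows. The function name `count_mismatches` and the decision to reject strings of unequal length are assumptions made for illustration; they are not part of the original comment.

```python
def count_mismatches(reference, candidate):
    """Count the positions at which two equal-length strings differ."""
    # Assumption: unequal lengths are treated as an error rather than padded.
    if len(reference) != len(candidate):
        raise ValueError("strings must have the same length")
    mismatches = 0
    for i in range(len(reference)):
        # Compare the characters at the same index in both strings.
        if reference[i] != candidate[i]:
            mismatches += 1
    return mismatches


print(count_mismatches("AGATAA", "AGATGG"))  # prints 2
```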

3 Answers

5

You can easily compute the total number of pairwise mismatches between two strings using sum and zip:

>>> s1='AGATGG'
>>> s2='AGATAA'
>>> sum(c1!=c2 for c1,c2 in zip(s1,s2))
2

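If you have to deal with strings that are not the same length, note that zip stops at the end of the shorter string. In that case you may prefer itertools.zip_longest (izip_longest in Python 2), which pads the shorter string so that the surplus characters count as mismatches.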
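Applied to the list from the question, the same idea might look like the sketch below. The helper name `mismatch_count` and the variable names are assumptions for illustration. Because every entry and the reference share the "sequence=" prefix, the prefix contributes no mismatches and the strings can be compared whole.

```python
from itertools import zip_longest  # Python 3; the Python 2 name is izip_longest


def mismatch_count(s1, s2):
    # zip_longest pads the shorter string with None, so each surplus
    # character in the longer string is counted as a mismatch.
    return sum(c1 != c2 for c1, c2 in zip_longest(s1, s2))


records = ['sequence=AGATGG', 'sequence=AGCTAG', 'sequence=TGCTAG',
           'sequence=AGGTAG', 'sequence=AGCTAG', 'sequence=AGAGAG']
reference = 'sequence=AGATAA'

for record in records:
    # First line printed: sequence=AGATGG, 2
    print("{}, {}".format(record, mismatch_count(record, reference)))
```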

ch3ka
2

First of all, I think your safest bet is to use Levenshtein distance with some library. But since you tagged the question with Biopython, you can use pairwise2:

  1. First you want to get rid of the "sequence=". You can slice each string or split it on "=":

    seqs = [x.split("=")[1] for x in ['sequence=AGATGG',
                                      'sequence=AGCTAG',
                                      'sequence=TGCTAG',
                                      'sequence=AGGTAG',
                                      'sequence=AGCTAG',
                                      'sequence=AGAGAG']]
    
  2. Now define the reference sequence:

    ref_seq = "AGATAA"
    
  3. And using pairwise you can calculate the alignment:

    from Bio import pairwise2
    
    for seq in seqs:
        print(pairwise2.align.globalxx(ref_seq, seq))
    

I'm using pairwise2.align.globalxx, which performs a global alignment with the simplest scoring: each match scores 1, and there are no penalties for mismatches or gaps. Other functions accept different values for matches and gaps. Check them at http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html.

xbello
1

See Levenshtein distance: http://en.wikipedia.org/wiki/Levenshtein_distance.

You'll find a large number of Python libraries that implement this algorithm efficiently.

I believe it is more appropriate for comparing such gene sequences (since it also handles inserts and deletions well).
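To make the difference from a plain position-by-position comparison concrete, here is a minimal pure-Python sketch of the standard dynamic-programming approach to Levenshtein distance. It is an illustration only, not the implementation of any particular library, and the function name `levenshtein` is an assumption.

```python
def levenshtein(a, b):
    """Edit distance where substitution, insertion and deletion each cost 1."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute, or free match
            ))
        previous = current
    return previous[-1]


print(levenshtein("AGATGG", "AGATAA"))  # prints 2
print(levenshtein("AGATAA", "GATAAA"))  # prints 2
```

In the second call the two strings differ at four positions when compared index by index, but one deletion plus one insertion is enough to turn one into the other, which is why an edit distance can suit sequences that may contain inserts or deletions.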

GeneralBecos