I've extracted DNA sequences from a FASTQ file, and they are currently stored in a list:
sequences = ['ATCT','ATTT','ACGG','ACCG','ACGT','AGGT','ATGC','ATCC','AGTT']
I want to cluster the sequences into a list of tuples so that the sequences within each tuple are similar to each other.
I initially considered using Hamming distance (code shown below) to cluster similar sequences together. But if I understand Hamming distance correctly, I would have to pick one of the sequences as a reference and compare every other sequence against it, and depending on which reference I pick, I would end up with different clusters. For example, if I use 'ACGG' as my reference and a Hamming distance of 1 as my criterion, 'ACCG' and 'ACGT' would be clustered with 'ACGG'. Conversely, if I use 'ACCG' as my reference, 'ACGG' would be clustered with it but 'ACGT' would not, since 'ACGT' has a Hamming distance of 2 from 'ACCG'.
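def hamming_dist(sequence1, sequence2):
    # Hamming distance is only defined for sequences of equal length
    assert len(sequence1) == len(sequence2)
    # Count mismatched positions; the built-in zip works on Python 2 and 3, so no itertools import is needed
    return sum(base1 != base2 for base1, base2 in zip(sequence1, sequence2))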
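To make the reference problem concrete, here is a minimal sketch that builds a cluster around a chosen reference. It redefines the distance function so that it runs on its own; `cluster_around` and its `max_dist` parameter are hypothetical names I made up for illustration, and the threshold of 1 is just the criterion from my example above.

```python
def hamming_dist(a, b):
    """Number of positions at which two equal-length strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))


def cluster_around(reference, sequences, max_dist=1):
    """Hypothetical helper: all sequences within max_dist of the reference."""
    return tuple(s for s in sequences if hamming_dist(reference, s) <= max_dist)


sequences = ['ATCT', 'ATTT', 'ACGG', 'ACCG', 'ACGT', 'AGGT', 'ATGC', 'ATCC', 'AGTT']

# Same data, same threshold, two different references
cluster_acgg = cluster_around('ACGG', sequences)
cluster_accg = cluster_around('ACCG', sequences)

print(cluster_acgg)  # ('ACGG', 'ACCG', 'ACGT')
print(cluster_accg)  # ('ACGG', 'ACCG')
```

'ACGT' lands in the first cluster but not the second, which is exactly the inconsistency I am trying to avoid.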
Given this drawback of Hamming distance, I've considered using the Levenshtein ratio (from the Levenshtein module) or the Fuzzywuzzy ratio (from the Fuzzywuzzy module) to cluster the sequences instead.
What are the advantages and disadvantages of each method, and how should I set up my code to cluster similar sequences together?