I have been working on a problem from Rosalind and have been stuck on it for a week. I'm going to try to explain this as simply as possible.
Input - A string called Genome, and integers k, L, and t. A genome is a string of genetic code to sort through. k is a given integer, the size of each kmer. A kmer is a substring of the genetic code that might have some meaning. t is the number of times a kmer appears in a clump. L is the length of text that forms a clump. For example, if L = 400, we are looking for a kmer that occurs t times within a clump of 400 characters.
Output - All distinct k-mers forming (L, t)-clumps in Genome.
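To make the definition concrete, here is a minimal brute-force sketch that follows it literally: slide a window of length L over the genome and count the kmers inside each window. The function name clumps_brute_force is made up for illustration, the toy string is not from the problem, and it assumes "occurs t times" means "at least t times".

```python
from collections import Counter

def clumps_brute_force(genome, k, L, t):
    """Return the set of kmers that occur at least t times in some window of length L.

    Hypothetical helper for illustration; assumes 'at least t' occurrences.
    """
    found = set()
    for start in range(len(genome) - L + 1):
        window = genome[start:start + L]
        # count every kmer that lies entirely inside this window
        counts = Counter(window[i:i + k] for i in range(len(window) - k + 1))
        found.update(kmer for kmer, n in counts.items() if n >= t)
    return found

# Toy input, made up for illustration
print(sorted(clumps_brute_force("ACGTACGTAC", k=4, L=8, t=2)))
```

This recounts every window from scratch, so it is only practical for short strings, but it is handy for checking a faster version against.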
This code takes the genome, breaks it up into all possible kmers and inserts those kmers into a dictionary. The kmers are keys. The values are set up like [frequency_of_kmer, [kmer locations]]. That value is stored like this in the dictionary: {'AAAAA' : [y, [z1,z2]]}, where y is the number of occurrences, and z1 and z2 are the indices in the string where the substring is found.
Basically, I am looking to iterate over the dictionary. I want to find the keys that occur t times within the text. That is, I want to find all keys of a dictionary d whose stored frequency equals t, i.e. d[key][0] == t.
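Since each value is a list of the form [frequency, [locations]], the comparison has to be made against the first element of the value rather than the value itself. A minimal sketch of that filter follows; the small dictionary is a made-up stand-in with illustrative numbers, not the real output of the code.

```python
# Made-up miniature of the dictionary: kmer -> [frequency, [start indices]]
d = {
    "AAAAA": [2, [0, 7]],
    "CGACA": [4, [6, 54, 59, 67]],
    "TGAAG": [3, [15, 25, 36]],
    "ACGTA": [5, [1, 9, 20, 33, 48]],
}
t = 4

# unpack each value into its frequency and its list of locations
exactly_t = [kmer for kmer, (count, _) in d.items() if count == t]
at_least_t = [kmer for kmer, (count, _) in d.items() if count >= t]

print(exactly_t)
print(at_least_t)
```

Note that this only looks at the frequency over the whole text. Whether those occurrences also fall within L characters of each other has to be checked separately using the stored locations.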
Code below, followed by the output.
Code:
from pprint import pprint

genome = "CGGACTCGACAGATGTGAAGAAATGTGAAGACTGAGTGAAGAGAAGAGGAAACACGACACGACATTGCGACATAATGTACGAATGTAATGTGCCTATGGC"
k = 5
L = 75
t = 4
len_genome = len(genome)

# list of every possible kmer
l = []
for i in range(len_genome - k + 1):
    kmer = genome[i:i + k]
    l.append(kmer)

# map each kmer to [frequency, [start indices]]
d = {}
for i in range(len(l)):
    try:
        d[l[i]][0] += 1
        d[l[i]][1].append(i)
    except KeyError:
        d[l[i]] = [1, [i]]

pprint(d)
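One possible way to go from the stored locations to clumps, sketched under a few assumptions: a kmer counts as forming an (L, t)-clump when at least t of its occurrences fit entirely inside some stretch of L characters, and the name find_clumps is made up for illustration. Because the locations are collected in increasing order, it is enough to compare each occurrence with the one t - 1 places after it.

```python
from collections import defaultdict

def find_clumps(genome, k, L, t):
    """Return the set of kmers with at least t occurrences inside some window of length L.

    Hypothetical helper; a sketch, not the reference solution.
    """
    # kmer -> list of start indices, in increasing order
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)

    clumps = set()
    for kmer, starts in positions.items():
        for j in range(len(starts) - t + 1):
            # span from the start of occurrence j to the end of occurrence j + t - 1
            if starts[j + t - 1] + k - starts[j] <= L:
                clumps.add(kmer)
                break
    return clumps

genome = "CGGACTCGACAGATGTGAAGAAATGTGAAGACTGAGTGAAGAGAAGAGGAAACACGACACGACATTGCGACATAATGTACGAATGTAATGTGCCTATGGC"
print(sorted(find_clumps(genome, k=5, L=75, t=4)))
```

Storing only the list of locations is enough here, since the frequency is just the length of that list.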