
I am working on a problem from Rosalind and have been stuck on it for a week. I'm going to try to explain this as simply as possible.

Input - A string called Genome, and integers k, L, and t. A genome is a string of genetic code to sort through.

k is a given integer, the size of each kmer. A kmer is a substring of the genetic code that might have some meaning. t is the number of times a kmer appears in a clump. L is the length of text that forms a clump. For example, if L = 400 we are looking for a kmer that occurs t times within a clump of 400 characters.

Output - All distinct k-mers forming (L, t)-clumps in Genome.

This code takes the genome, breaks it up into all possible kmers and inserts those kmers into a dictionary. The kmers are keys. The values are set up like [frequency_of_kmer, [kmer locations]]. That value is stored like this in the dictionary: {'AAAAA' : [y, [z1,z2]]}, where y is the number of occurrences, and z1 and z2 are the indices in the string where the substring is found.

Basically, I am looking to iterate over the dictionary. I want to find the keys that occur t times within the text. That is, I want to find all keys of a dictionary d whose stored frequency equals t, i.e. d[key][0] == t.

Code below, followed by the output.

Code:

from pprint import pprint
genome = "CGGACTCGACAGATGTGAAGAAATGTGAAGACTGAGTGAAGAGAAGAGGAAACACGACACGACATTGCGACATAATGTACGAATGTAATGTGCCTATGGC"
k = 5
L = 75
t = 4
len_genome = len(genome)
l = []

for i in range(len_genome - k + 1):
    kmer = genome[i:i + k]
    # list of every possible kmer
    l.append(kmer)

d = {}
for i in range (len(l)):
    try:
        d[l[i]][0] += 1
        d[l[i]][1].append(i)
    except KeyError:
        d[l[i]] = [1, [i]]

pprint(d)
Dan
  • Try asking the question like a programming question and not like a biology question. Hint: Don't use the words `genome`, `kmer` or `clump` – Slater Victoroff Jan 02 '14 at 22:08
  • Can you add what's wrong with the code? i.e. what specifically is wrong with the output, and what should the output be? Also check your indenting. – Alexandru Chirila Jan 02 '14 at 22:08
  • Your indentation is broken. This is a consequence of using tabs, instead of spaces. Not wanting to start a holy-war here, so either change your editor setting, or fix the indentation here :) – BartoszKP Jan 02 '14 at 22:17
  • What exactly do you mean by "clump"? Is it just the whole string contained in `genome`? – Dan Jan 02 '14 at 22:33

3 Answers


Edit: If I understand you correctly, this can be achieved pretty easily:

from pprint import pprint
x = 4
pprint({key: value for key, value in d.items() if value[0] == x})

Output:

{'AATGT': [4, [21, 73, 81, 86]],
 'CGACA': [4, [6, 54, 59, 67]],
 'GAAGA': [4, [16, 26, 37, 42]]}

(original answer below)

I don't know what a clump is, but this is how you access, say the second integer in 'AATGT' (the 7th dict item, which is 73):

d['AATGT'][1][1]

['AATGT'] gets the value of the key 'AATGT', the first [1] accesses the second item in the outermost list, and the second [1] accesses the second value in the innermost list.

This yields 73 as expected.
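As a side note, chained indexes such as [1][1] can be hard to read. Here is a minimal sketch of the same lookup using tuple unpacking, assuming the [count, [positions]] layout from the question. The one-entry dictionary is only a stand-in for the full d, and the variable names are my own:

```python
# Stand-in for the question's dictionary: {kmer: [count, [positions]]}
d = {'AATGT': [4, [21, 73, 81, 86]]}

# Unpack the two-element value into named parts instead of indexing twice.
count, positions = d['AATGT']

second_position = positions[1]  # same value as d['AATGT'][1][1]
print(count, second_position)
```

Naming the two parts once makes later code that touches the count or the positions easier to follow.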

If you want to iterate over all these values, you can use a double for loop:

# d.items() works in both Python 2 and 3 (d.iteritems() is Python 2 only)
for key, sublist in d.items():
    print('kmer: {}'.format(key))
    for value in sublist[1]:
        print(value)

This yields

kmer: ACACG
51
56
kmer: TAATG
72
85
kmer: AGAGG
44
kmer: GGACT
1
(...)
Steinar Lima

If I am understanding you correctly, you need the list of all of the kmers, which are the keys of the dictionary d.

To get all of the keys of a dictionary, you can use the keys() method of the dictionary class, like so:

kmer_list=d.keys()

If you want to find all the sequences that occur a certain number of times, try:

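occurrence_times = 4
# A list comprehension returns a list in both Python 2 and 3
# (filter() returns a lazy iterator in Python 3).
kmer_list = [kmer for kmer in d if d[kmer][0] == occurrence_times]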
Dan
    from collections import defaultdict

    code = "AGCTTTT...TTTTTC"
    k, L, t = 9, 500, 3

    # Map each k-mer to the list of positions where it starts (in increasing order).
    d = defaultdict(list)
    for z in range(len(code) - k + 1):
        d[code[z:z + k]].append(z)

    # A k-mer forms an (L, t)-clump if some t consecutive occurrences
    # fit inside a window of length L.
    results = set()
    for kmer, positions in d.items():
        for y in range(len(positions) - t + 1):
            if positions[y + t - 1] - positions[y] <= L - k:
                results.add(kmer)
                break

    # Print the distinct k-mers in sorted order, then how many there are.
    results = sorted(results)
    if results:
        print(" ".join(results))
        print(len(results))
    else:
        print("No result")
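A note on why the comparison against L - k works: if t occurrences of a k-mer start at increasing positions, the last one ends k characters after its start, so all of them fit in a window of length L exactly when (last start + k - first start) <= L. Below is a minimal sketch of the same idea wrapped in a function. The name find_clumps and the toy input string are my own illustrative assumptions, not part of the Rosalind problem:

```python
from collections import defaultdict

def find_clumps(text, k, L, t):
    """Return the sorted distinct k-mers that occur at least t times
    within some window of length L in text (hypothetical helper)."""
    positions = defaultdict(list)
    # Record every start index of every full-length k-mer, in increasing order.
    for i in range(len(text) - k + 1):
        positions[text[i:i + k]].append(i)

    clumps = set()
    for kmer, starts in positions.items():
        # Compare each occurrence with the one t - 1 places after it.
        for j in range(len(starts) - t + 1):
            if starts[j + t - 1] + k - starts[j] <= L:
                clumps.add(kmer)
                break
    return sorted(clumps)

print(find_clumps("ABCABCABC", k=3, L=9, t=3))  # ['ABC']
print(find_clumps("ABCABCABC", k=3, L=8, t=3))  # []
```

In the toy string, ABC starts at 0, 3 and 6, so its three occurrences span 6 + 3 - 0 = 9 characters: they fit in a window of 9 but not in a window of 8.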
Stefan Gruenwald