Dictionary comprehension with multiple values for each key

Question

I'm doing a course in bioinformatics. We were supposed to create a function that takes a list of strings like this:

    Motifs =[
    "AACGTA", 
    "CCCGTT", 
    "CACCTT", 
    "GGATTA", 
    "TTCCGG"]

and turn it into a count matrix that counts the occurrences of the nucleotides (the letters A, C, G and T) in each column and adds a pseudocount of 1 to each count. The matrix is represented by a dictionary with a list of values for each key, like this:

    count = {
    'A': [2, 3, 2, 1, 1, 3], 
    'C': [3, 2, 5, 3, 1, 1], 
    'G': [2, 2, 1, 3, 2, 2], 
    'T': [2, 2, 1, 2, 5, 3]}

For example, A occurs once in the first column, so its entry is 1 + 1 (pseudocount) = 2. C appears twice in the fourth column, so its entry is 2 + 1 (pseudocount) = 3.

Here is my solution:

def CountWithPseudocounts(Motifs):
    t = len(Motifs)
    k = len(Motifs[0])
    count = {}
    for symbol in "ACGT":
        count[symbol] = [1 for j in range(k)]
    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            count[symbol][j] += 1
    return count

The first for loop generates a dictionary with the keys A, C, G, T and an initial value of 1 for each column, like this:

    count = {
    'A': [1, 1, 1, 1, 1, 1], 
    'C': [1, 1, 1, 1, 1, 1], 
    'G': [1, 1, 1, 1, 1, 1], 
    'T': [1, 1, 1, 1, 1, 1]}

The second set of for loops counts the occurrences of the nucleotides and adds them to the values of the existing dictionary shown above.

This works and does its job, but I want to know how to further compress both for loops using dict comprehensions.

NOTE: I am fully aware that there are a multitude of modules and libraries like biopython, scipy and numpy that can probably turn the entire function into a one-liner. The problem with modules is that their output format often doesn't match what the automated solution check from the course expects.

score 2 · Answer 1 · answered Nov 12 '21 at 11:13

This

count = {}
for symbol in "ACGT":
    count[symbol] = [1 for j in range(k)]

can be changed to a comprehension as follows

count = {symbol:[1 for j in range(k)] for symbol in "ACGT"}

and then further simplified by using Python's ability to multiply a list by an integer:

count = {symbol:[1]*k for symbol in "ACGT"}

answered Nov 12 '21 at 11:13

Daweo


So you basically add ones k times to each key? – Peter Wohlfarth Nov 12 '21 at 11:35
I multiply the one-element list `[1]` by `k` to get a list of `k` ones – Daweo Nov 12 '21 at 12:26
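
To illustrate the exchange above, here is a minimal sketch of what `[1]*k` produces and why the comprehension gives every key its own list. The value `k = 6` is taken from the example motifs in the question. The `dict.fromkeys` contrast is not part of the answer; it is added only as a cautionary comparison.

```python
k = 6  # motif length in the question's example

# [1] * k repeats the single element k times
row = [1] * k
print(row)  # [1, 1, 1, 1, 1, 1]

# The comprehension evaluates [1] * k once per key,
# so every key gets its own independent list
count = {symbol: [1] * k for symbol in "ACGT"}
count["A"][0] += 1
print(count["A"][0], count["C"][0])  # 2 1

# Contrast (not from the answer): dict.fromkeys reuses ONE list for all keys
shared = dict.fromkeys("ACGT", [1] * k)
shared["A"][0] += 1
print(shared["C"][0])  # 2, because "A" and "C" refer to the same list
```

So the comprehension form is safe for the later `count[symbol][j] += 1` updates, whereas a shared list would corrupt every row at once.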

score 1 · Answer 2 · answered Nov 12 '21 at 11:04

Compressing the first loop:

count = {symbol: [1 for j in range(k)] for symbol in "ACGT"}

This construct is called a dict comprehension (not a generator): it builds a dict from a for loop in a single expression.

I'm not sure you can compress the second (nested) loop, since it doesn't build anything new; it modifies the dict created by the first loop.
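
One possible way around this, offered as a sketch rather than a definitive answer: instead of updating an existing dict, compute each list directly by counting the symbol in every column. This assumes all motifs have the same length. The name `count_with_pseudocounts` is a hypothetical stand-in for the course's `CountWithPseudocounts`.

```python
def count_with_pseudocounts(motifs):
    # zip(*motifs) yields one tuple per column;
    # tuple.count tallies how often a symbol appears in that column
    columns = list(zip(*motifs))
    return {
        symbol: [col.count(symbol) + 1 for col in columns]
        for symbol in "ACGT"
    }

motifs = ["AACGTA", "CCCGTT", "CACCTT", "GGATTA", "TTCCGG"]
count = count_with_pseudocounts(motifs)
print(count["A"])  # [2, 3, 2, 1, 1, 3]
```

This scans every column once per symbol, so it does four passes instead of one. For motif lists of this size that cost is unlikely to matter.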

score 1 · Answer 3 · answered Nov 12 '21 at 11:09

You can compress your code a lot using collections.Counter and collections.defaultdict:

from collections import Counter, defaultdict

out = defaultdict(list)
bases = 'ACGT'

for m in zip(*Motifs):
    c = Counter(m)
    for b in bases:
        out[b].append(c[b]+1)
dict(out)

output:

{'A': [2, 3, 2, 1, 1, 3],
 'C': [3, 2, 5, 3, 1, 1],
 'G': [2, 2, 1, 3, 2, 2],
 'T': [2, 2, 1, 2, 5, 3]}

answered Nov 12 '21 at 11:09

mozway


Works and looks pretty nice. But can you tell me what exactly you get when you unzip the Motifs? I'm not that familiar with zip() yet. – Peter Wohlfarth Nov 12 '21 at 11:31
Here `zip` is used to get the bases position by position; it effectively turns rows into columns ;) You should check the output of `list(zip(*Motifs))` – mozway Nov 12 '21 at 11:35
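
Following the suggestion in the comment above, a short sketch of what `list(zip(*Motifs))` looks like for the question's input. The variable name `columns` is introduced here only for illustration.

```python
Motifs = ["AACGTA", "CCCGTT", "CACCTT", "GGATTA", "TTCCGG"]

# The * unpacks the list, so this is zip("AACGTA", "CCCGTT", ...),
# which pairs up the characters found at the same position in each string
columns = list(zip(*Motifs))

print(columns[0])    # ('A', 'C', 'C', 'G', 'T'), the first letter of every motif
print(len(columns))  # 6, one tuple per column
```

Each tuple is one column of the motif matrix, which is exactly what `Counter` is applied to in the answer.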

score 1 · Answer 4 · answered Nov 12 '21 at 12:42

You can use collections.Counter:

from collections import Counter
m = ['AACGTA', 'CCCGTT', 'CACCTT', 'GGATTA', 'TTCCGG']
d = [Counter(i) for i in zip(*m)]
r = {a:[j.get(a, 0)+1 for j in d] for a in 'ACGT'}

Output:

{'A': [2, 3, 2, 1, 1, 3], 'C': [3, 2, 5, 3, 1, 1], 'G': [2, 2, 1, 3, 2, 2], 'T': [2, 2, 1, 2, 5, 3]}
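
Since the question mentions an automated checker that expects a specific output format, here is a small sketch that compares this approach with a loop-based version equivalent to the one in the question. The wrapper name `count_with_counter` is hypothetical, and the loop version is lightly restated with `enumerate`.

```python
from collections import Counter

def CountWithPseudocounts(Motifs):
    # Loop-based version, equivalent to the function in the question
    k = len(Motifs[0])
    count = {symbol: [1] * k for symbol in "ACGT"}
    for motif in Motifs:
        for j, symbol in enumerate(motif):
            count[symbol][j] += 1
    return count

def count_with_counter(Motifs):
    # Hypothetical wrapper around the Counter approach shown above
    per_column = [Counter(col) for col in zip(*Motifs)]
    return {a: [c[a] + 1 for c in per_column] for a in "ACGT"}

m = ["AACGTA", "CCCGTT", "CACCTT", "GGATTA", "TTCCGG"]
result = count_with_counter(m)
assert result == CountWithPseudocounts(m)  # same keys, same lists
print(type(result))  # <class 'dict'>
```

Both functions return a plain `dict` of lists, so the format should match what the loop-based solution produces. Whether the course checker accepts it is not something this sketch can confirm.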
