Rosalind "Mendel's First Law" IPRB

Question

As preparation for an upcoming bioinformatics course, I am doing some assignments from rosalind.info. I am currently stuck in the assignment "Mendel's First Law".

I think I could brute force myself through this, but that somehow my thinking must be too convoluted. My approach would be this:

Build a tree of probabilities which has three levels. There are two creatures that mate, creature A and creature B. First level is, what is the probability for picking as creature A homozygous dominant (k), heterozygous (m) or homozygous recessive (n). It seems that for example for homozygous dominant, since there are a total of (k+m+n) creatures and k of them are homozygous dominant, the probability is k/(k+m+n).

Then in this tree, under each of these would come the probability of creature B being k / m / n given that we know what creature A got picked as. For example if creature A was picked to be heterozygous (m), then the probability that creature B would also be heterozygous is (m-1)/(k+m+n-1) because there is now one less heterozygous creature left.

This would give the two levels of probabilities, and would involve a lot of code just to get this far, as I would literally be building a tree structure and for each branch have manually written code for that part.

enter image description here

Now after choosing creatures A and B, each of them has two chromosomes. One of these chromosomes can randomly be picked. So for A chromosome 1 or 2 can be picked and same for B. So there are 4 different options: pick 1 of A, 1 of B. Pick 2 of A, 1 of B. Pick 1 of A, 2 of B. Pick 2 of A, 2 of B. The probability of each of these would be 1/4. So finally this tree would have these leaf probabilities.

Then from there somehow by magic I would add up all of these probabilities to see what is the probability that two organisms would produce a creature with a dominant allele.

I doubt that this assignment was designed to take hours to solve. What am I thinking too hard?

Update:

Solved this in the most ridiculous brute-force way possible. Just ran thousands of simulated matings and figured out the portion that ended up having a dominant allele, until there was enough precision to pass the assignment.

import random
k = 26
m = 18
n = 25

trials = 0
dominants = 0

while True:
    s = ['AA'] * k + ['Aa'] * m + ['aa'] * n
    first = random.choice(s)
    s.remove(first)
    second = random.choice(s)
    has_dominant_allele = 'A' in [random.choice(first), random.choice(second)]
    trials += 1
    if has_dominant_allele:
        dominants += 1
    print "%.5f" % (dominants / float(trials))

you should ask https://www.biostars.org/ – Pierre Aug 04 '14 at 16:04 — Pierre, Aug 04 '14 at 16:04

Klaus-Dieter Warzecha · Answer 1 · 2014-08-08T05:19:21.293

Species with dominant alleles are either AA or Aa.

Your total ppopulation (k + n + m consists of k (hom) homozygous dominant organisms with AA, m (het) heterozygous dominant organisms with Aa and n (rec) homozygous recessive organisms with aa. Each of these can mate with any other.

The probability for organisms with the dominant allele is:

P_dom = n_dominant/n_total or 1 - n_recessive/n_total

Doing the Punnett squares for each of these combinations is not a bad idea:

  hom + het

  |  A | a
-----------
A | AA | Aa
a | Aa | aa


  het + rec

  |  a | a
-----------
A | Aa | Aa
a | aa | aa

Apparently, mating of of two organisms results in four possible children. hom + het yields 1 of 4 organisms with the recessive allele, het + rec yields 2 of 4 organisms with the recessive allele.

You might want to do that for the other combinations as well.

Since we're not just mating the organisms one on one, but throw together a whole k + m + n bunch, the total number of offspring and the number of 'children' with a particular allele would be nice to know.

If you don't mind a bit of Python, comb from scipy.misc might be helpful here. in the calculation, don't forget (a) that you get 4 children from each combination and (b) that you need a factor (from the Punnett squares) to determine the recessive (or dominant) offspring from the combinations.

Update

    # total population
    pop_total = 4 * comb(hom + het + rec, 2)

    # use PUNNETT squares!

    # dominant organisms         
    dom_total = 4*comb(hom,2) + 4*hom*het + 4*hom*rec + 3*comb(het,2) + 2*het*rec

    # probability for dominant organisms
    phom = dom_total/pop_total
    print phom

    # probability for dominant organisms + 
    # probability for recessive organisms should be 1
    # let's check that:
    rec_total = 4 * comb(rec, 2) + 2*rec*het + comb(het, 2)
    prec = totalrec/totalpop
    print 1 - prec

I can't get this to work (can't access 'from scipy.misc import comb') to test, however, I think this may be resulting in an incorrect answer *in this case* because the problem presents a limited population where drawing one individual from the mix limits later choices - i.e. not all hardy/weinberg conditions are met. In sample data, we are given 2,2,2 of AA,Aa,aa, respectively and are asked to compute the probability of getting a dominant phenotype offspring. H/W would give you 75%. Rosalind gives a correct answer of 78333% I'm at a loss with this question as well. — johntreml, Dec 03 '15 at 01:02

score 1 · Answer 2 · answered Jul 19 '17 at 03:32

This is more a probability/counting question than coding. It's easier to calculate the probability of an offspring having only recessive traits first. Let me know if you have any trouble understanding anything. I ran the following code and my output passed the rosalind grader.

def mendel(x, y, z):
    #calculate the probability of recessive traits only
    total = x+y+z
    twoRecess = (z/total)*((z-1)/(total-1))
    twoHetero = (y/total)*((y-1)/(total-1))
    heteroRecess = (z/total)*(y/(total-1)) + (y/total)*(z/(total-1))

    recessProb = twoRecess + twoHetero*1/4 + heteroRecess*1/2
    print(1-recessProb) # take the complement

#mendel(2, 2, 2)

with open ("rosalind_iprb.txt", "r") as file:
    line =file.readline().split()
    x, y, z= [int(n) for n in line]
    print(x, y, z)
file.close()
print(mendel(x, y, z))

score 1 · Answer 3 · answered Jul 27 '17 at 18:44

Klaus's solution has most of it correct; however, the error occurs when calculating the number of combinations that have at least one dominant allele. This part is incorrect, because while there are 4 possibilities when combining 2 alleles to form an offspring, only one possibility is actually executed. Therefore, Klaus's solution calculates a percentage that is markedly higher than it should be.

The correct way to calculate the number of combos of organisms with at least one dominant allele is the following:

# k  = number of homozygous dominant organisms
# n = number of heterozygous organisms
# m  = number of homozygous recessive organisms

dom_total = comb(k, 2) + k*m + k*n + .5*m*n + .75*comb(m, 2)

# Instead of:
# 4*comb(k,2) + 4*k*n + 4*k*m + 3*comb(n,2) + 2*n*m

The above code segment works for calculating the total number of dominant combos because it multiplies each part by the percentage (100% being 1) that it will produce a dominant offspring. You can think of each part as being the number of punnet squares for combos of each type (k&k, k&m, k&n, m&n, m&m).

So the entire correct code segment would look like this:

# Import comb (combination operation) from the scipy library 
from scipy.special import comb

def calculateProbability(k, m, n):
    # Calculate total number of organisms in the population:
    totalPop = k + m + n 
    # Calculate the number of combos that could be made (valid or not):
    totalCombos = comb(totalPop, 2)
    # Calculate the number of combos that have a dominant allele therefore are valid:
    validCombos = comb(k, 2) + k*m + k*n + .5*m*n + .75*comb(m, 2)
    probability = validCombos/totalCombos
    return probability

# Example Call: 
calculateProbability(2, 2, 2)
# Example Output: 0.783333333333

score 1 · Answer 4 · answered Sep 12 '17 at 08:08

You dont need to run thousands of simulations in while loop. You can run one simulation, and calculate probability from it results.

from itertools import product

k = 2  # AA  homozygous dominant
m = 2  # Aa  heterozygous
n = 2  # aa  homozygous recessive

population = (['AA'] * k) + (['Aa'] * m) + (['aa'] * n)

all_children = []
for parent1 in population:
    # remove selected parent from population.
    chosen = population[:]
    chosen.remove(parent1)
    for parent2 in chosen:
        # get all possible children from 2 parents. Punnet square
        children = product(parent1, parent2)
        all_children.extend([''.join(c) for c in children])

dominants = filter(lambda c: 'A' in c, all_children)
# float for python2
print float(len(list(dominants))) / len(all_children)
# 0.7833333

score 1 · Answer 5 · answered Apr 07 '22 at 13:50

Here I am adding my answer to explain it more clearly: We don't want the offspring to be completely recessive, so we should make the probability tree and look at the cases and the probabilities of the cases that event might happen. Then the probability that we want is 1 - p_reccesive. More explanation is provided in the comment section of the following code.

"""
Let d: dominant, h: hetero, r: recessive
Let a = k+m+n
Let X = the r.v. associated with the first person randomly selected
Let Y = the r.v. associated with the second person randomly selected without replacement
Then:
k = f_d => p(X=d) = k/a => p(Y=d| X=d) = (k-1)/(a-1) ,
                           p(Y=h| X=d) = (m)/(a-1) ,
                           p(Y=r| X=d) = (n)/(a-1)
                           
m = f_h => p(X=h) = m/a => p(Y=d| X=h) = (k)/(a-1) ,
                           p(Y=h| X=h) = (m-1)/(a-1)
                           p(Y=r| X=h) = (n)/(a-1)
                           
n = f_r => p(X=r) = n/a => p(Y=d| X=r) = (k)/(a-1) ,
                           p(Y=h| X=r) = (m)/(a-1) ,
                           p(Y=r| X=r) = (n-1)/(a-1)
Now the joint would be:
                            |    offspring possibilites given X and Y choice
-------------------------------------------------------------------------
X Y |  P(X,Y)               |   d(dominant)     h(hetero)   r(recessive)
-------------------------------------------------------------------------
d d     k/a*(k-1)/(a-1)     |    1               0           0
d h     k/a*(m)/(a-1)       |    1/2            1/2          0
d r     k/a*(n)/(a-1)       |    0               1           0
                            |
h d     m/a*(k)/(a-1)       |    1/2            1/2          0
h h     m/a*(m-1)/(a-1)     |    1/4            1/2         1/4
h r     m/a*(n)/(a-1)       |    0              1/2         1/2
                            |
r d     n/a*(k)/(a-1)       |    0               0           0
r h     n/a*(m)/(a-1)       |    0               1/2        1/2
r r     n/a*(n-1)/(a-1)     |    0               0           1

Here what we don't want is the element in the very last column where the offspring is completely recessive.
so P = 1 - those situations as follow
"""

path = 'rosalind_iprb.txt'

with open(path, 'r') as file:
    lines = file.readlines()
k, m, n = [int(i) for i in lines[0].split(' ')]
a = k + m + n
p_recessive = (1/4*m*(m-1) + 1/2*m*n + 1/2*m*n + n*(n-1))/(a*(a-1))
p_wanted = 1 - p_recessive
p_wanted = round(p_wanted, 5)
print(p_wanted)

score 0 · Answer 6 · edited Nov 18 '18 at 22:21

I just found the formula for the answer. You have 8 possible mating interactions that can yield a dominant offspring:

DDxDD, DDxDd, DdxDD, DdxDd, DDxdd, ddxDD, Ddxdd, ddxDd

With the respective probabilities of producing dominant offspring of:

1.0, 1.0, 1.0, 0.75, 1.0, 1.0, 0.5, 0.5

Initially it seemed odd to me that DDxdd and ddxDD were two separate mating events, but if you think about it they are slightly different conceptually. The probability of DDxdd is k/(k+m+n) * n/((k-1)+m+n) and the probability of ddxDD is n/(k+m+n) * k/(k+m+(n-1)). Mathematically these are identical, but speaking from a probability stand point these are two separate events. So your total probability is the sum of the probabilities of each of these different mating events multiplied by the probability of that mating event producing a dominant offspring. I won't simplify it here step by step but that gives you the code:

total_probability = ((k ** 2 - k) + (2 * k * m) + (3 / 4 * (m ** 2 - m)) + (2* k * n) + (m * n)) / (total_pop ** 2 - total_pop)

All you need to do is plug in your values of k, m, and n and you'll get the probability they ask for.

score 0 · Answer 7 · answered Jan 20 '19 at 13:21

I doubt that this assignment was designed to take hours to solve. What am I thinking too hard?

I also had the same question. After reading the whole thread, I came up with the code.

I hope the code itself will explain the probability calculation:

def get_prob_of_dominant(k, m, n):
    # A - dominant factor
    # a - recessive factor
    # k - amount of organisms with AA factors (homozygous dominant)
    # m - amount of organisms with Aa factors (heterozygous)
    # n - amount of organisms with aa factors (homozygous recessive)
    events = ['AA+Aa', 'AA+aa', 'Aa+aa', 'AA+AA', 'Aa+Aa', 'aa+aa']

    # get the probability of dominant traits (set up Punnett square)
    punnett_probabilities = {
        'AA+Aa': 1,
        'AA+aa': 1,
        'Aa+aa': 1 / 2,
        'AA+AA': 1,
        'Aa+Aa': 3 / 4,
        'aa+aa': 0,
    }
    event_probabilities = {}
    totals = k + m + n

    # Event: AA+Aa -> P(X=k, Y=m) + P(X=m, Y=k):
    P_km = k / totals * m / (totals - 1)
    P_mk = m / totals * k / (totals - 1)
    event_probabilities['AA+Aa'] = P_km + P_mk

    # Event: AA+aa -> P(X=k, Y=n) + P(X=n, Y=k):
    P_kn = k / totals * n / (totals - 1)
    P_nk = n / totals * k / (totals - 1)
    event_probabilities['AA+aa'] = P_kn + P_nk

    # Event: Aa+aa -> P(X=m, Y=n) +P(X=n, Y=m):
    P_mn = m / totals * n / (totals - 1)
    P_nm = n / totals * m / (totals - 1)
    event_probabilities['Aa+aa'] = P_mn + P_nm

    # Event: AA+AA -> P(X=k, Y=k):
    P_kk = k / totals * (k - 1) / (totals - 1)
    event_probabilities['AA+AA'] = P_kk

    # Event: Aa+Aa -> P(X=m, Y=m):
    P_mm = m / totals * (m - 1) / (totals - 1)
    event_probabilities['Aa+Aa'] = P_mm

    # Event: aa+aa -> P(X=n, Y=n) + P(X=n, Y=n) = 0 (will be * 0, so just don't use it)
    event_probabilities['aa+aa'] = 0

    # Total probability is the sum of (prob of dominant factor * prob of the event)
    total_probability = 0
    for event in events:
        total_probability += punnett_probabilities[event] * event_probabilities[event]
    return round(total_probability, 5)

Rosalind "Mendel's First Law" IPRB

7 Answers7

Linked