
I have been working on my biotech project and have been stuck on this for a long time.
Idea: generate DNA sequences from a set of probabilities.
- As a sample, I took a DNA string of length 128 and worked out the conditional probabilities, e.g. the probability that C follows A, and so on.
I have computed the counts for all 16 possible pairs, and now I have to build a new DNA sequence based on these probabilities. To make the idea clear, here is my code in Java:

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class Generator {
    static HashMap<String, Integer> countMap = new HashMap<>();
    static int totalA = 0;
    static int totalC = 0;
    static int totalG = 0;
    static int totalT = 0;

    public static void learnChains(String sequence) {
        // A followed by *
        countMap.put("AT", sequence.split(Pattern.quote("AT"), -1).length - 1);
        countMap.put("AA", sequence.split(Pattern.quote("AA"), -1).length - 1);
        countMap.put("AG", sequence.split(Pattern.quote("AG"), -1).length - 1);
        countMap.put("AC", sequence.split(Pattern.quote("AC"), -1).length - 1);

        // C followed by *
        countMap.put("CT", sequence.split(Pattern.quote("CT"), -1).length - 1);
        countMap.put("CA", sequence.split(Pattern.quote("CA"), -1).length - 1);
        countMap.put("CG", sequence.split(Pattern.quote("CG"), -1).length - 1);
        countMap.put("CC", sequence.split(Pattern.quote("CC"), -1).length - 1);

        // G followed by *
        countMap.put("GT", sequence.split(Pattern.quote("GT"), -1).length - 1);
        countMap.put("GA", sequence.split(Pattern.quote("GA"), -1).length - 1);
        countMap.put("GG", sequence.split(Pattern.quote("GG"), -1).length - 1);
        countMap.put("GC", sequence.split(Pattern.quote("GC"), -1).length - 1);

        // T followed by *
        countMap.put("TT", sequence.split(Pattern.quote("TT"), -1).length - 1);
        countMap.put("TA", sequence.split(Pattern.quote("TA"), -1).length - 1);
        countMap.put("TG", sequence.split(Pattern.quote("TG"), -1).length - 1);
        countMap.put("TC", sequence.split(Pattern.quote("TC"), -1).length - 1);

        // Print the map.
        System.out.println(countMap);

        for (Map.Entry<String, Integer> e : countMap.entrySet()) {
            if (e.getKey().startsWith("A")) {
                totalA += e.getValue(); // Let total[A] = count[AA] + count[AC] + count[AG] + count[AT]
            }
            if (e.getKey().startsWith("C")) {
                totalC += e.getValue();
            }
            if (e.getKey().startsWith("G")) {
                totalG += e.getValue();
            }
            if (e.getKey().startsWith("T")) {
                totalT += e.getValue();
            }
        }
        // Print the per-letter totals.
        System.out.println(totalA);
        System.out.println(totalC);
        System.out.println(totalG);
        System.out.println(totalT);
    }
}

The output is:

{AA=7, CC=9, GG=8, TT=3, AC=10, CG=10, AG=7, GT=8, TA=8, TC=5, CT=5, AT=9, TG=9, GA=12, GC=6, CA=5}
33
29
34
25
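
One thing worth checking in these numbers: the four totals add up to 33 + 29 + 34 + 25 = 121, while a string of length 128 has 127 adjacent pairs. That gap is consistent with `split` counting only non-overlapping matches, so a run such as "AAA" contributes one "AA" instead of two. A minimal sketch of counting every adjacent pair by walking the string, assuming that overlapping pairs should be counted (the class and method names here are hypothetical, not part of the code above):

```java
import java.util.Map;
import java.util.TreeMap;

final class PairCounter {
    /** Counts every adjacent pair, including overlapping ones such as the two "AA" in "AAA". */
    static Map<String, Integer> countPairs(String sequence) {
        Map<String, Integer> counts = new TreeMap<>();
        // Slide a window of width 2 over the sequence, one position at a time.
        for (int i = 0; i + 1 < sequence.length(); i++) {
            counts.merge(sequence.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countPairs("AAAC")); // prints {AA=2, AC=1}
    }
}
```

With this approach the counts for a sequence of length n always sum to n - 1, so each total[X] equals the number of times X is followed by some character.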


This is where I am stuck:
I have to generate a random string one character at a time. Start with a random character. Suppose it is “A”. Generate the next character as follows:


choose next character “A” with probability count[AA]/total[A]
choose next character “C” with probability count[AC]/total[A]
choose next character “G” with probability count[AG]/total[A]
choose next character “T” with probability count[AT]/total[A]

If the next character is “T”, then use count[T*]/total[T] as the probability of generating character * next, and so on.
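
With the counts above, starting from “A” this means picking A, C, G, T with probabilities 7/33, 10/33, 7/33 and 9/33. One possible way to sketch this step: draw a uniform integer r in [0, total) and subtract each count in turn until r drops below zero. The sketch below uses `java.util.Random` rather than `Math.random()`, and the class name, method names, and the uniform fallback for a character with no observed successor are all assumptions made for illustration:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

final class ChainSampler {
    static final String ALPHABET = "ACGT";

    /** Picks the character that follows `current`, weighted by count[current + next]. */
    static char nextChar(char current, Map<String, Integer> countMap, Random rng) {
        int total = 0;
        for (char c : ALPHABET.toCharArray()) {
            total += countMap.getOrDefault("" + current + c, 0);
        }
        if (total == 0) {
            // Assumption: with no observed successor, fall back to a uniform choice.
            return ALPHABET.charAt(rng.nextInt(ALPHABET.length()));
        }
        int r = rng.nextInt(total); // uniform in [0, total)
        for (char c : ALPHABET.toCharArray()) {
            r -= countMap.getOrDefault("" + current + c, 0);
            if (r < 0) {
                return c; // r landed inside this character's slice of the total
            }
        }
        throw new IllegalStateException("unreachable: counts sum to total");
    }

    /** Builds a sequence of the given length, starting from a uniformly random character. */
    static String generate(int length, Map<String, Integer> countMap, Random rng) {
        if (length <= 0) {
            return "";
        }
        StringBuilder sb = new StringBuilder(length);
        char current = ALPHABET.charAt(rng.nextInt(ALPHABET.length()));
        sb.append(current);
        while (sb.length() < length) {
            current = nextChar(current, countMap, rng);
            sb.append(current);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Counts copied from the output shown in the question.
        String observed = "AA=7,AC=10,AG=7,AT=9,CA=5,CC=9,CG=10,CT=5,"
                + "GA=12,GC=6,GG=8,GT=8,TA=8,TC=5,TG=9,TT=3";
        Map<String, Integer> countMap = new HashMap<>();
        for (String entry : observed.split(",")) {
            String[] kv = entry.split("=");
            countMap.put(kv[0], Integer.parseInt(kv[1]));
        }
        System.out.println(generate(128, countMap, new Random()));
    }
}
```

Working with integer counts avoids computing the fractions explicitly; the same loop would work with `Math.random() * total` as the draw if a floating-point version is preferred.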


I have been trying to generate the random strings using Math.random(), but haven't been successful yet.
Any help with this would be highly appreciated.

anony_std
  • To be clear, does count[AC] in ‘choose next character “C” with probability count[AC]/total[A]’ mean the number of AC substrings in the string we are currently building? – Rohan Sharma Nov 24 '20 at 17:31
  • No, it refers to the string we are learning from. As you can see in the output, the HashMap holds the number of times AC appears in that sequence. To generate a new random string, if the first character is A, we have to use the conditional probabilities associated with A (i.e. AT, AC, AG, AA) to generate the next character. – anony_std Nov 24 '20 at 17:35
  • Your notations are unclear. Can you clarify what this means for instance: "choose next character “A” with probability count[AA]/total[A]"? – Sathimantha Malalasekera Nov 24 '20 at 20:50
  • @Cenfracee I have stated in the code as well as in the description that count[AA] is the number of times the substring AA appears in the sequence, and total[A] is the sum count[AA] + count[AC] + count[AT] + count[AG], i.e. the counts of all substrings that start with A followed by any of the 4 characters. – anony_std Nov 24 '20 at 21:28
