I have been working on my bio-tech project and have been stuck on this for a long time.
Idea: generate DNA sequences from a set of probabilities.
As a sample, I took a given DNA string of length 128 and worked out the conditional probabilities: the probability that C will follow A, and so on.
I have computed these for all 16 possible pairs, and now I have to rebuild a DNA sequence based on these probabilities.
To make the idea clear, below is my code in Java:
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class Generator {
    static HashMap<String, Integer> countMap = new HashMap<>();
    static int totalA = 0;
    static int totalC = 0;
    static int totalG = 0;
    static int totalT = 0;

    public static void learnChains(String sequence) {
        // Note: split() matches non-overlapping occurrences, so a run such as
        // "AAA" is counted as one "AA", not two.
        // A followed by *
        countMap.put("AT", sequence.split(Pattern.quote("AT"), -1).length - 1);
        countMap.put("AA", sequence.split(Pattern.quote("AA"), -1).length - 1);
        countMap.put("AG", sequence.split(Pattern.quote("AG"), -1).length - 1);
        countMap.put("AC", sequence.split(Pattern.quote("AC"), -1).length - 1);
        // C followed by *
        countMap.put("CT", sequence.split(Pattern.quote("CT"), -1).length - 1);
        countMap.put("CA", sequence.split(Pattern.quote("CA"), -1).length - 1);
        countMap.put("CG", sequence.split(Pattern.quote("CG"), -1).length - 1);
        countMap.put("CC", sequence.split(Pattern.quote("CC"), -1).length - 1);
        // G followed by *
        countMap.put("GT", sequence.split(Pattern.quote("GT"), -1).length - 1);
        countMap.put("GA", sequence.split(Pattern.quote("GA"), -1).length - 1);
        countMap.put("GG", sequence.split(Pattern.quote("GG"), -1).length - 1);
        countMap.put("GC", sequence.split(Pattern.quote("GC"), -1).length - 1);
        // T followed by *
        countMap.put("TT", sequence.split(Pattern.quote("TT"), -1).length - 1);
        countMap.put("TA", sequence.split(Pattern.quote("TA"), -1).length - 1);
        countMap.put("TG", sequence.split(Pattern.quote("TG"), -1).length - 1);
        countMap.put("TC", sequence.split(Pattern.quote("TC"), -1).length - 1);
        // Print the map.
        System.out.println(countMap);
        // Sum the counts per leading character,
        // e.g. total[A] = count[AA] + count[AC] + count[AG] + count[AT].
        for (Map.Entry<String, Integer> e : countMap.entrySet()) {
            if (e.getKey().startsWith("A")) {
                totalA += e.getValue();
            }
            if (e.getKey().startsWith("C")) {
                totalC += e.getValue();
            }
            if (e.getKey().startsWith("G")) {
                totalG += e.getValue();
            }
            if (e.getKey().startsWith("T")) {
                totalT += e.getValue();
            }
        }
        System.out.println(totalA);
        System.out.println(totalC);
        System.out.println(totalG);
        System.out.println(totalT);
    }
}
The output is as follows:
{AA=7, CC=9, GG=8, TT=3, AC=10, CG=10, AG=7, GT=8, TA=8, TC=5, CT=5, AT=9, TG=9, GA=12, GC=6, CA=5}
33
29
34
25
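For illustration, the counts printed above could be turned into conditional probabilities roughly as follows. This is only a sketch: the class name `TransitionProbabilities`, the `normalize` helper, and the row/column matrix layout are assumptions made for the example and are not part of the `Generator` class.

```java
final class TransitionProbabilities {
    static final String BASES = "ACGT";

    // Counts copied from the printed map: rows are the current base,
    // columns are the next base, both in the order A, C, G, T.
    static final int[][] COUNTS = {
        {7, 10, 7, 9},   // AA, AC, AG, AT (total 33)
        {5, 9, 10, 5},   // CA, CC, CG, CT (total 29)
        {12, 6, 8, 8},   // GA, GC, GG, GT (total 34)
        {8, 5, 9, 3},    // TA, TC, TG, TT (total 25)
    };

    /** Divides each row by its sum, giving P(next | current). */
    static double[][] normalize(int[][] counts) {
        double[][] p = new double[counts.length][];
        for (int i = 0; i < counts.length; i++) {
            int total = 0;
            for (int c : counts[i]) {
                total += c;
            }
            p[i] = new double[counts[i].length];
            for (int j = 0; j < counts[i].length; j++) {
                // A row with no observations is left as all zeros.
                p[i][j] = total == 0 ? 0.0 : (double) counts[i][j] / total;
            }
        }
        return p;
    }

    public static void main(String[] args) {
        double[][] p = normalize(COUNTS);
        for (int i = 0; i < BASES.length(); i++) {
            for (int j = 0; j < BASES.length(); j++) {
                System.out.printf("P(%c | %c) = %.3f%n",
                        BASES.charAt(j), BASES.charAt(i), p[i][j]);
            }
        }
    }
}
```

With these counts, the probability that C follows A is 10/33, roughly 0.303, and each row of the resulting table sums to 1.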
Here is where I am getting stuck:
I have to generate a random string one character at a time. Start with a random character. Suppose it is “A”. Generate the next character as follows:
choose next character “A” with probability count[AA]/total[A]
choose next character “C” with probability count[AC]/total[A]
choose next character “G” with probability count[AG]/total[A]
choose next character “T” with probability count[AT]/total[A]
If the next character is “T”, then use count[T*]/total[T] as the probability of generating each character * next, and so on.
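The steps above can be sketched as a weighted random choice repeated once per character. This is a minimal sketch under stated assumptions, not a definitive implementation: the class `SequenceSampler`, its methods `sampleIndex` and `generate`, and the count matrix layout (rows are the current base, columns the next base, in the order A, C, G, T) are hypothetical names introduced for illustration. It uses `java.util.Random` so that a seed can be supplied; drawing the number with `Math.random()` scaled by the row total would work along the same lines.

```java
import java.util.Random;

final class SequenceSampler {
    static final String BASES = "ACGT";

    /** Picks index i with probability weights[i] / sum(weights). Assumes non-negative weights. */
    static int sampleIndex(int[] weights, Random random) {
        int total = 0;
        for (int w : weights) {
            total += w;
        }
        if (total <= 0) {
            throw new IllegalArgumentException("weights must have a positive sum");
        }
        int r = random.nextInt(total); // uniform integer in [0, total)
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r < 0) {
                return i; // r fell inside the slice belonging to index i
            }
        }
        throw new AssertionError("unreachable when total > 0");
    }

    /** Generates a sequence where each character depends only on the previous one. */
    static String generate(int[][] counts, int length, Random random) {
        if (length <= 0) {
            return "";
        }
        StringBuilder sb = new StringBuilder(length);
        int current = random.nextInt(BASES.length()); // uniformly random first character
        sb.append(BASES.charAt(current));
        while (sb.length() < length) {
            // The row for the current base holds count[current, *];
            // dividing by the row total is implicit in sampleIndex.
            current = sampleIndex(counts[current], random);
            sb.append(BASES.charAt(current));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Counts taken from the printed output in this post.
        int[][] counts = {
            {7, 10, 7, 9},   // AA, AC, AG, AT
            {5, 9, 10, 5},   // CA, CC, CG, CT
            {12, 6, 8, 8},   // GA, GC, GG, GT
            {8, 5, 9, 3},    // TA, TC, TG, TT
        };
        System.out.println(generate(counts, 128, new Random()));
    }
}
```

Working with the raw counts avoids floating-point rounding: drawing an integer below total[A] and walking through count[AA], count[AC], count[AG], count[AT] selects each next character with exactly the probabilities listed above.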
I have been trying to generate the random strings using Math.random(), but haven't been successful yet.
Any help with this would be highly appreciated.