3

I got a challenge to write an algorithm in Java that calculates how many possible DNA chains a string can form. The string can contain these 5 characters: A, G, C, T, ?

A ? in the string can be A, G, C or T, but a ? may not create a pair (two equal adjacent characters) in the string. For example, in the string "A?G" the ? can only be C or T. There can be any number of question marks, including several in a row, since they all become characters in the end.

The function signature is:

public static int chains(String base) {
    // return the amount of chains
}

If the base string is "A?C?", there are 6 possible combinations: AGCA, AGCG, AGCT, ATCA, ATCG, ATCT.

Cases (??? - 36) (AGAG - 1) (A???T - 20)

(? - 4) (A? - 3) (?A - 3) (?? - 12) (A?A - 3) (A?C - 2) ...

Max length of the given base(pohja) string is 10!

Criteria: 1. Combinations that have two identical characters in a row are illegal, so those don't count.

What I have so far:

    public static int chains(String pohja) {
    int sum = 1;
    int length = pohja.length();
    char[] arr = pohja.toCharArray();
    int questionMarks = 0;


    if (length == 1) {
        if (pohja.equals("?"))
            return 4;
        else
            return 1;
    } else if (length == 2) {
        boolean allQuestionMarks = true;
        for (int i = 0; i < 2; i++) {
            if (arr[i] != '?')
                allQuestionMarks = false;
            else
                questionMarks++;
        }

        if (allQuestionMarks) return 12;
        if (questionMarks == 1) {
            return 3;
        } else {
            return 2;
        }
    } else {
        questionMarks = 0;

        for (int i = 0; i < length; i++) {
            if (arr[i] == '?') questionMarks++;
        }

        for (int i = 1; i < length - 1; i++) {
            boolean leftIsLetter = isLetter(arr[i - 1]);
            boolean rightIsLetter = isLetter(arr[i + 1]);
            boolean sameSides = false;

            if (arr[i - 1] == arr[i + 1]) sameSides = true;

            if (arr[i] != '?') { // middle is char
                if (leftIsLetter && rightIsLetter) { // letter(left) == letter(right)
                    if (sameSides) {
                        // Do nothing!
                    } else {
                        sum *= 3;
                    }
                } else if (!leftIsLetter && !rightIsLetter) { // !letter(left) == !letter(right)

                } else { // letter(left) != letter(right)

                }
            } else { // Middle is ?
                if (leftIsLetter && rightIsLetter) { // letter(left) == letter(right)
                    if (sameSides) {
                        sum *= 3;
                    } else {
                        sum *= 2;
                    }
                } else if (!leftIsLetter && !rightIsLetter) { // !letter(left) == !letter(right)
                    sum *= 9;
                } else { // letter(left) != letter(right)
                    if (arr[i - 1] == '?') { // ? is on the left

                    } else { // ? is on the right
                        sum *= 2;
                    }
                }
            }
        }
    }

    return sum;
}

public static boolean isLetter(char c) {
    boolean isLetter = false;
    char[] dna = { 'A', 'G', 'C', 'T' };

    for (int i = 0; i < 4; i++) {
        if (c == dna[i]) isLetter = true;
    }

    return isLetter;
}

Yeah, I know, my code is a mess. If the length of pohja (base) is 3 or more, my algorithm checks 3 characters at a time and modifies sum depending on those characters.

Could anyone give a hint on how I can solve this? :) Thanks in advance, TuukkaX.

Yytsi
  • Can you show the code/work you have done until now? – burglarhobbit Sep 15 '15 at 17:30
  • @Mr.Robot Fine. Just a second. – Yytsi Sep 15 '15 at 17:31
  • is 4 the max length for the input string? You are looking for the number of possible combinations or for the actual combinations too? – Luigi Cortese Sep 15 '15 at 17:36
  • @LuigiCortese It can be infinite, but 10 is max in the challenge I'm facing. – Yytsi Sep 15 '15 at 17:37
  • @LuigiCortese I'm looking for all the possible combinations you can form from the given string with specified criteria that I have mentioned. with those question marks on the play, it becomes more difficult and more combinations appear. Let me make it little more clear in the post. – Yytsi Sep 15 '15 at 17:41
  • The edge cases you listed are probably unnecessary if you refactor the code slightly, with a more straightforward solution. – Sh4d0wsPlyr Sep 15 '15 at 17:48
  • @Sh4d0wsPlyr I know :/ I'm sure there is a much more straightforward & working solution but I just don't see it right now. – Yytsi Sep 15 '15 at 17:49
  • Currently your post literally doesn't contain any other question than: please fix this code for me. That's not a concise question and therefore not a good fit for StackOverflow. – Maarten Bodewes Sep 15 '15 at 18:49
  • @MaartenBodewes I think you're wrong. See "Could anyone give a hint on how I can solve this?". – Yytsi Sep 15 '15 at 19:52
  • I'm glad you got an answer TuukkaX, but this is not a **specific** question nor is it a question that is of use to anybody else. – Maarten Bodewes Sep 15 '15 at 20:03
  • @MaartenBodewes I disagree. You may be right that this is not a specific question, but tomorrow, when I understand what pandatyr told me or when he posts the code (if he does), I will edit my question with the proper code that solved my problem, so that everyone else with a 'similar' problem can benefit from it. – Yytsi Sep 15 '15 at 20:10
  • That would only make matters worse. Questions are for questions, not for solutions. – Maarten Bodewes Sep 15 '15 at 20:23
  • @MaartenBodewes You're right, that was my mistake. – Yytsi Sep 15 '15 at 20:25

3 Answers

5

Note: I will keep this somewhat vague since you only asked for a hint. If you would like me to elaborate further, feel free to ask.

What you need to know in order to solve this mathematically is the number of substitutions you can make for each sequence of question marks (e.g. the base string "A?GCT???T?G" has three sequences of question marks: two containing one question mark each, and one with three). In a situation like this, the total number of substitutions is equal to the product of the numbers of substitutions you can make for each of the sequences.

Simple example: In the string "A?G?", the first question mark can be replaced by two characters, while the second one can be replaced by three. So overall, that's 2*3 = 6 legal possibilities.

The challenge in calculating the result like this lies in finding out how to calculate the number of substitutions you can make for longer sequences of question marks. I'll give you one last tip and include the solution as a spoiler: the number of legal substitutions depends on the characters before and after the question marks. I'll leave finding out in which way up to you, though.

Clarification on that:

The amount of substitutions you can make depends on whether the characters before and after the question marks are equal or not. For instance, "A??A" has a total of 6 legal possibilities and "A??G" has 7. This needs to be taken into consideration.
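The two counts quoted above can be checked by brute force. The following is only a sketch for verification, not part of the suggested solution: it tries every letter for every question mark and rejects any filling with two equal neighbours. The class and method names (`BruteForceChains`, `countChains`) are made up for this illustration, and it assumes that a base string which already contains two equal fixed letters next to each other counts as 0.

```java
class BruteForceChains {

    private static final char[] LETTERS = {'A', 'G', 'C', 'T'};

    // Counts the fillings of base in which no two neighbouring characters are equal.
    static int countChains(String base) {
        return fill(base.toCharArray(), 0);
    }

    // Positions before pos are already letters; fill the rest recursively.
    private static int fill(char[] chars, int pos) {
        if (pos == chars.length) {
            return 1; // reached the end without creating a pair
        }
        if (chars[pos] != '?') {
            // Assumption: a fixed letter equal to its left neighbour makes the chain illegal.
            if (pos > 0 && chars[pos] == chars[pos - 1]) {
                return 0;
            }
            return fill(chars, pos + 1);
        }
        int total = 0;
        for (char letter : LETTERS) {
            if (pos > 0 && chars[pos - 1] == letter) {
                continue; // would repeat the left neighbour
            }
            chars[pos] = letter;
            total += fill(chars, pos + 1);
        }
        chars[pos] = '?'; // restore the question mark for the caller
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countChains("A??A")); // 6
        System.out.println(countChains("A??G")); // 7
    }
}
```

With at most 10 characters this explores at most 4 * 3^9 fillings, so it is fast enough to serve as a reference when testing a cleverer solution.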

And here's how to work that into a solution:

Now, how to solve something like "A????A"? Remember, total number of substitutions = product of substitutions for each individual sequence. "A????A" is a sequence of four question marks, and the characters before and after them are equal. There are three legal possibilities for the second character (the first question mark), and each of them leaves "[G|C|T]???A", that is, a sequence of three question marks whose previous and following characters are not equal. You can keep doing this recursively to get the total number of possible result strings. Keep in mind that question marks at the very start and end of the base string require special treatment.

In case you still can't work it out, I'll give you a possible header to a method to calculate the amount of legal substitutions for a sequence:

 private int calcSequenceSubs(int length, boolean prevFollEqual)

And this could be the body:

if (prevFollEqual){
    if (length == 1) return 3;
    else return 3 * calcSequenceSubs(length-1, false);
} else {
    if (length == 1) return 2;
    else return 2 * calcSequenceSubs(length-1, false) + calcSequenceSubs(length-1, true);
}

Edit (simplified version without spoilers):

The amount of legal solutions for the entire string is equal to the product of the amount of solutions for each sequence of question marks. For instance, "A?A?A" has two sequences of question marks and each of them has three legal substitutions, so the entire string has a total of 3*3 = 9 legal solutions.

So, what needs to be done is:

  1. Search the string for sequences of question marks
  2. Calculate the amount of possible solutions for each sequence
  3. Multiply all of these

The tricky part is actually calculating the number of legal substitutions for each of the sequences. These depend on two things: the length of the sequence (obviously) and whether the characters before and after the sequence are equal (a single question mark, for instance, has 3 possible outcomes when the previous and following character are equal and two otherwise).

Now, for longer sequences, the total number of legal substitutions can be calculated recursively. For instance, "A??T" is a sequence of two question marks and the previous and following characters are not equal. The first question mark can be replaced by either T, G or C, resulting in either "T?T", "G?T" or "C?T". Two of those are sequences of one question mark where the previous and following character are not equal and one of them is a sequence of one question mark where the previous and following character are equal.

The pattern for the recursive algorithm is always the same - if the previous and following character of the sequence are not equal, two of the options result in a sequence where previous and following character are different and one where they're the same. Likewise, when the previous and following character in the original sequence were equal, all 3 of the options result in the next step being a sequence where previous and following character are different.

A code example of a possible solution:

public static int DNAChains(String base) {

if (base == null || base.length() == 0) {
    return 0;
}

int curSequence = 0;
int totalSolutions = 1;
boolean inSequence = false;
//letters bordering the current sequence of question marks
//the placeholder values are overwritten before they are used
char prevChar = 'x';
char follChar = 'y';
int i = 0;

char[] chars = base.toCharArray();

//handle starting sequence if present
while (i < chars.length && chars[i] == '?') {
    curSequence++;
    i++;
}

if (curSequence > 0) {

    //a string consisting only of ?'s needs separate treatment (else branch)
    if (i < chars.length) {
        //? at the edge can be anything, so 3*false, 1*true
        //if length is 1 though, there are just 3 solutions
        totalSolutions *= (curSequence > 1) ? 3 * solveSequence(curSequence - 1, false) + solveSequence(curSequence - 1, true) : 3;
        curSequence = 0;
    } else {
        //result is 4*3^(length-1)
        totalSolutions = 4* ((int) Math.pow(3, chars.length-1));
    }
}

//check for sequences of question marks
for (; i < chars.length; i++) {

    if (chars[i] == '?') {
        if (!inSequence) {
            inSequence = true;
            prevChar = chars[i - 1];
        }
        curSequence++;
    } else if (inSequence) {
        inSequence = false;
        follChar = chars[i];
        totalSolutions *= solveSequence(curSequence, prevChar == follChar);
        curSequence = 0;
    }

}

//check if last sequence ends at the edge of the string
//if it does, handle edge case like in the beginning
if (inSequence) {
    //? at the edge can be anything, so 3*false, 1*true
    //if length is 1 though, there are just 3 solutions
    totalSolutions *= (curSequence > 1) ? 3 * solveSequence(curSequence - 1, false) + solveSequence(curSequence - 1, true) : 3;
}

return totalSolutions;
}//end DNAChains

private static int solveSequence(int length, boolean prevFollEqual) {

if (prevFollEqual) {
    //anchor
    if (length == 1) {
        return 3;
    } else {
        return 3 * solveSequence(length - 1, false);
    }
} else {
    //anchor
    if (length == 1) {
        return 2;
    } else {
        return 2 * solveSequence(length - 1, false) + solveSequence(length - 1, true);
    }
}
}//end solveSequence

I didn't test this thoroughly, but it seems to work. I managed to deal with the edge cases as well (not 100% sure whether I got all of them, though).

Pandatyr
  • Thanks for the answer! I'll try it ASAP if I can understand everything you told. – Yytsi Sep 15 '15 at 18:47
  • If something's unclear, feel free to ask. I feel like I made it seem more complicated than it actually is – Pandatyr Sep 15 '15 at 18:52
  • For using recursion, and for finding an edge case I did not see, +1. – Sh4d0wsPlyr Sep 15 '15 at 18:52
  • You know, I would probably understand the things you told there and work it out just like that, but since it's 9:54 here in Finland and I have tortured my brains with this problem for the last 6 hours if not more, I cannot figure it out even a bit. So could you simplify it a bit? :D Give more code if possible? Thanks so much! +1 – Yytsi Sep 15 '15 at 18:56
  • Fair enough, I'll add a solution in proper code formatting. – Pandatyr Sep 15 '15 at 19:10
  • @Pandatyr I changed your spoiler tags into quotes and code blocks. If you click "edit" you will be able to see the formatting behind the post. For instance, whereas spoilers start with `>!`, quotes start with `>` and code blocks are indented by 4 spaces on every line. Thanks! – Maximillian Laumeister Sep 15 '15 at 19:30
1

First, find all sections containing ?s:

[Image: an example chain with its sections of ?s marked]

Sections can fall into 4 categories depending on how they are surrounded by non-? genes:

  1. Same genes on both sides
  2. Different genes on both sides
  3. Gene on only 1 side
  4. No genes on either side

One can easily see that each chain can have a maximum of 2 category-3 sections, and that a category-4 section can only exist in a chain containing only ?s.

If you could compute how many ways you can fill each section, then you would only need to multiply these numbers and you are done. In this example, sections 1, 2 and 3 can be filled in 21, 7 and 81 ways, for a total of 21*7*81 = 11 907 ways to fill this chain.

How to compute how many ways to fill each section?

Let's start with categories 3 and 4, since they are way easier. For category 4 (chain full of ?s) we have 4*3^(n-1) ways to fill it (n is the length). Why? Because the first gene can be any gene (4 choices) and each of the others can be any of the 3 genes other than the one before it (3 choices).

For category 3, the result is 3^n. If the section is at the end (like in the example), we fill from left to right, having 3 choices at each step. The same holds when the section is at the beginning, but we fill from right to left.
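The two easy categories translate almost directly into code. This is a minimal sketch with made-up names (`EdgeSections`, `pow3`, `allQuestionMarks`, `oneSided`); it uses plain `int` arithmetic, which is enough for the maximum length of 10 mentioned in the question.

```java
class EdgeSections {

    // 3 to the power n, computed with integers to avoid floating point rounding.
    static int pow3(int n) {
        int result = 1;
        for (int i = 0; i < n; i++) {
            result *= 3;
        }
        return result;
    }

    // Category 4: the whole chain consists of n question marks (n >= 1).
    static int allQuestionMarks(int n) {
        return 4 * pow3(n - 1);
    }

    // Category 3: n question marks with a fixed gene on exactly one side.
    static int oneSided(int n) {
        return pow3(n);
    }

    public static void main(String[] args) {
        System.out.println(allQuestionMarks(3)); // "???" gives 36
        System.out.println(allQuestionMarks(2)); // "??" gives 12
        System.out.println(oneSided(1));         // "A?" or "?A" gives 3
    }
}
```

The printed values agree with the cases listed in the question (??? - 36, ?? - 12, A? - 3).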

The problem is with categories 1 and 2. Let's define S(n) as the number of ways to fill a section of n ?'s surrounded by the Same gene on both sides, and D(n) as the number of ways when it is surrounded by Different genes. For n>=2 we can put these two into a relationship:

S(n) = 3*D(n-1)
D(n) = 2*D(n-1) + S(n-1)

[Image: relationship between S(n) and D(n)]
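One way to evaluate the two recurrences without repeated recursive calls is to fill two arrays bottom-up. The sketch below assumes the base cases S(1) = 3 and D(1) = 2 (a single ? between equal genes has 3 choices, between different genes 2); the names `SectionTable` and `build` are invented for this example.

```java
class SectionTable {

    // Returns {same, diff} where same[n] = S(n) and diff[n] = D(n)
    // for 1 <= n <= maxLen. Index 0 is unused.
    static long[][] build(int maxLen) {
        long[] same = new long[maxLen + 1];
        long[] diff = new long[maxLen + 1];
        if (maxLen >= 1) {
            same[1] = 3; // one ? between equal genes: any of the other 3
            diff[1] = 2; // one ? between different genes: neither of the 2
        }
        for (int n = 2; n <= maxLen; n++) {
            same[n] = 3 * diff[n - 1];
            diff[n] = 2 * diff[n - 1] + same[n - 1];
        }
        return new long[][] {same, diff};
    }

    public static void main(String[] args) {
        long[][] table = build(10);
        for (int n = 1; n <= 10; n++) {
            System.out.println("n=" + n + " S=" + table[0][n] + " D=" + table[1][n]);
        }
    }
}
```

Each entry is computed once from the previous row, so the table for any length n costs O(n) additions and multiplications.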

You can implement these without any further investigation, but don't do it with simple recursion. Use lookup tables or something like that. Or you can use some math to find formulas for S(n) and D(n) that need no recursion:

By substitution you get D(n) = 2*D(n-1) + 3*D(n-2) for n>=3. We also know that D(1) = 2 and D(2) = 7 (computed by hand). Equations like these can be solved by several techniques; I used matrix exponentiation from linear algebra. These are the results:

D(n) = (1/4)(3*3^n + (-1)^n)
S(n) = (3/4)(3^n - (-1)^n)
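The closed forms can be compared against the recurrences with a few lines of code. This is only a sanity check under the base cases S(1) = 3 and D(1) = 2; `ClosedForm`, `power`, `diffClosed` and `sameClosed` are hypothetical names. The divisions by 4 are exact because 3^n and (-1)^n always leave the same remainder modulo 4.

```java
class ClosedForm {

    // Integer power by repeated multiplication (exp >= 0).
    static long power(long base, int exp) {
        long result = 1;
        for (int i = 0; i < exp; i++) {
            result *= base;
        }
        return result;
    }

    // D(n) = (1/4)(3*3^n + (-1)^n)
    static long diffClosed(int n) {
        return (3 * power(3, n) + power(-1, n)) / 4;
    }

    // S(n) = (3/4)(3^n - (-1)^n)
    static long sameClosed(int n) {
        return 3 * (power(3, n) - power(-1, n)) / 4;
    }

    public static void main(String[] args) {
        long same = 3; // S(1)
        long diff = 2; // D(1)
        for (int n = 1; n <= 10; n++) {
            boolean match = same == sameClosed(n) && diff == diffClosed(n);
            System.out.println("n=" + n + " S=" + same + " D=" + diff + " match=" + match);
            long nextSame = 3 * diff;        // S(n+1) = 3*D(n)
            long nextDiff = 2 * diff + same; // D(n+1) = 2*D(n) + S(n)
            same = nextSame;
            diff = nextDiff;
        }
    }
}
```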
kajacx
0

I have not done nearly enough work to be convinced of my answer, but if you are looking for a straightforward answer this "may" work. You will have to test and verify it yourself. I have also given some pseudo code to help better explain my idea.

Basically at every point in the chain you have a number of options, and multiplying them gives the maximum number of possible combinations you can make. For instance the string "GC?" has the corresponding values "1 X 1 X 3 = 3 combinations". Similarly the value "A???T" has the values "1 X 3 X 3 X 2 X 1 = 18 combinations". Presuming of course my understanding of this is correct and I have not missed something obvious. This means you should be able to calculate the value at every point, given you know the prerequisites. So set up some rules for our code to follow as such.

  1. Any Character is automatically and always equal to 1.
  2. A question mark is between 3 and 2 depending on location.
  3. Any question mark in the middle of other question marks automatically has the value of 3.
  4. Any question mark constrained on only one side by a character is 3.
  5. A question mark constrained on both sides will have one value less than 3 (e.g. at least one question mark must be 2).
  6. Please note a special case for any substring of ?'s that start and end with the same character (e.g. A???A).

Note: I think that was all the cases. If anyone wants to try to confirm feel free.

So some pseudo code might look like...

int currentValue = 1;
for (each character) {
    if (character is fixed)
        currentValue *= 1;
    else if (character is question mark)
        //... find the proper case, might have to create a look ahead function
}

I might recommend some sort of Boolean to track the left side of the equation (e.g. turn on a flag when you encounter a character, so you don't need to create a look-ahead or behind function).
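A related way to "track the left side" without any look-ahead is to remember, for each of the four letters, how many legal fillings of the string so far end in that letter. This is a different technique from the rule list above (a small dynamic programme over the last letter), sketched here with invented names (`LastLetterCount`, `chains`, `allowed`). It assumes that an empty or null string counts as 0 and that equal fixed letters next to each other make the chain illegal.

```java
class LastLetterCount {

    private static final String LETTERS = "AGCT";

    static long chains(String base) {
        if (base == null || base.isEmpty()) {
            return 0; // assumption: an empty string forms no chain
        }
        // ways[c] = number of legal fillings of the prefix that end in LETTERS[c]
        long[] ways = new long[4];
        for (int c = 0; c < 4; c++) {
            ways[c] = allowed(base.charAt(0), c) ? 1 : 0;
        }
        for (int pos = 1; pos < base.length(); pos++) {
            long total = ways[0] + ways[1] + ways[2] + ways[3];
            long[] next = new long[4];
            for (int c = 0; c < 4; c++) {
                if (allowed(base.charAt(pos), c)) {
                    // any previous filling works unless it already ends in the same letter
                    next[c] = total - ways[c];
                }
            }
            ways = next;
        }
        return ways[0] + ways[1] + ways[2] + ways[3];
    }

    // A position accepts a letter if it is a question mark or is that exact letter.
    static boolean allowed(char symbol, int letterIndex) {
        return symbol == '?' || symbol == LETTERS.charAt(letterIndex);
    }

    public static void main(String[] args) {
        System.out.println(chains("GC?"));   // 3
        System.out.println(chains("A???T")); // 20
    }
}
```

Because the count is kept per ending letter, the special case from rule 6 (same character on both sides of the question marks) needs no separate handling.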

Sh4d0wsPlyr
  • I'll give it a try now. – Yytsi Sep 15 '15 at 18:06
  • I'll add more cases to my question so you'll get a better view at this. – Yytsi Sep 15 '15 at 18:12
  • I am guessing I am missing an edge case somewhere in there - but for the life of me I cannot see what it is. – Sh4d0wsPlyr Sep 15 '15 at 18:30
  • Do you mean that you don't understand how this whole chain thing works? If that is the case, just try to get as many combinations you can with the cases I have mentioned to paper, I promise that you'll end up with the same number of combinations that are in my post. If that is NOT the case, thanks for trying! :) – Yytsi Sep 15 '15 at 18:36