
I am struggling with a "find supersequence" algorithm.

The input is a set of strings:

String A = "caagccacctacatca";
String B = "cgagccatccgtaaagttg";
String C = "agaacctgctaaatgctaga";

The result should be a properly aligned set of strings (the next step would be to merge them):

String E = "ca ag cca  cc ta    cat  c a";
String F = "c gag ccat ccgtaaa g  tt  g";
String G = " aga acc tgc  taaatgc t a ga";

Thank you for any advice (I have been stuck on this task for more than a day).

After the merge, the superstring would be:

cagagaccatgccgtaaatgcattacga
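
Going the other way, an aligned row can be derived from a merged string. The following is only a sketch under that assumption: `AlignToSupersequence` and `align` are hypothetical names, and leftmost matching is just one of possibly several valid alignments.

```java
import java.util.Arrays;

// Sketch only: the class and method names are illustrative, not from the question.
class AlignToSupersequence {

    // Lays r out under s: each character of r goes into the column of the
    // leftmost not-yet-passed matching character of s, all other columns hold
    // a space. Returns null if r is not a subsequence of s.
    static String align(String r, String s) {
        char[] row = new char[s.length()];
        Arrays.fill(row, ' ');
        int i = 0; // next character of r to place
        for (int j = 0; j < s.length() && i < r.length(); j++) {
            if (r.charAt(i) == s.charAt(j)) {
                row[j] = r.charAt(i);
                i++;
            }
        }
        return i == r.length() ? new String(row) : null;
    }

    public static void main(String[] args) {
        String merged = "cagagaccatgccgtaaatgcattacga";
        String[] inputs = {"caagccacctacatca", "cgagccatccgtaaagttg", "agaacctgctaaatgctaga"};
        System.out.println(merged);
        for (String r : inputs) {
            System.out.println(align(r, merged));
        }
    }
}
```

For the three inputs above, this leftmost matching should give the rows E, F and G shown earlier, apart from trailing spaces.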

The definition of a supersequence in this case would be something like:

A string R is contained in a supersequence S if and only if all characters of R are present in S in the order in which they occur in R.
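
This definition can be checked with a single left-to-right scan. A minimal sketch, where `SubsequenceCheck` and `isSubsequence` are hypothetical names used only for illustration:

```java
// Sketch only: the class and method names are illustrative, not from the question.
class SubsequenceCheck {

    // Returns true if every character of r occurs in s in the same relative order.
    static boolean isSubsequence(String r, String s) {
        int i = 0; // index of the next character of r still to be matched
        for (int j = 0; j < s.length() && i < r.length(); j++) {
            if (r.charAt(i) == s.charAt(j)) {
                i++;
            }
        }
        return i == r.length();
    }

    public static void main(String[] args) {
        String s = "cagagaccatgccgtaaatgcattacga";
        String[] inputs = {"caagccacctacatca", "cgagccatccgtaaagttg", "agaacctgctaaatgctaga"};
        for (String r : inputs) {
            System.out.println(r + " contained: " + isSubsequence(r, s));
        }
    }
}
```

A candidate result is a valid supersequence exactly when this check succeeds for every input string.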


The "solution" I tried (and again, it's the wrong way of doing it) is:

import java.util.HashSet;
import java.util.Stack;

public class Solution4
{
    static  boolean[][] map = null;
    static int size = 0;

    public static void main(String[] args)
    {
        String A = "caagccacctacatca";
        String B = "cgagccatccgtaaagttg";
        String C = "agaacctgctaaatgctaga";

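        Stack<String> data = new Stack<>();
        data.push(A);
        data.push(B);
        data.push(C);


        // Stack.clone() returns Object, so cast back to the typed stack.
        @SuppressWarnings("unchecked")
        Stack<String> clone1 = (Stack<String>) data.clone();
        @SuppressWarnings("unchecked")
        Stack<String> clone2 = (Stack<String>) data.clone();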

        int length  =  26;
        size        =  max_size(data);

        System.out.println(size+" "+length);
        map = new boolean[26][size];

        char[] result = new char[size];

        HashSet<String> chunks = new HashSet<String>();
        while(!clone1.isEmpty())
        {
            String a = clone1.pop();

            char[] residue = make_residue(a);

            System.out.println("---");
            System.out.println("OLD     : "+a);
            System.out.println("RESIDUE : "+String.valueOf(residue));


            String[] r = String.valueOf(residue).split(" ");

            for(int i=0; i<r.length; i++)
            {
                if(r[i].equals(" ")) continue;
                //chunks.add(spaces.substring(0,i)+r[i]);
                chunks.add(r[i]);
            }
        }

        for(String chunk : chunks)
        {
            System.out.println("CHUNK   : "+chunk);
        }
    }

    static char[] make_residue(String candidate)
    {
        char[] result = new char[size];
        for(int i=0; i<candidate.length(); i++)
        {
            int pos = find_position_for(candidate.charAt(i),i);
            for(int j=i; j<pos; j++) result[j]=' ';
            if(pos==-1) result[candidate.length()-1] = candidate.charAt(i);
            else        result[pos] = candidate.charAt(i);
        }
        return result;
    }

    static int find_position_for(char character, int offset)
    {
        character-=((int)'a');

        for(int i=offset; i<size; i++)
        {
        //  System.out.println("checking "+String.valueOf((char)(character+((int)'a')))+" at "+i);
            if(!map[character][i])
            {
                map[character][i]=true;
                return i;
            }
        }
        return -1;
    }

    static String move_right(String a, int from)
    {
        return a.substring(0, from)+" "+a.substring(from);  
    }


    static boolean taken(int character, int position)
    { return map[character][position]; }

    static void take(char character, int position)
    {
        //System.out.println("taking "+String.valueOf(character)+" at "+position+" (char_index-"+(character-((int)'a'))+")");
        map[character-((int)'a')][position]=true;
    }

    static int max_size(Stack<String> stack)
    {
        int max=0;
        // Iterate instead of popping so the caller's stack is left intact.
        for(String s : stack)
        {
            if(s.length()>max) max=s.length();
        }

        return max;
    }

}
Jan Cajthaml
  • What have you tried and what are you having trouble with? We don't know what you don't know. – Peter Lawrey Oct 13 '13 at 14:59
  • the code would be too long and it was/is "the wrong way to do this". – Jan Cajthaml Oct 13 '13 at 15:00
  • What do you mean by "properly aligned"? – Peter Lawrey Oct 13 '13 at 15:05
  • Hey how do you decide which char should be added first.. taking the first 3-4 chars in the sequences i came up with two possible merges : , , < aga....> result would begin as : OR , , < a g...> in this case the result would begin as : Which one is to be given preference or does it not make a difference ? – Amol Oct 13 '13 at 15:25
  • Amol: It doesn't matter which character is first as long as the superstring is valid. Peter: By properly aligned I meant that overlapping subsequences are "in the same column" – Jan Cajthaml Oct 13 '13 at 15:41
  • A, B and C don't appear to have significant overlap, and there is no way to know the order. What would you suggest is the correct solution in that example? – Peter Lawrey Oct 13 '13 at 19:15
  • Your description is not fully clear. Is the right supersequence "cagagccacctacatcgataaagttgctaaatgctaga"? – Łukasz Rzeszotarski Oct 13 '13 at 20:20
  • This also seems to be the supersequence right: String A = "ca agcca cc tacat c a"; String B = "c ga gccatccgta a a g t tg"; String C = " agaa cc t g c taaatgcta ga"; cagaagccatccgtacataaatgctatga – Łukasz Rzeszotarski Oct 13 '13 at 20:41
  • Unfortunately there are many solutions which are one character shorter than the suggested one, so the shortest combination is going to give you what you expected. See my answer. – Peter Lawrey Oct 13 '13 at 22:55

2 Answers


You can try finding the shortest combination like this:

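// Needs: import java.util.ArrayList; import java.util.List;
// The members below are assumed to sit inside one enclosing class.
static final char[] CHARS = "acgt".toCharArray();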

public static void main(String[] ignored) {
    String A = "caagccacctacatca";
    String B = "cgagccatccgtaaagttg";
    String C = "agaacctgctaaatgctaga";
    String expected = "cagagaccatgccgtaaatgcattacga";

    List<String> ABC = new Combination(A, B, C).findShortest();
    System.out.println("expected: " + expected.length());
    System.out.println("Merged: " + ABC.get(0).length() + " " + ABC);
}

static class Combination {
    int shortest = Integer.MAX_VALUE;
    List<String> shortestStr = new ArrayList<>();
    char[][] chars;
    int[] pos;
    int count = 0;

    Combination(String... strs) {
        chars = new char[strs.length][];
        pos = new int[strs.length];
        for (int i = 0; i < strs.length; i++) {
            chars[i] = strs[i].toCharArray();
        }
    }

    public List<String> findShortest() {
        findShortest0(new StringBuilder(), pos);
        return shortestStr;
    }

    private void findShortest0(StringBuilder sb, int[] pos) {
        if (allDone(pos)) {
            if (sb.length() < shortest) {
                shortestStr.clear();
                shortest = sb.length();
            }
            if (sb.length() <= shortest)
                shortestStr.add(sb.toString());
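            // Count each complete candidate once (it was incremented twice, so the
            // progress line never fired). Progress goes to stderr to keep stdout clean.
            if (++count % 100 == 1)
                System.err.println("Searched " + count + " shortest " + shortest);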
            return;
        }
        if (sb.length() + maxLeft(pos) > shortest)
            return;
        int[] pos2 = new int[pos.length];
        int i = sb.length();
        sb.append(' ');
        for (char c : CHARS) {
            if (!tryChar(pos, pos2, c)) continue;
            sb.setCharAt(i, c);
            findShortest0(sb, pos2);
        }
        sb.setLength(i);
    }

    private int maxLeft(int[] pos) {
        int maxLeft = 0;
        for (int i = 0; i < pos.length; i++) {
            int left = chars[i].length - pos[i];
            if (left > maxLeft)
                maxLeft = left;
        }
        return maxLeft;
    }

    private boolean allDone(int[] pos) {
        for (int i = 0; i < chars.length; i++)
            if (pos[i] < chars[i].length)
                return false;
        return true;
    }

    private boolean tryChar(int[] pos, int[] pos2, char c) {
        boolean matched = false;
        for (int i = 0; i < chars.length; i++) {
            pos2[i] = pos[i];
            if (pos[i] >= chars[i].length) continue;
            if (chars[i][pos[i]] == c) {
                pos2[i]++;
                matched = true;
            }

        }
        return matched;
    }
}

This prints many solutions that are one character shorter than the suggested one:

expected: 28
Merged: 27 [acgaagccatccgctaaatgctatcga, acgaagccatccgctaaatgctatgca, acgaagccatccgctaacagtgctaga, acgaagccatccgctaacatgctatga, acgaagccatccgctaacatgcttaga, acgaagccatccgctaacatgtctaga, acgaagccatccgctacaagtgctaga, acgaagccatccgctacaatgctatga, acgaagccatccgctacaatgcttaga, acgaagccatccgctacaatgtctaga, acgaagccatcgcgtaaatgctatcga, acgaagccatcgcgtaaatgctatgca, acgaagccatcgcgtaacagtgctaga, acgaagccatcgcgtaacatgctatga, acgaagccatcgcgtaacatgcttaga, acgaagccatcgcgtaacatgtctaga, acgaagccatcgcgtacaagtgctaga, acgaagccatcgcgtacaatgctatga, acgaagccatcgcgtacaatgcttaga, acgaagccatcgcgtacaatgtctaga, acgaagccatgccgtaaatgctatcga, acgaagccatgccgtaaatgctatgca, acgaagccatgccgtaacagtgctaga, acgaagccatgccgtaacatgctatga, acgaagccatgccgtaacatgcttaga, acgaagccatgccgtaacatgtctaga, acgaagccatgccgtacaagtgctaga, acgaagccatgccgtacaatgctatga, acgaagccatgccgtacaatgcttaga, acgaagccatgccgtacaatgtctaga, cagaagccatccgctaaatgctatcga, cagaagccatccgctaaatgctatgca, cagaagccatccgctaacagtgctaga, cagaagccatccgctaacatgctatga, cagaagccatccgctaacatgcttaga, cagaagccatccgctaacatgtctaga, cagaagccatccgctacaagtgctaga, cagaagccatccgctacaatgctatga, cagaagccatccgctacaatgcttaga, cagaagccatccgctacaatgtctaga, cagaagccatcgcgtaaatgctatcga, cagaagccatcgcgtaaatgctatgca, cagaagccatcgcgtaacagtgctaga, cagaagccatcgcgtaacatgctatga, cagaagccatcgcgtaacatgcttaga, cagaagccatcgcgtaacatgtctaga, cagaagccatcgcgtacaagtgctaga, cagaagccatcgcgtacaatgctatga, cagaagccatcgcgtacaatgcttaga, cagaagccatcgcgtacaatgtctaga, cagaagccatgccgtaaatgctatcga, cagaagccatgccgtaaatgctatgca, cagaagccatgccgtaacagtgctaga, cagaagccatgccgtaacatgctatga, cagaagccatgccgtaacatgcttaga, cagaagccatgccgtaacatgtctaga, cagaagccatgccgtacaagtgctaga, cagaagccatgccgtacaatgctatga, cagaagccatgccgtacaatgcttaga, cagaagccatgccgtacaatgtctaga, cagagaccatccgctaaatgctatcga, cagagaccatccgctaaatgctatgca, cagagaccatccgctaacagtgctaga, cagagaccatccgctaacatgctatga, cagagaccatccgctaacatgcttaga, cagagaccatccgctaacatgtctaga, cagagaccatccgctacaagtgctaga, cagagaccatccgctacaatgctatga, 
cagagaccatccgctacaatgcttaga, cagagaccatccgctacaatgtctaga, cagagaccatcgcgtaaatgctatcga, cagagaccatcgcgtaaatgctatgca, cagagaccatcgcgtaacagtgctaga, cagagaccatcgcgtaacatgctatga, cagagaccatcgcgtaacatgcttaga, cagagaccatcgcgtaacatgtctaga, cagagaccatcgcgtacaagtgctaga, cagagaccatcgcgtacaatgctatga, cagagaccatcgcgtacaatgcttaga, cagagaccatcgcgtacaatgtctaga, cagagaccatgccgtaaatgctatcga, cagagaccatgccgtaaatgctatgca, cagagaccatgccgtaacagtgctaga, cagagaccatgccgtaacatgctatga, cagagaccatgccgtaacatgcttaga, cagagaccatgccgtaacatgtctaga, cagagaccatgccgtacaagtgctaga, cagagaccatgccgtacaatgctatga, cagagaccatgccgtacaatgcttaga, cagagaccatgccgtacaatgtctaga, cagagccatcctagctaaagtgctaga, cagagccatcctagctaaatgctatga, cagagccatcctagctaaatgcttaga, cagagccatcctagctaaatgtctaga, cagagccatcctgactaaagtgctaga, cagagccatcctgactaaatgctatga, cagagccatcctgactaaatgcttaga, cagagccatcctgactaaatgtctaga, cagagccatcctgctaaatgctatcga, cagagccatcctgctaaatgctatgca, cagagccatcctgctaacagtgctaga, cagagccatcctgctaacatgctatga, cagagccatcctgctaacatgcttaga, cagagccatcctgctaacatgtctaga, cagagccatcctgctacaagtgctaga, cagagccatcctgctacaatgctatga, cagagccatcctgctacaatgcttaga, cagagccatcctgctacaatgtctaga]

Peter Lawrey
  • hey, just a small suggestion.. wouldn't adding a "break", after setting "found = true" be better. Since we have found a character for position i why loop through the other strings ? – Amol Oct 13 '13 at 15:18
  • I think that merging is not a problem here. The problem is to find minimal 'aligned' partitioning and at the end minimal supersequence. – Łukasz Rzeszotarski Oct 13 '13 at 20:57
  • @ŁukaszRzeszotarski I agree, I think I understand the problem now and it appears you need to know what the merged string is first, so you can derive the sub sequences. – Peter Lawrey Oct 13 '13 at 21:38
  • @PeterLawrey Not exactly, because finding the merged sequence is the goal of this task. See my answer. – Łukasz Rzeszotarski Oct 13 '13 at 22:07
  • @ŁukaszRzeszotarski In my solution, I find the shortest combinations and from that you can determine the "properly aligned" strings – Peter Lawrey Oct 13 '13 at 23:06
  • @PeterLawrey If you are able to find the SCS using brute force, that is OK. But what you write, that you are then able to get the properly aligned strings, does not make much sense to me. As I understand the problem and the algorithm proposed by the author, the properly aligned strings are only a first step towards finding the merged string, which is the requested SCS string. So why would you want to search for the properly aligned strings when you already have the final merged string? – Łukasz Rzeszotarski Oct 13 '13 at 23:31
  • @ŁukaszRzeszotarski yes, the shortest string you pick determines which "properly aligned" arrangement matches that. The fact there are many solutions which have the same shortest length suggests there is not one answer to this problem (or there is not enough information to get a unique solution) – Peter Lawrey Oct 14 '13 at 00:04

Finding any common supersequence is not a difficult task:

For your example, a possible solution would be something like:

public class SuperSequenceTest {

public static void main(String[] args) {
    String A = "caagccacctacatca";
    String B = "cgagccatccgtaaagttg";
    String C = "agaacctgctaaatgctaga";

    int iA = 0;
    int iB = 0;
    int iC = 0;

    char[] a = A.toCharArray();
    char[] b = B.toCharArray();
    char[] c = C.toCharArray();


    StringBuilder sb = new StringBuilder();

    while (iA < a.length || iB < b.length || iC < c.length) {
        if (iA < a.length && iB < b.length && iC < c.length && (a[iA] == b[iB]) && (a[iA] == c[iC])) {
            sb.append(a[iA]);
            iA++;
            iB++;
            iC++;
        }
        else if (iA < a.length && iB < b.length && a[iA] == b[iB]) {
            sb.append(a[iA]);
            iA++;
            iB++;
        }
        else if (iA < a.length && iC < c.length && a[iA] == c[iC]) {
            sb.append(a[iA]);
            iA++;
            iC++;
        }
        else if (iB < b.length && iC < c.length && b[iB] == c[iC]) {
            sb.append(b[iB]);
            iB++;
            iC++;
        } else {
            if (iC < c.length) {
                sb.append(c[iC]);
                iC++;
            }
            else if (iB < b.length) {
                sb.append(b[iB]);
                iB++;
            } else if (iA < a.length) {
                sb.append(a[iA]);
                iA++;
            }
        }
    }
    System.out.println("SUPERSEQUENCE " + sb.toString());
}

}
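
The same greedy idea can be written for any number of strings. The sketch below is not the code above but a swapped-in variant usually called a majority merge: at each step it appends the character that is currently at the front of the most inputs. `MajorityMerge`, `merge` and `isSubsequence` are hypothetical names, and the result is a valid common supersequence but not necessarily the shortest one.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: a majority-merge heuristic, not a shortest-supersequence solver.
class MajorityMerge {

    static String merge(List<String> inputs) {
        int n = inputs.size();
        int[] pos = new int[n]; // pos[i] = next unconsumed index in inputs.get(i)
        StringBuilder sb = new StringBuilder();
        while (true) {
            char best = 0;
            int bestCount = 0;
            // Find the front character shared by the largest number of inputs.
            for (int i = 0; i < n; i++) {
                if (pos[i] >= inputs.get(i).length()) continue;
                char c = inputs.get(i).charAt(pos[i]);
                int count = 0;
                for (int k = 0; k < n; k++) {
                    if (pos[k] < inputs.get(k).length() && inputs.get(k).charAt(pos[k]) == c) count++;
                }
                if (count > bestCount) {
                    bestCount = count;
                    best = c;
                }
            }
            if (bestCount == 0) break; // every input is fully consumed
            sb.append(best);
            // Advance every input whose front character was just emitted.
            for (int k = 0; k < n; k++) {
                if (pos[k] < inputs.get(k).length() && inputs.get(k).charAt(pos[k]) == best) pos[k]++;
            }
        }
        return sb.toString();
    }

    // Helper for checking the result: true if r is a subsequence of s.
    static boolean isSubsequence(String r, String s) {
        int i = 0;
        for (int j = 0; j < s.length() && i < r.length(); j++) {
            if (r.charAt(i) == s.charAt(j)) i++;
        }
        return i == r.length();
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("caagccacctacatca", "cgagccatccgtaaagttg", "agaacctgctaaatgctaga");
        String merged = merge(inputs);
        System.out.println("SUPERSEQUENCE " + merged + " (" + merged.length() + ")");
    }
}
```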

However, the real problem to solve is the well-known Shortest Common Supersequence problem (http://en.wikipedia.org/wiki/Shortest_common_supersequence), which is not that easy.
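
For just two strings, the shortest common supersequence can be computed exactly with a standard dynamic-programming table. The sketch below shows that two-string case and then folds it over a list of inputs; the folding step is only a heuristic and is not guaranteed to give the shortest result for three or more strings. `PairwiseScs`, `scs` and `fold` are hypothetical names.

```java
// Sketch only: exact for two strings, heuristic when folded over more.
class PairwiseScs {

    // Shortest common supersequence of x and y in O(n*m) time and space.
    static String scs(String x, String y) {
        int n = x.length(), m = y.length();
        // len[i][j] = length of the SCS of the suffixes x[i..] and y[j..]
        int[][] len = new int[n + 1][m + 1];
        for (int i = n; i >= 0; i--) {
            for (int j = m; j >= 0; j--) {
                if (i == n) len[i][j] = m - j;
                else if (j == m) len[i][j] = n - i;
                else if (x.charAt(i) == y.charAt(j)) len[i][j] = 1 + len[i + 1][j + 1];
                else len[i][j] = 1 + Math.min(len[i + 1][j], len[i][j + 1]);
            }
        }
        // Walk the table from the start, emitting one character per step.
        StringBuilder sb = new StringBuilder();
        int i = 0, j = 0;
        while (i < n && j < m) {
            if (x.charAt(i) == y.charAt(j)) {
                sb.append(x.charAt(i));
                i++;
                j++;
            } else if (len[i + 1][j] <= len[i][j + 1]) {
                sb.append(x.charAt(i));
                i++;
            } else {
                sb.append(y.charAt(j));
                j++;
            }
        }
        // One of the two strings is exhausted; append what remains of the other.
        sb.append(x, i, n).append(y, j, m);
        return sb.toString();
    }

    // Heuristic for several strings: merge them pairwise, left to right.
    static String fold(String... strs) {
        String result = "";
        for (String s : strs) {
            result = scs(result, s);
        }
        return result;
    }

    public static void main(String[] args) {
        String merged = fold("caagccacctacatca", "cgagccatccgtaaagttg", "agaacctgctaaatgctaga");
        System.out.println(merged + " (" + merged.length() + ")");
    }
}
```

The result of `fold` depends on the order of the inputs, so trying several orders and keeping the shortest result is a cheap improvement.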

There is a lot of research on the topic.

See for instance:

http://www.csd.uwo.ca/~lila/pdfs/Towards%20a%20DNA%20solution%20to%20the%20Shortest%20Common%20Superstring%20Problem.pdf

http://www.ncbi.nlm.nih.gov/pubmed/14534185

Łukasz Rzeszotarski