0

I came across this problem in an unrelated program I'm writing and I've spent quite a few hours trying to solve it because I though it would be fun. It was but I was unable to do it all the way. My code only solves the sequence of some subsets. This problem also feels like a general math problem that has probably been solved in a wide variety of ways over the decades but I lack the mathematical skills and terminology to find a solution or indeed anything about this specific problem online.

I have a set of subsequences that I know to be a part of a larger, unknown (super?)sequence. I don't think these subsequences are sets in the mathematical sense because they are ordered but they are similar in that they do not contain duplicate elements. Same goes for the master/super/whateversequence. (For clarity I will be referring to this as supersequence.)

The subsequences all contain the same type of data, however the data is not ordered alphabetically, in an ascending order, or anything like that. In a sense the data is in an arbitrary order: in that of the supersequence. And that is what I am interested in. I want to find the unknown supersequence of these subsequences.

For the sake of simplicity I've tried to solve this problem using letters of the alphabet, but I can later refactor the code to suit my needs. Obviously because I'm still trying to solve this problem I started by coming up with a suitable word for supersequence that does not contain duplicate elements: FLOWCHARTS.

Then I came up with the following six subsequences:

F,W,C,R
L,H,A
L,O,H,A,R,S
C,S
R,T,S
F,O,W,H,A,S

Here's my sequence ordering method:

// LinkedHashMappedKeyValueList keeps the data in the order it was inserted and allows one key to have multiple values.
private static LinkedHashSet<Character> orderSequence(final Set<Character> unorderedSequence, final LinkedHashMappedKeyValueList ruleMap)
{
    List<Character> orderedSequence = new ArrayList<Character>(unorderedSequence);

    // Order the sequence according to the rules.
    System.out.println("---- ORDERING SEQUENCE ----");

    for (Map.Entry<Character, LinkedHashSet<Character>> rule : ruleMap.entrySet())
    {
        char currentChar = rule.getKey();
        LinkedHashSet<Character> ruleChars = rule.getValue();

        System.out.println("Processing rule " + currentChar + "<" + ruleChars.toString());

        if (orderedSequence.contains(currentChar))
        {
            int ruleCharIndex = -1;
            int smallestRuleCharIndex = Integer.MAX_VALUE;
            Iterator<Character> it = ruleChars.iterator();

            // Find the rule character with the smallest index.
            while (it.hasNext())
            {
                char ruleChar = it.next();
                ruleCharIndex = orderedSequence.indexOf(ruleChar);
                System.out.println("\tChecking for rule character: " + ruleChar + " (" + ruleCharIndex + ")");

                if (ruleCharIndex > -1 && smallestRuleCharIndex > ruleCharIndex)
                    smallestRuleCharIndex = ruleCharIndex;
            }

            if (smallestRuleCharIndex != Integer.MAX_VALUE)
                System.out.println("\tMoving '" + currentChar + "' before '"
                        + orderedSequence.get(smallestRuleCharIndex) + "'.");
            else
                System.out.println("\tMoving '" + currentChar + "' to the end of the sequence.");

            if (!moveBefore(orderedSequence.indexOf(currentChar), smallestRuleCharIndex, orderedSequence))
                System.out.println("\tAlready in correct position.");
            else
                System.out.println("\tCurrent sequence: " + listToString(orderedSequence));
        }
        else
            throw new ArithmeticException("Element of a subsequence not a part of the sequence.");
    }

    return new LinkedHashSet<Character>(orderedSequence);
}

In the end my code found the supersequence F,L,O,W,H,C,A,R,T,S for these subsequences which is pretty close but not perfect. I also need to run my ordering method multiple times so the "algorithm" I came up with isn't perfect either. The "rule map" thing is a hash map where the key is another hash map of Character objects that came after the key Character in the subsequences (and thus in the supersequence).

Is there a Java library of some sort I could use that does this kind of sequence finding? Can someone point me in the right direction in terms of telling me what this is called and/or helping me find the right algorithms for the job?

Additionally, a shortened console output of my program:

---- BUILDING RULE MAP ----
Subsequences:   F,W,C,R
        L,H,A
        L,O,H,A,R,S
        C,S
        R,T,S
        F,O,W,H,A,S

All subsequences processed. Number of ordering rules: 10
Rule map: (F<[W, O]),(W<[C, H]),(C<[R, S]),(R<[, S, T]),(L<[H, O]),(H<[A]),(A<[, R, S]),(O<[H, W]),(S<[]),(T<[S])

---- BUILDING UNORDERED SEQUENCE ----
Sequence size is 10.
Unordered sequence: F,W,C,R,L,H,A,O,S,T

---- ORDERING SEQUENCE ----
Processing rule F<[W, O]
    Moving 'F' before 'W'.
    Already in correct position.
Processing rule W<[C, H]
    Moving 'W' before 'C'.
    Already in correct position.
Processing rule C<[R, S]
    Moving 'C' before 'R'.
    Already in correct position.
Processing rule R<[, S, T]
    Moving 'R' before 'S'.
    Current sequence: F,W,C,L,H,A,O,R,S,T
Processing rule L<[H, O]
    Moving 'L' before 'H'.
    Already in correct position.
Processing rule H<[A]
    Moving 'H' before 'A'.
    Already in correct position.
Processing rule A<[, R, S]
    Moving 'A' before 'R'.
    Current sequence: F,W,C,L,H,O,A,R,S,T
Processing rule O<[H, W]
    Moving 'O' before 'W'.
    Current sequence: F,O,W,C,L,H,A,R,S,T
Processing rule S<[]
    Moving 'S' to the end of the sequence.
    Current sequence: F,O,W,C,L,H,A,R,T,S
Processing rule T<[S]
    Moving 'T' before 'S'.
    Already in correct position.
Previous sequence:  F,W,C,R,L,H,A,O,S,T
Ordered sequence:   F,O,W,C,L,H,A,R,T,S
Sequences match:    false

---- ORDERING SEQUENCE ----
Processing rule F<[W, O]
    Moving 'F' before 'O'.
    Already in correct position.
Processing rule W<[C, H]
    Moving 'W' before 'C'.
    Already in correct position.
Processing rule C<[R, S]
    Moving 'C' before 'R'.
    Current sequence: F,O,W,L,H,A,C,R,T,S
Processing rule R<[, S, T]
    Moving 'R' before 'T'.
    Already in correct position.
Processing rule L<[H, O]
    Moving 'L' before 'O'.
    Current sequence: F,L,O,W,H,A,C,R,T,S
Processing rule H<[A]
    Moving 'H' before 'A'.
    Already in correct position.
Processing rule A<[, R, S]
    Moving 'A' before 'R'.
    Current sequence: F,L,O,W,H,C,A,R,T,S
Processing rule O<[H, W]
    Moving 'O' before 'W'.
    Already in correct position.
Processing rule S<[]
    Moving 'S' to the end of the sequence.
    Already in correct position.
Processing rule T<[S]
    Moving 'T' before 'S'.
    Already in correct position.
Previous sequence:  F,O,W,C,L,H,A,R,T,S
Ordered sequence:   F,L,O,W,H,C,A,R,T,S
Sequences match:    false

---- ORDERING SEQUENCE ----
Processing rule F<[W, O]
    Moving 'F' before 'O'.
    Current sequence: L,F,O,W,H,C,A,R,T,S
Processing rule W<[C, H]
    Moving 'W' before 'H'.
    Already in correct position.
Processing rule C<[R, S]
    Moving 'C' before 'R'.
    Current sequence: L,F,O,W,H,A,C,R,T,S
Processing rule R<[, S, T]
    Moving 'R' before 'T'.
    Already in correct position.
Processing rule L<[H, O]
    Moving 'L' before 'O'.
    Current sequence: F,L,O,W,H,A,C,R,T,S
Processing rule H<[A]
    Moving 'H' before 'A'.
    Already in correct position.
Processing rule A<[, R, S]
    Moving 'A' before 'R'.
    Current sequence: F,L,O,W,H,C,A,R,T,S
Processing rule O<[H, W]
    Moving 'O' before 'W'.
    Already in correct position.
Processing rule S<[]
    Moving 'S' to the end of the sequence.
    Already in correct position.
Processing rule T<[S]
    Moving 'T' before 'S'.
    Already in correct position.
Previous sequence:  F,L,O,W,H,C,A,R,T,S
Ordered sequence:   F,L,O,W,H,C,A,R,T,S
Sequences match:    true
Sequence ordered according to the limits of the rule map.
Sequence found after 2 tries.

Expected sequence:  F,L,O,W,C,H,A,R,T,S FLOWCHARTS
Found sequence:     F,L,O,W,H,C,A,R,T,S FLOWHCARTS
Sequences match:    false
  • Is this in the realm of bioinformatics, or just generally math ? I'm struggling to see the big picture here – Caffeinated Nov 15 '15 at 19:37
  • may want to look into trees and just searching for specific data structures – Caffeinated Nov 15 '15 at 19:38
  • 1
    It's possible that your inputs that you chose don't allow you to calculate the complete picture. It is an interesting problem, I'll try to chug on it a bit. – markspace Nov 15 '15 at 19:38
  • 4
    None of your example inputs give you information that `C` comes before `H`. So there is no algorithm that will help you here. – Oliver Charlesworth Nov 15 '15 at 19:39
  • 1
    Though FWIW, you can model your input as a (simple) DAG. If there is a unique "solution", then you should be able to obtain it by performing a topological sort on that DAG. – Oliver Charlesworth Nov 15 '15 at 19:43
  • The information missing in the above set of sequences is `FC->R->T->S`, `F->O->W->H->A->R->T->S` and `L->O->W->H->A->R->T->S`. There is more information needed to calculate the remainder. – user1952500 Nov 15 '15 at 20:14

2 Answers2

1

What you are asking for is to compute a total order from a partial order. I could not find much work on this. However we can discuss the problem a bit here.

Consider A<B<C<D. If we have the sequences A<C, B<D and C<D, we will never be able to compute the total order. We only get A<C<D and B<D.

I think it can be proven that we will need all of the N-1 relations of the form X<Y with X and Y present consecutively in the final chain in order to reconstruct the total order (there may be additional ones but those are extra information). As a non-rigorous demonstration, suppose we have A1<A2<A3<...<AN and suppose we are able to reconstruct from the partial orders A_begin to A_end. Now in order to fit this into the right position in the total-order, we will need to know that A_(begin-1)<A_begin. There is no other relation that will let it fit into the total order. Continuing this downwards into A_begin..A_end we should be able to show by some sort of induction / infinite descent that we will need all the relations given by the consecutive characters of the word in order to reconstruct it.

The information missing in the above set of sequences is F<L and C<H. The sequences that can be obtained are W->C->R->T->S, F->O->W->H->A->R->T->S and L->O->W->H->A->R->T->S. There is more information needed to calculate the remainder.

In the above case we have, after decomposition and duplicate elimination the following relations:

A,R
A,S <-- redundant since R<S and A<R
C,R
C,S <-- redundant since R<S and A<R
F,O 
F,W <-- redundant since O<W and F<O
H,A
L,H <-- redundant since O<H and L<O
L,O
O,H <-- redundant since O<W and W<H
O,W
R,S <-- redundant since T<S and R<T
R,T
T,S
W,C
W,H

There are 16 relations of which 6 are immediately redundant. Removing redundancies we get the following 10 relations:

A,R <-- consecutive letters in actual word
C,R
F,O 
H,A <-- consecutive letters in actual word
L,O <-- consecutive letters in actual word
O,W <-- consecutive letters in actual word
R,T <-- consecutive letters in actual word
T,S <-- consecutive letters in actual word
W,C <-- consecutive letters in actual word
W,H 

The only ones missing from the original sequence are F<L and C<H. The additional given relations C<R, F<O and W,H have repeated LHS or RHS and give un-actionable information (basically these link two partially ordered chains but not at end-points, so you know a chain is to be merged or is less than another but don't know where).

There are multiple ways to implement this once the missing relations are added. Your code may work by itself.

user1952500
  • 6,611
  • 3
  • 24
  • 37
0

The problem I am describing in my question is the shortest common supersequence problem. I did not search this online because I was too focused on mathematical sets while working on the code, and only once I had started writing out my question did I realize that I was working with sequences. In fact I thought I just made up the word "supersequence" on the spot. Turns out that is precisely the word I needed to search online to find several pages of material related to this problem.

Oliver Charlesworth's comment is absolutely correct. The given subsequences lack information to make the correct supersequence so in my example, there is no way of finding the actual supersequence. Coffee's comment about bioinformatics most likely refers to shortest common superstring problem which has applications in reconstruction of DNA sequences, something discussed in this question. The shortest common supersequence and superstring problems are quite similar at first look, however, in the superstring problem the substrings must be composed of adjacent elements of the superstring unlike with the supersequence problem. (Illustrated below.)

Difference between the shortest common supersequence and superstring problems.

Both problems are also NP-complete or "NP-hard". The way I understand it, is that these problems do not have an optimal solution or some magical algorithm that you can just copy-paste into your code. There are only approximations and "good enough" solutions.

Surprisingly enough I was not able to find full Java code that approximates this problem, but I did come across a few sites with code in other languages. I've listed below some useful resources I found while researching this problem. I have also included resources related to the shortest common superstring problem as it is related.

Additional resources for further study