Find a substitution that sorts the list

Question

Consider the following words:

PINEAPPLE
BANANA
ARTICHOKE
TOMATO

The goal is to sort the list (in lexicographical order) without moving the words themselves, using only letter substitution. In this example, I can swap the letters P and A (every P becomes A and every A becomes P), which gives:

AINEPAALE
BPNPNP
PRTICHOKE
TOMPTO

This is a list in lexicographical order. If you switch letters, the letters will be switched in all words. It is worth noting that you can use the whole alphabet, not just the letters in the words in the list.
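The swap above can be checked mechanically. This is only a small sketch, assuming uppercase words; the helper name `apply_substitution` is made up for illustration:

```python
def apply_substitution(words, mapping):
    """Apply a letter-to-letter mapping to every word; unmapped letters stay unchanged."""
    table = str.maketrans(mapping)
    return [word.translate(table) for word in words]

words = ['PINEAPPLE', 'BANANA', 'ARTICHOKE', 'TOMATO']
swapped = apply_substitution(words, {'P': 'A', 'A': 'P'})
print(swapped)
print(swapped == sorted(swapped))  # True when the substituted list is in order
```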

I spent considerable time with this problem, but was not able to think of anything other than brute forcing it (trying all letter switch combinations) nor was I able to come up with the conditions that define when the list can be sorted.
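For reference, the brute-force approach might look roughly like the sketch below. It assumes that permuting only the letters that actually occur is enough (only their relative order matters), and the name `brute_force` is hypothetical. It takes factorial time, so it is only usable on tiny inputs:

```python
from itertools import permutations

def brute_force(words):
    """Try every reordering of the letters that occur; return a sorting substitution or None."""
    letters = sorted(set(''.join(words)))
    for perm in permutations(letters):
        mapping = dict(zip(letters, perm))
        table = str.maketrans(mapping)
        candidate = [w.translate(table) for w in words]
        # neighbours may be equal, so use <= rather than <
        if all(a <= b for a, b in zip(candidate, candidate[1:])):
            return mapping
    return None

print(brute_force(['ABC', 'ABB', 'ABD']))
print(brute_force(['AB', 'AA', 'AB']))
```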

Some more examples:

ABC
ABB
ABD

can be turned into

ACB
ACC
ACD

which satisfies the condition.

Eric Zhang · Accepted Answer · 2017-07-24T20:56:48.910

Let's assume the problem is possible for a particular case, just for now. Also, for simplicity, assume all the words are distinct (if two words are identical, they must be adjacent and one can be ignored).

The problem then turns into topological sort, though the details are slightly different from suspicious dog's answer, which misses a couple of cases.

Consider a graph of 26 nodes, labeled A through Z. Each pair of adjacent words contributes at most one directed edge to the partial ordering, determined by the first position at which the two words differ. For example, with the two words ABCEF and ABRKS in that order, the first difference is in the third character, so we need sigma(C) < sigma(R), where sigma is the substitution, and we add the edge C -> R.
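As a small illustration of how one edge is extracted from two words (the function name `first_difference` is an assumption, not part of the solution further down):

```python
def first_difference(w1, w2):
    """Return (c1, c2) at the first index where the words differ,
    or None if one word is a prefix of the other (or they are equal)."""
    for c1, c2 in zip(w1, w2):
        if c1 != c2:
            return c1, c2
    return None

print(first_difference('ABCEF', 'ABRKS'))  # ('C', 'R')
```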

The result can be obtained by doing a topological sort on this graph, and substituting A for the first node in the ordering, B for the second, etc.
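A compact way to try this step is the standard-library `graphlib` module (Python 3.9+), used here in place of a hand-written DFS. This is a sketch under the assumption that the constraints are already available as (low, high) pairs; `substitution_from_constraints` is a hypothetical name:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
import string

def substitution_from_constraints(constraints):
    """constraints: iterable of (low, high) pairs meaning sigma(low) < sigma(high)."""
    # TopologicalSorter expects a mapping node -> set of predecessors
    predecessors = {letter: set() for letter in string.ascii_uppercase}
    for low, high in constraints:
        predecessors[high].add(low)
    # static_order() raises graphlib.CycleError if the graph has a cycle
    order = list(TopologicalSorter(predecessors).static_order())
    # first node in the ordering becomes A, the second B, and so on
    return {letter: string.ascii_uppercase[i] for i, letter in enumerate(order)}

# constraints from PINEAPPLE, BANANA, ARTICHOKE, TOMATO: P < B < A < T
sigma = substitution_from_constraints([('P', 'B'), ('B', 'A'), ('A', 'T')])
print(sigma['P'], sigma['B'], sigma['A'], sigma['T'])
```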

Note that this also tells us exactly when the problem is impossible to solve: when two identical words are not adjacent (identical words must form a contiguous "cluster"), when one word is a prefix of another but comes after it, or when the graph has a cycle, so that no topological sort exists.
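These conditions can be turned into a yes/no check. The sketch below assumes comparing adjacent words only, and again leans on `graphlib` (Python 3.9+) for cycle detection; `is_sortable` is a made-up name. Note that non-adjacent duplicates such as AB, AA, AB show up as a cycle (B < A and A < B):

```python
from graphlib import TopologicalSorter, CycleError  # Python 3.9+

def is_sortable(words):
    """Return True if some letter substitution puts `words` in order."""
    predecessors = {}
    for w1, w2 in zip(words, words[1:]):
        for c1, c2 in zip(w1, w2):
            if c1 != c2:
                # constraint c1 < c2: c1 is a predecessor of c2
                predecessors.setdefault(c1, set())
                predecessors.setdefault(c2, set()).add(c1)
                break
        else:
            if len(w1) > len(w2):
                return False  # w2 is a proper prefix of w1 but comes after it
    try:
        list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return False
    return True

print(is_sortable(['AB', 'AA', 'AB']))
print(is_sortable(['PINEAPPLE', 'BANANA', 'ARTICHOKE', 'TOMATO']))
```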

Here is a fully functional solution in Python, complete with detection of when a particular instance of the problem is unsolvable.

def topoSort(N, adj):
    stack = []
    visited = [False for _ in range(N)]
    current = [False for _ in range(N)]

    def dfs(v):
        if current[v]: return False # there's a cycle!
        if visited[v]: return True
        visited[v] = current[v] = True
        for x in adj[v]:
            if not dfs(x):
                return False
        current[v] = False
        stack.append(v)
        return True

    for i in range(N):
        if not visited[i]:
            if not dfs(i):
                return None

    return list(reversed(stack))

def solve(wordlist):
    N = 26
    adj = [set([]) for _ in range(N)] # adjacency list
    for w1, w2 in zip(wordlist[:-1], wordlist[1:]):
        idx = 0
        while idx < len(w1) and idx < len(w2):
            if w1[idx] != w2[idx]: break
            idx += 1
        else:
            # no differences found between the words
            if len(w1) > len(w2):
                return None
            continue

        c1, c2 = w1[idx], w2[idx]
        # we want c1 < c2 after the substitution
        adj[ord(c1) - ord('A')].add(ord(c2) - ord('A'))

    li = topoSort(N, adj)
    if li is None:
        # the constraint graph has a cycle, so no substitution exists
        return None
    sub = {}
    for i in range(N):
        sub[chr(ord('A') + li[i])] = chr(ord('A') + i)
    return sub

def main():
    words = ['PINEAPPLE', 'BANANA', 'ARTICHOKE', 'TOMATO']
    print('Before: ' + ' '.join(words))
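    sub = solve(words)
    if sub is None:
        # solve() signals an unsolvable instance by returning None
        print('No substitution sorts this list')
        return
    nwords = [''.join(sub[c] for c in w) for w in words]
    print('After : ' + ' '.join(nwords))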

if __name__ == '__main__':
    main()

EDIT: The time complexity of this solution is a provably-optimal O(S), where S is the length of the input. Thanks to suspicious dog for this; the original time complexity was O(N^2 L).

Many thanks for the counterexample and correction! I wouldn't have been able to figure it out myself. Is it strictly necessary to compare every word with every other word after, or will comparing each word to the next adjacent word work as well? Is there an example where this would fail? — , Nov 25 '16 at 06:32
Yep, that's right! That's the `O(NL)` time algorithm. It's slightly more complicated to implement, though. — Eric Zhang, Nov 25 '16 at 06:47
Can't you just replace the nested i,j loop with a single loop `for i in range(len(wordlist) - 1)` and use `w1, w2 = wordlist[i], wordlist[i+1]`, or am I misunderstanding? — , Nov 25 '16 at 06:56
Sorry, ignore that last comment. Yeah, you're completely right about that! That's a very simple way to improve the time complexity significantly. — Eric Zhang, Nov 26 '16 at 16:19

score 1 · Answer 2 · edited May 23 '17 at 11:59

Update: the original analysis was wrong and failed on some class of test cases, as pointed out by Eric Zhang.

I believe this can be solved with a form of topological sort. Your initial list of words defines a partial order or a directed graph on some set of letters. You wish to find a substitution that linearizes this graph of letters. Let's use one of your non-trivial examples:

P A R K O V I S T E
P A R A D O N T O Z A
P A D A K
A B B A
A B E C E D A
A B S I N T

Let x <* y indicate that substitution(x) < substitution(y) for some letters (or words) x and y. We want word1 <* word2 <* word3 <* word4 <* word5 <* word6 overall, but in terms of letters, we just need to look at each pair of adjacent words and find the first pair of differing characters in the same column:

K <* A  (from PAR[K]OVISTE <* PAR[A]DONTOZA)
R <* D  (from PA[R]ADONTOZA <* PA[D]AK)
P <* A  (from [P]ADAK <* [A]BBA)
B <* E  (from AB[B]A <* AB[E]CEDA)
E <* S  (from AB[E]CEDA <* AB[S]INT)
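The five constraints above can be reproduced with a few lines. This is only a sketch of the extraction step, not the full solution:

```python
words = 'PARKOVISTE PARADONTOZA PADAK ABBA ABECEDA ABSINT'.split()

constraints = []
for word1, word2 in zip(words, words[1:]):
    # next() picks the first column where the two words disagree, if any
    pair = next(((a, b) for a, b in zip(word1, word2) if a != b), None)
    if pair is not None:
        constraints.append(pair)

for low, high in constraints:
    print(low, '<*', high)
```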

If we find no mismatched letters, then there are 3 cases:

1. word 1 and word 2 are the same
2. word 1 is a prefix of word 2
3. word 2 is a prefix of word 1

In case 1 and 2, the words are already in lexicographic order, so we don't need to perform any substitutions (although we might) and they add no extra constraints that we need to adhere to. In case 3, there is no substitution at all that will fix this (think of ["DOGGO", "DOG"]), so there's no possible solution and we can quit early.
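The case analysis can be written out as a small helper. The name `classify_pair` and the returned labels are purely illustrative:

```python
def classify_pair(word1, word2):
    """Classify an adjacent pair of words according to the cases above."""
    if any(a != b for a, b in zip(word1, word2)):
        return 'mismatch'          # yields a letter constraint instead
    if len(word1) == len(word2):
        return 'same'              # case 1: no constraint
    if len(word1) < len(word2):
        return 'word1 is prefix'   # case 2: no constraint
    return 'word2 is prefix'       # case 3: no solution possible

print(classify_pair('DOG', 'DOGGO'))
print(classify_pair('DOGGO', 'DOG'))
```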

Otherwise, we build the directed graph corresponding to the partial ordering information we obtained and perform a topological sort. If the sorting process indicates that no linearization is possible, then there is no solution for sorting the list of words. Otherwise, you get back something like:

P <* K <* R <* B <* E <* A <* D <* S

Depending on how you implement your topological sort, you might get a different linear ordering. Now you just need to assign each letter a substitution that respects this ordering and is itself sorted alphabetically. A simple option is to pair the linear ordering with itself sorted alphabetically, and use that as the substitution:

P <* K <* R <* B <* E <* A <* D <* S
|    |    |    |    |    |    |    |
A <  B <  D <  E <  K <  P <  R <  S

But you could implement a different substitution rule if you wish.
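To see the pairing in action, here is a short sketch that hard-codes the example linearization from above (a real run would take it from the topological sort):

```python
ordering = list('PKRBEADS')  # the example linearization, written out by hand
substitutions = dict(zip(ordering, sorted(ordering)))

words = 'PARKOVISTE PARADONTOZA PADAK ABBA ABECEDA ABSINT'.split()
# letters without a constraint keep their original value
subbed = [''.join(substitutions.get(c, c) for c in w) for w in words]
print(subbed)
print(subbed == sorted(subbed))
```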

Here's a proof-of-concept in Python:

import collections
import itertools

# a pair of outgoing and incoming edges
Edges = collections.namedtuple('Edges', 'outgoing incoming')
# a mapping from nodes to edges
Graph = lambda: collections.defaultdict(lambda: Edges(set(), set()))

def substitution_sort(words):
    graph = build_graph(words)

    if graph is None:
        return None

    ordering = toposort(graph)

    if ordering is None:
        return None

    # create a substitution that respects `ordering`
    substitutions = dict(zip(ordering, sorted(ordering)))

    # apply substitutions
    return [
        ''.join(substitutions.get(char, char) for char in word)
        for word in words
    ]

def build_graph(words):
    graph = Graph()

    # loop over every pair of adjacent words and find the first
    # pair of corresponding characters where they differ
    for word1, word2 in zip(words, words[1:]):
        for char1, char2 in zip(word1, word2):
            if char1 != char2:
                break
        else: # no differing characters found...

            if len(word1) > len(word2):
                # ...but word2 is a prefix of word1 and comes after;
                # therefore, no solution is possible
                return None
            else:
                # ...so no new information to add to the graph
                continue

        # add edge from char1 -> char2 to the graph
        graph[char1].outgoing.add(char2)
        graph[char2].incoming.add(char1)

    return graph

def toposort(graph):
    "Kahn's algorithm; returns None if graph contains a cycle"
    result = []
    working_set = {node for node, edges in graph.items() if not edges.incoming}

    while working_set:
        node = working_set.pop()
        result.append(node)
        outgoing = graph[node].outgoing

        while outgoing:
            neighbour = outgoing.pop()
            neighbour_incoming = graph[neighbour].incoming
            neighbour_incoming.remove(node)

            if not neighbour_incoming:
                working_set.add(neighbour)

    if any(edges.incoming or edges.outgoing for edges in graph.values()):
        return None
    else:
        return result

def print_all(items):
    for item in items:
        print(item)
    print()

def test():    
    test_cases = [
        ('PINEAPPLE BANANA ARTICHOKE TOMATO', True),
        ('ABC ABB ABD', True),
        ('AB AA AB', False),
        ('PARKOVISTE PARADONTOZA PADAK ABBA ABECEDA ABSINT', True),
        ('AA AB CA', True),
        ('DOG DOGGO DOG DIG BAT BAD', False),
        ('DOG DOG DOGGO DIG BIG BAD', True),
    ]

    for words, is_sortable in test_cases:
        words = words.split()
        print_all(words)

        subbed = substitution_sort(words)

        if subbed is not None:
            assert subbed == sorted(subbed), subbed
            print_all(subbed)
        else:
            print('<no solution>')
            print()

        print('expected solution?', 'yes' if is_sortable else 'no')
        print()

if __name__ == '__main__':
    test()

Now, it's not ideal--for example, it still performs a substitution even if the original list of words is already sorted--but it appears to work. I can't formally prove it works though, so if you find a counter-example, please let me know!

Your answer fails for the test case `AA AB CA`. See https://repl.it/E762/0 — Eric Zhang, Nov 25 '16 at 05:22

AlphaQ · Answer 3 · 2016-11-24T19:00:02.777

0

Extract all the first letter of each word in a list. (P,B,A,T)
Sort the list. (A,B,P,T)
Replace every occurrence of each word's first letter, in all words, with the letter at the same position in the sorted list.

Replace P(Pineapple) from all words with A.

Replace B from all words with B.

Replace A from all words with P.

Replace T from all words with T.

This will give you your intended result.

Edit:

Compare two adjacent strings. If the first is greater than the second, find the first position at which they differ, swap those two characters, and apply the swap to all words.
Repeat this for the entire list like in bubble sort.

Example -

ABC < ABB

First occurrence of character mismatch is at 3rd position. So we swap all C's with B's.

edited Nov 24 '16 at 19:00

answered Nov 24 '16 at 18:45


Consider the list [ABC, ABB, ABD]. Your method only solves the first character, not the whole words. – FigsHigs Nov 24 '16 at 18:47
@FigsHigs The edited answer is generalized for all strings. – AlphaQ Nov 24 '16 at 19:05
I will try to turn it into code. Another question: how do you know that the list can be sorted? (i.e. whether such a substitution even exists) – FigsHigs Nov 24 '16 at 19:10
You can know in the first iteration of the list only when all the string compare values return 0 i.e., the strings are equal. Or when a string is less than (returns -1) the next string value. If this occurs for the entire list, then the list is already sorted. – AlphaQ Nov 24 '16 at 19:12
That's the finish state, but what if it's not sortable? How would I know? – FigsHigs Nov 24 '16 at 19:19
That's the thing about sorted lists. How do we know if the list is sortable or not? When the list is already sorted and needs no further sorting. You can just set a flag variable and set it to `false` when one element is sorted. The rest of the technical bit is mentioned in the previous comment. – AlphaQ Nov 24 '16 at 19:22
But we know for certain that some lists are unsortable, for example this: AB, AA, AB – FigsHigs Nov 24 '16 at 19:25
Well, these are special cases because the list is sorted and after sorting operation it is same as the previous list. So, it is not possible with this algorithm to realize before sorting, whether the sorting will be fruitful or not. – AlphaQ Nov 24 '16 at 19:28
That's no problem. After pseudo-sorting, just check whether the list is really sorted. That does the trick. The problem is that the sorting itself doesn't work. Consider: PARKOVISTE, PARADONTOZA, PADAK, ABBA, ABECEDA, ABSINT. The result from the program is that it is not sortable (because it's not sorted after the first pseudo-sorting). But a solution does exist: BCFAQVLSTH, BCFCGQPTQZC, BCGCA, CDDC, CDHEHGC, CDSLPT. It doesn't work even on the example with fruit mentioned in the question. – FigsHigs Nov 24 '16 at 19:36
It will give the result. You have to reiterate the list until the condition where no swapping of characters occur. – AlphaQ Nov 24 '16 at 19:40
But if the list is unsortable, then the iterations will go on forever. Multiple iterations didn't work (I tried sorting the fruit example 500 times). – FigsHigs Nov 24 '16 at 19:45
