Update: the original analysis was wrong and failed on some class of test cases, as pointed out by Eric Zhang.
I believe this can be solved with a form of topological sort. Your initial list of words defines a partial order or a directed graph on some set of letters. You wish to find a substitution that linearizes this graph of letters. Let's use one of your non-trivial examples:
P A R K O V I S T E
P A R A D O N T O Z A
P A D A K
A B B A
A B E C E D A
A B S I N T
Let x <* y indicate that substitution(x) < substitution(y) for some letters (or words) x and y. We want word1 <* word2 <* word3 <* word4 <* word5 <* word6 overall, but in terms of letters, we just need to look at each pair of adjacent words and find the first pair of differing characters in the same column:
K <* A (from PAR[K]OVISTE <* PAR[A]DONTOZA)
R <* D (from PA[R]ADONTOZA <* PA[D]AK)
P <* A (from [P]ADAK <* [A]BBA)
B <* E (from AB[B]A <* AB[E]CEDA)
E <* S (from AB[E]CEDA <* AB[S]INT)
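As a rough sketch of this step on its own, the following generator collects one constraint per adjacent pair of words. The name first_mismatches is only illustrative, and pairs with no differing column are silently skipped here.

```python
def first_mismatches(words):
    """Yield (x, y) for each adjacent pair of words, where x and y are the
    first characters that differ in the same column (meaning x <* y)."""
    for word1, word2 in zip(words, words[1:]):
        for char1, char2 in zip(word1, word2):
            if char1 != char2:
                yield char1, char2
                break  # only the first differing column matters


words = "PARKOVISTE PARADONTOZA PADAK ABBA ABECEDA ABSINT".split()
constraints = list(first_mismatches(words))
print(constraints)
# [('K', 'A'), ('R', 'D'), ('P', 'A'), ('B', 'E'), ('E', 'S')]
```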
If we find no mismatched letters, then there are 3 cases:
- word 1 and word 2 are the same
- word 1 is a prefix of word 2
- word 2 is a prefix of word 1
In cases 1 and 2, the words are already in lexicographic order, so we don't need to perform any substitutions (although we might) and they add no extra constraints that we need to adhere to. In case 3, no substitution at all will fix the order (think of ["DOGGO", "DOG"]), so there's no possible solution and we can quit early.
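To make the three cases concrete, here is a small sketch that classifies a pair of adjacent words with no mismatched column. The function name and the returned labels are made up for illustration.

```python
def no_mismatch_case(word1, word2):
    """Classify an adjacent pair that has no differing column.

    Returns 'same', 'prefix-ok' (word1 is a prefix of word2) or
    'impossible' (word2 is a proper prefix of word1). Returns None when
    the words do differ in some column, so none of the cases applies.
    """
    if any(char1 != char2 for char1, char2 in zip(word1, word2)):
        return None
    if len(word1) == len(word2):
        return 'same'
    if len(word1) < len(word2):
        return 'prefix-ok'
    return 'impossible'


print(no_mismatch_case("DOG", "DOG"))    # same
print(no_mismatch_case("DOG", "DOGGO"))  # prefix-ok
print(no_mismatch_case("DOGGO", "DOG"))  # impossible
```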
Otherwise, we build the directed graph corresponding to the partial ordering information we obtained and perform a topological sort. If the sorting process indicates that no linearization is possible, then there is no solution for sorting the list of words. Otherwise, you get back something like:
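If you would rather not write the topological sort by hand, the standard library's graphlib module (Python 3.9+) can do it. This is only a sketch of that alternative: the linearize helper and its input format (pairs meaning x <* y) are my own assumptions, and the ordering it returns may differ from the one shown here.

```python
from graphlib import CycleError, TopologicalSorter


def linearize(constraints):
    """constraints: iterable of (x, y) pairs meaning x <* y.
    Returns one valid linear ordering of the letters, or None on a cycle."""
    sorter = TopologicalSorter()
    for before, after in constraints:
        # add(node, *predecessors): `before` must come ahead of `after`
        sorter.add(after, before)
    try:
        return list(sorter.static_order())
    except CycleError:
        return None


pairs = [('K', 'A'), ('R', 'D'), ('P', 'A'), ('B', 'E'), ('E', 'S')]
print(linearize(pairs))                     # some valid ordering of the 8 letters
print(linearize([('B', 'A'), ('A', 'B')]))  # None: B <* A and A <* B conflict
```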
P <* K <* R <* B <* E <* A <* D <* S
Depending on how you implement your topological sort, you might get a different linear ordering. Now you just need to assign each letter a substitute such that the substitutes, read along this ordering, are in alphabetical order. A simple option is to pair the linear ordering with itself sorted alphabetically, and use that as the substitution:
P <* K <* R <* B <* E <* A <* D <* S
| | | | | | | |
A < B < D < E < K < P < R < S
But you could implement a different substitution rule if you wish.
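For illustration, here is that pairing applied to the example, with the ordering above hard-coded. Letters that do not appear in the ordering (O, V, I and so on) are left unchanged, which keeps the substitution one-to-one because the ordered letters are only permuted among themselves.

```python
ordering = list("PKRBEADS")  # the linearization shown above, hard-coded

# pair each letter with the letter at the same rank in alphabetical order
substitutions = dict(zip(ordering, sorted(ordering)))

words = "PARKOVISTE PARADONTOZA PADAK ABBA ABECEDA ABSINT".split()
subbed = [
    ''.join(substitutions.get(char, char) for char in word)
    for word in words
]
print(subbed)
print(subbed == sorted(subbed))  # True
```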
Here's a proof-of-concept in Python:
import collections

# a pair of outgoing and incoming edges
Edges = collections.namedtuple('Edges', 'outgoing incoming')

# a mapping from nodes to edges
Graph = lambda: collections.defaultdict(lambda: Edges(set(), set()))


def substitution_sort(words):
    graph = build_graph(words)
    if graph is None:
        return None

    ordering = toposort(graph)
    if ordering is None:
        return None

    # create a substitution that respects `ordering`
    substitutions = dict(zip(ordering, sorted(ordering)))

    # apply substitutions (letters outside the graph stay unchanged)
    return [
        ''.join(substitutions.get(char, char) for char in word)
        for word in words
    ]


def build_graph(words):
    graph = Graph()

    # loop over every pair of adjacent words and find the first
    # pair of corresponding characters where they differ
    for word1, word2 in zip(words, words[1:]):
        for char1, char2 in zip(word1, word2):
            if char1 != char2:
                break
        else:  # no differing characters found...
            if len(word1) > len(word2):
                # ...but word2 is a prefix of word1 and comes after;
                # therefore, no solution is possible
                return None
            else:
                # ...so no new information to add to the graph
                continue

        # add edge from char1 -> char2 to the graph
        graph[char1].outgoing.add(char2)
        graph[char2].incoming.add(char1)

    return graph


def toposort(graph):
    """Kahn's algorithm; returns None if graph contains a cycle"""
    result = []
    # start with the nodes that have no incoming edges
    working_set = {node for node, edges in graph.items() if not edges.incoming}

    while working_set:
        node = working_set.pop()
        result.append(node)
        outgoing = graph[node].outgoing

        # remove each edge node -> neighbour (this consumes the graph)
        while outgoing:
            neighbour = outgoing.pop()
            neighbour_incoming = graph[neighbour].incoming
            neighbour_incoming.remove(node)

            if not neighbour_incoming:
                working_set.add(neighbour)

    # any edge left over must be part of a cycle
    if any(edges.incoming or edges.outgoing for edges in graph.values()):
        return None
    else:
        return result


def print_all(items):
    for item in items:
        print(item)
    print()


def test():
    test_cases = [
        ('PINEAPPLE BANANA ARTICHOKE TOMATO', True),
        ('ABC ABB ABD', True),
        ('AB AA AB', False),
        ('PARKOVISTE PARADONTOZA PADAK ABBA ABECEDA ABSINT', True),
        ('AA AB CA', True),
        ('DOG DOGGO DOG DIG BAT BAD', False),
        ('DOG DOG DOGGO DIG BIG BAD', True),
    ]

    for words, is_sortable in test_cases:
        words = words.split()
        print_all(words)

        subbed = substitution_sort(words)

        if subbed is not None:
            assert subbed == sorted(subbed), subbed
            print_all(subbed)
        else:
            print('<no solution>')
            print()

        print('expected solution?', 'yes' if is_sortable else 'no')
        print()


if __name__ == '__main__':
    test()
Now, it's not ideal--for example, it still performs a substitution even if the original list of words is already sorted--but it appears to work. I can't formally prove it works though, so if you find a counter-example, please let me know!
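If you want to hunt for counter-examples, one option is to compare the result against a brute-force check on tiny inputs. The sketch below is not part of the solution above: brute_force_sortable is a hypothetical helper, and it assumes the substitution is a one-to-one mapping among the letters that actually appear in the words. It tries every permutation, so it is only usable for a handful of distinct letters.

```python
from itertools import permutations


def brute_force_sortable(words):
    """Return True if some one-to-one substitution of the letters used in
    `words` leaves the list in sorted order. Factorial time: tiny inputs only."""
    letters = sorted(set(''.join(words)))
    source = ''.join(letters)
    for perm in permutations(letters):
        table = str.maketrans(source, ''.join(perm))
        subbed = [word.translate(table) for word in words]
        if subbed == sorted(subbed):
            return True
    return False


print(brute_force_sortable('AB AA AB'.split()))  # False
print(brute_force_sortable('AA AB CA'.split()))  # True
```

Any word list where this disagrees with whether substitution_sort returns a result would be a counter-example worth reporting.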