Here's an approach which doesn't use DFAs.
I considered the problem of just testing whether there is a string of a given length `k` that satisfies every pattern, and returning one if there is. This is a slightly simpler problem, but we can find the shortest string satisfying every pattern by testing increasing values of `k` until we find a solution (or until `k` exceeds some upper bound on the length of a shortest solution).
My solution isn't fast enough to solve the larger examples within 10-20 seconds, but perhaps the idea will be useful anyway. I used the Z3 theorem prover in Python, but there are bindings for many languages.
Essentially, the idea is to define `k` formal variables representing the letters in the string, then build a huge logical expression encoding the patterns, and use the Z3 solver to search for a solution (or verify that no solution exists).
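There are a few extra tricks to improve efficiency. We can start by simplifying the patterns; it's better to have `?`s before `*`s to reduce the branching factor, and any sequence of two or more `*`s is redundant, so it can be collapsed to a single `*`. We can also work out the minimum possible length of a solution based on the non-`*` symbols in the patterns: each of them consumes exactly one character, so the largest such count over all patterns is a lower bound on `k`.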
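As a quick illustration of these two tricks, here is a small standalone sketch. It uses a regular expression instead of repeated string replacement, so it is an alternative formulation rather than the code from this answer; the names `normalise_wildcards` and `length_lower_bound` are made up for the example.

```python
import re


def normalise_wildcards(pattern):
    """Rewrite each run of wildcards as all its '?'s followed by at most one '*'."""
    def reorder(match):
        run = match.group()
        return '?' * run.count('?') + ('*' if '*' in run else '')
    return re.sub(r'[?*]+', reorder, pattern)


def length_lower_bound(patterns):
    """Every non-'*' symbol consumes one character, so take the largest count."""
    return max(len(p.replace('*', '')) for p in patterns)


print(normalise_wildcards('a**?*?b'))              # a??*b
print(length_lower_bound(['a*a*a*', '*b*b*b']))    # 3
```

In the first call the run `**?*?` contains two `?`s and at least one `*`, so it becomes `??*`; the matched language is unchanged, but there is only one `*` left to branch on.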
In principle the same idea could be used to test for strings of length <= `k`, optimising for the shortest one, by introducing a `$` symbol into the alphabet, adding the constraint that `$` cannot appear before any symbol other than `$`, and then solving the optimisation problem to maximise the number of `$` symbols. I did not try this; there are also some other easy improvements that could be made, but I tried to keep it simple as a proof of concept.
My implementation is below. The main issue affecting performance is that the logical expression's size is exponential in the number of `*`s in a given pattern.
```python
from z3 import And, Int, Or, Solver, sat


def simplify_pattern(p):
    # Move each '?' ahead of an adjacent '*' and collapse runs of '*',
    # repeating until the pattern stops changing.
    q = ''
    while p != q:
        q = p
        p = p.replace('*?', '?*').replace('**', '*')
    return p


def min_length(p):
    # Every symbol except '*' consumes exactly one character.
    return len(p.replace('*', ''))


def solve(patterns, max_length):
    patterns = [simplify_pattern(p) for p in patterns]
    m = max(map(min_length, patterns))
    # Try each length in turn, starting from the lower bound.
    for k in range(m, max_length + 1):
        s = solve_for_length(patterns, k)
        if s is not None:
            return s
    return None


def solve_for_length(patterns, k):
    alphabet = sorted(set(''.join(patterns)) - set('?*'))
    # One integer variable per position, holding an index into the alphabet.
    vs = [Int('v' + str(i)) for i in range(k)]
    constraints = [And(0 <= v, v < len(alphabet)) for v in vs]
    for p in patterns:
        cs = constraints_for_pattern(p, vs, alphabet)
        constraints.extend(cs)
    s = Solver()
    s.add(constraints)
    if s.check() == sat:
        m = s.model()
        idx = [m.eval(v).as_long() for v in vs]
        return ''.join(alphabet[i] for i in idx)
    else:
        return None


def constraints_for_pattern(pattern, vs, alphabet):
    # Consume the fixed prefix: each letter or '?' uses up one variable.
    while pattern and pattern[0] != '*':
        if not vs:
            # output string shorter than pattern
            yield False
            return
        if pattern[0] != '?':
            yield vs[0] == alphabet.index(pattern[0])
        pattern = pattern[1:]
        vs = vs[1:]
    if pattern == '*':
        pass
    elif pattern:
        # pattern starts with '*' but != '*'
        # Branch on how many variables the leading '*' absorbs.
        p = pattern[1:]
        yield Or(*[
            And(*constraints_for_pattern(p, vs[i:], alphabet))
            for i in range(len(vs) - min_length(p) + 1)
        ])
    elif vs:
        # output string longer than pattern
        yield False
```
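To get a feel for how quickly the expression grows, the following sketch counts the innermost alternatives that the recursion above would expand for one pattern over `n` variables, without building any Z3 terms. The function `count_branches` is a hypothetical helper written for this illustration; it mirrors the recursion's structure and uses memoisation only so that the count itself is cheap to compute.

```python
from functools import lru_cache


def min_length(p):
    return len(p.replace('*', ''))


@lru_cache(maxsize=None)
def count_branches(pattern, n):
    """Count the innermost alternatives expanded for `pattern` over n variables."""
    # Consume the fixed prefix (letters and '?'), one variable each.
    while pattern and pattern[0] != '*':
        if n == 0:
            return 1  # a single False leaf
        pattern, n = pattern[1:], n - 1
    if pattern in ('', '*'):
        return 1
    # A leading '*' may absorb i variables; recurse on the rest for each i.
    rest = pattern[1:]
    return sum(count_branches(rest, n - i)
               for i in range(n - min_length(rest) + 1))


for stars in range(1, 6):
    pattern = '*a' * stars + '*'
    print(pattern, count_branches(pattern, 20))
```

For patterns of this shape the count is the binomial coefficient C(n, s), where s is the number of `*`-letter pairs, so for example `count_branches('*a*a*', 5)` is 10 and `count_branches('*a*a*a*', 6)` is 20. Each extra `*` multiplies the count by a factor on the order of `n`.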
Examples:
```python
>>> solve(['a?', '?b'], 2)
'ab'
>>> solve(['a*', '*b'], 2)
'ab'
>>> solve(['a*', '?b*', '*c'], 3)
'abc'
>>> solve(['a*a*a*', '*b*b*b'], 5) is None
True
>>> solve(['a*a*a*', '*b*b*b'], 6)
'ababab'
>>> solve(['*a*b*', '*c*d*', '*e*?*'], 10)
'eabcd'
>>> solve(['?a*b', 'a*b*', '*a??a*'], 10)
'aaaab'
>>> mini_5 = [
...     "*i*o*?*?*u?*??*e?o*e*a*a*i*ee*",
...     "*e*?ue*o*i*?*e*u*i*?*oa?*??*",
...     "*?oi*i??uu*a*iu*",
...     "*o*e*ea?*eu*?e*",
...     "*u*?oe*u*e?*e?*",
... ]
>>> solve(mini_5, 33)  # ~20 seconds on my machine
'ioiiueuueaoeiueuiaoaiee'
```
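Results like these can be checked independently of Z3. The sketch below uses the standard library's `fnmatch.fnmatchcase`, whose `?` and `*` have the same meaning as in these patterns; it assumes the patterns contain only letters, `?` and `*` (in particular no `[`, which `fnmatch` treats specially). The helper name `satisfies_all` is made up for the example.

```python
from fnmatch import fnmatchcase


def satisfies_all(s, patterns):
    """True if the whole string s matches every wildcard pattern."""
    return all(fnmatchcase(s, p) for p in patterns)


print(satisfies_all('ababab', ['a*a*a*', '*b*b*b']))  # True
print(satisfies_all('abab', ['a*a*a*', '*b*b*b']))    # False
```

This only confirms that a returned string matches; it says nothing about whether a shorter one exists.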