3

I have multiple wildcard expressions like:

*a*b*
*c*d*
*e*?*

where

  • * means 0 or more letters (they can be any letters, not necessarily the same)
  • ? means single occurance of any letter

I need to find shortest string matching those wildcard patterns. e.g. in the example above one of those strings would be:

abced

Also another example:

?a*b
a*b*
*a??a*

and result would be

aa?ab -> meaning "aaaab" OR "aabab" OR ...

I guess I need to use dynamic programming here. Tried some approaches but only got partial results. Any ideas?

Bojan Vukasovic
  • 2,054
  • 22
  • 43
  • 2
    *"`*` means 0 or more of any letter"* - that's a little ambiguous - does it have to be 0 or more of the same letter? – kaya3 Feb 05 '20 at 22:33
  • 6
    The first thing that comes to mind is to convert each wildcard expression into a [DFA](https://en.wikipedia.org/wiki/Deterministic_finite_automaton), take the Cartesian product of the DFAs, and use BFS to find the shortest path from the starting node to an accepting node. The DFA construction is simpler than the general case for regexes because you have no "unions", so you don't need to go via NFAs. You can do BFS without constructing the Cartesian product explicitly, since it's an [implicit graph](https://en.wikipedia.org/wiki/Implicit_graph). – kaya3 Feb 05 '20 at 22:39
  • no. it can be any letter (e.g. a,b,c,d,...) – Bojan Vukasovic Feb 05 '20 at 22:43
  • 4
    What @kaya3 said, but A* works better than BFS, and you *will* have to start with NFAs because of patterns like *a?a, or at least do something more than a greedy matcher. – Matt Timmermans Feb 05 '20 at 23:25
  • @MattTimmermans Yes, that's a good counterexample. Intuitively it seems like disallowing unions should make the problem substantially easier than it is for regexes, though, so I am guessing, but my guess is that some kind of direct construction of the DFA is possible. I don't have a clear idea how to do it. – kaya3 Feb 05 '20 at 23:36
  • Actually, on second thoughts there are regexes with "exponential blowup" that don't use unions, so e.g. the wildcard expression `*a?????` as a DFA needs O(2^n) states where n is the number of question-marks. So even if there is a direct construction, it won't be more efficient in the worst case than building an NFA and doing the powerset construction. – kaya3 Feb 05 '20 at 23:44
  • 1
    Yeah, this problem is NP-hard by reduction from, e.g., max independent set, so it'll be exponential in the worst case no matter what. – Matt Timmermans Feb 06 '20 at 00:11
  • I added DFA for second example in question. Is it possible to describe how I do BFS so I get to `aabab`? – Bojan Vukasovic Feb 06 '20 at 00:13
  • @Bojan each vertex in the implicit graph is a vector of state numbers for each DFA, that they could be in after all accepting the same string. There's an edge between two vertices if you can get from one to the other by matching one letter. Starting at the vector containing all the start states, BFS to find the shortest path to one containing all accepting states. – Matt Timmermans Feb 06 '20 at 00:18
  • So I take any of 3 graphs, follow 1st edge (e.g. top left has `?`), then go to other ones to see if they can change state (match `?` against `a` and `a` edges in the others)? Then I follow other edge and get `a`. But this one cannot be matched with `b` so I follow `*` edge on the other one. Something like this? – Bojan Vukasovic Feb 06 '20 at 00:23
  • also, i guess when i enter those * loops, there will be infinite loop without some condition when to stop? unless bfs find something shorter before? – Bojan Vukasovic Feb 06 '20 at 00:41
  • this is some kind of solution that works for shorter cases https://ideone.com/EOyJZy. I'm using A* search here but maybe my heuristics is too slow? The problem is that I can change heuristics for faster one, but then I do not get the shortest match. – Bojan Vukasovic Feb 06 '20 at 16:34
  • @kaya3 Any idea how could I improve code to support these larger inputs? – Bojan Vukasovic Feb 06 '20 at 21:39
  • *"Solve problem for last two pattern examples (meaning to have solution in 10-20 seconds and not to get out of memory exception) (one having 5 input patterns and other having 95"* - Please edit these examples into the question. – kaya3 Feb 10 '20 at 01:31
  • Isn't the solution to the second one the more general "aa?ab", or am I missing something? – Hans Olsson Feb 10 '20 at 14:08
  • @HansOlsson yes, that is the most general solution - but we need only shortest one (no matter is it most general or not). – Bojan Vukasovic Feb 10 '20 at 14:14
  • But the first solutions end with "?". Is that a question-mark, or any character. – Hans Olsson Feb 10 '20 at 14:18
  • @HansOlsson I fixed that one. Was mistake with `?` - not needed there. – Bojan Vukasovic Feb 10 '20 at 15:15

5 Answers5

4

Works for smaller examples. For larger ones, takes too much time. Maybe because I calculate transitions using arrays.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Queue;
import java.util.Set;

public class Safe {
  public static void main(String[] args) {
    String[] patterns = new String[3];
    patterns[0] = "?a*b";
    patterns[1] = "a*b*";
    patterns[2] = "*a??a*";
    String sol = new Safe().solution(patterns);

    System.out.println(sol);
    if(!sol.equals("aabab")){
      throw new RuntimeException("Not Equal!");
    }

    patterns = new String[3];
    patterns[0] = "*a*b*";
    patterns[1] = "*c*d*";
    patterns[2] = "*e*f*";
    System.out.println(new Safe().solution(patterns));

    patterns = new String[3];
    patterns[0] = "*p?qp?*bd*pd?q*";
    patterns[1] = "*qp*d?b*?p?d*";
    patterns[2] = "?*d?b???q*q?p*";
    sol = new Safe().solution(patterns);

    System.out.println(sol);
    if(!sol.equals("p?qpd?bdpdqqppd")){
      throw new RuntimeException("Not Equal!");
    }

    patterns = new String[2];
    patterns[0] = "*z*y*y*z*x*x*x*z*x*z*y*z*y*x*x*y*y*y*z*x*y*z*y*x*x*x*z*x*z*z*z*y*y*z*x*y*z*";
    patterns[1] = "*y*z*z*x*x*y*z*y*z*y*x*z*y*z*y*x*z*y*x*y*x*y*y*z*x*y*z*x*x*z*y*z*y*y*x*z*y*";

    sol = new Safe().solution(patterns);

    System.out.println(sol);
    if(!sol.equals("yzyyzxxyzyxzyxzyzyxzyxyxyyzxyzyxxxzyxzzzyyxzxyz")){
      throw new RuntimeException("Not Equal!");
    }
  }

  private boolean isItLetter(char c) {
    return c != '*' && c != '?';
  }

  private char match(String[] patterns, int[] indices) {
    boolean firstThrown = true;
    try {
      char matched = '*';
      for (int i = 0; i < patterns.length; i++) {
        char c = patterns[i].charAt(indices[i]);
        firstThrown = false;
        if (isItLetter(c)) {
          if (isItLetter(matched)) {
            if (matched != c) {
              //two different letters
              //so we cannot continue
              return Character.MIN_VALUE;
            }
          }
          matched = c;
        } else {
          //* or ?
          if (!isItLetter(matched)) {
            //so we do not overwrite ? with *
            if (c == '?') {
              matched = c;
            }
          }
        }
      }
      return matched;
    } catch (StringIndexOutOfBoundsException e) {
      //means we tried matching end of string
      //check if all are at the end
      if (firstThrown) {
        //check only if first thrown;
        //otherwise, one of patterns is not at the end
        for (int i = 1; i < patterns.length; i++) {
          if (indices[i] < patterns[i].length()) {
            return Character.MIN_VALUE;
          }
        }
      } else {
        return Character.MIN_VALUE;
      }
      //max when we reached the end
      return Character.MAX_VALUE;
    }
  }

  class Node{
    final String recognized;
    final int[] indices;

    public Node(String recognized, int[] indices) {
      this.recognized = recognized;
      this.indices = indices;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) return true;
      if (o == null || getClass() != o.getClass()) return false;
      Node node = (Node) o;
      return recognized.equals(node.recognized) &&
          Arrays.equals(indices, node.indices);
    }

    @Override
    public int hashCode() {
      int result = Objects.hash(recognized);
      result = 31 * result + Arrays.hashCode(indices);
      return result;
    }
  }

  Queue<Node> queue = new ArrayDeque<>();
  Set<Node> knownStates = new HashSet<>();

  public String solution(String[] patterns){
    List<int[]> startingIndices = generateStartingIndices(patterns, new int[patterns.length]);

    for(int[] si : startingIndices){
      Node nextNode = new Node("", si);
      queue.offer(nextNode);
    }

    String result = null;
    while (result == null) {
      result = solution(patterns, queue.poll());
    }
    return result;
  }

  private List<int[]> generateStartingIndices(String[] patterns, int[] indices){
    List<int[]> generated = Collections.singletonList(new int[patterns.length]);
    for(int i=0; i<patterns.length; i++){
      char currentChar = patterns[i].charAt(indices[i]);
      List<int[]> replaceGenerated = new ArrayList<>();
      if(currentChar == '*'){
        for(int[] gi : generated){
          gi[i] = indices[i]+1;
          replaceGenerated.add(gi);
        }
        for(int[] gi : generated){
          int[] gin = gi.clone();
          gin[i] = indices[i];
          replaceGenerated.add(gin);
        }
      }
      else{
        for(int[] gi : generated){
          gi[i] = indices[i];
          replaceGenerated.add(gi);
        }
      }
      generated = replaceGenerated;
    }
    return generated;
  }

  private List<int[]> generateNextIndices(String[] patterns, int[] indices){
    List<int[]> generated = Collections.singletonList(new int[patterns.length]);
    for(int i=0; i<patterns.length; i++){
      char currentChar = patterns[i].charAt(indices[i]);
      char nextChar = Character.MIN_VALUE;
      if(indices[i]+1 < patterns[i].length()){
        nextChar = patterns[i].charAt(indices[i]+1);
      }
      List<int[]> replaceGenerated = new ArrayList<>();
      if(currentChar == '*'){
        //or next index
        for(int[] gi : generated){
          gi[i] = indices[i]+1; //short it first so we first test that case
          replaceGenerated.add(gi);
        }
        //same index or
        for(int[] gi : generated){
          int[] gin = gi.clone();
          gin[i] = indices[i];
          replaceGenerated.add(gin);
        }
      }
      else{
        //some character
        if(nextChar=='*'){
          //or if * we can skip
          for(int[] gi : generated){
            gi[i] = indices[i]+2; //skip first so we check that shorter case
            replaceGenerated.add(gi);
          }
        }
        //we can go next or
        for(int[] gi : generated){
          int[] gin = gi.clone();
          gin[i] = indices[i]+1;
          replaceGenerated.add(gin);
        }
      }
      generated = replaceGenerated;
    }
    return generated;
  }

  public String solution(String[] patterns, Node node) {
    char matched = match(patterns, node.indices);

    if (matched == Character.MAX_VALUE) {
      //we reached the end
      return node.recognized;
    }
    if (matched == Character.MIN_VALUE) {
      //impossible to match
      return null;
    }
    if(matched == '*'){
      //all stars
      return null;
    }

    List<int[]> nextIndices = generateNextIndices(patterns, node.indices);

    for(int[] ni : nextIndices){
        Node nextNode = new Node(node.recognized + matched, ni);
        if(notKnownState(nextNode)) {
          queue.offer(nextNode);
        }
    }

    return null;
  }

  private boolean notKnownState(Node node) {
    if(knownStates.contains(node)){
      return false;
    }
    knownStates.add(node);
    return true;
  }
}
Bojan Vukasovic
  • 2,054
  • 22
  • 43
3

Here's an approach which doesn't use DFAs.

I considered the problem of just testing whether there is a string of a given length k satisfies every pattern, and returning one if there is one. This is a slightly simpler problem, but we can find the shortest string satisfying every pattern by testing for increasing k until we find a solution (or until k exceeds some upper bound on the length of a shortest solution).

My solution isn't fast enough to solve the larger examples within 10-20 seconds, but perhaps the idea will be useful anyway. I used the Z3 theorem prover in Python, but there are bindings for many languages.


Essentially, the idea is to define k formal variables representing the letters in the string, then build a huge logical expression encoding the patterns, and use the Z3 solver to search for a solution (or verify that no solution exists).

There are a few extra tricks to improve efficiency. We can start by simplifying the patterns; it's better to have ?s before *s to reduce the branching factor, and any sequence of two or more *s is redundant. We can also work out the minimum possible length of a solution based on the non-* symbols in the patterns.

In principle the same idea could be used to test for strings of length <= k, optimising for the shortest one, by introducing a $ symbol into the alphabet, adding the constraint that $ cannot appear before any symbol other than $, and then solving the optimisation problem to maximise the the number of $ symbols. I did not try this; there are also some other easy improvements that could be made, but I tried to keep it simple as a proof of concept.

My implementation is below. The main issue affecting performance is that the logical expression's size is exponential in the number of *s in a given pattern.

from z3 import *

def simplify_pattern(p):
    q = ''
    while p != q:
        q = p
        p = p.replace('*?', '?*').replace('**', '*')
    return p

def min_length(p):
    return len(p.replace('*', ''))

def solve(patterns, max_length):
    patterns = [simplify_pattern(p) for p in patterns]
    m = max(map(min_length, patterns))
    for k in range(m, max_length+1):
        s = solve_for_length(patterns, k)
        if s is not None:
            return s
    return None

def solve_for_length(patterns, k):
    alphabet = sorted(set(''.join(patterns)) - set('?*'))
    vs = [Int('v' + str(i)) for i in range(k)]
    constraints = [And(0 <= v, v < len(alphabet)) for v in vs]
    for p in patterns:
        cs = constraints_for_pattern(p, vs, alphabet)
        constraints.extend(cs)

    s = Solver()
    s.add(constraints)
    if s.check() == sat:
        m = s.model()
        idx = [m.eval(v).as_long() for v in vs]
        return ''.join(alphabet[i] for i in idx)
    else:
        return None

def constraints_for_pattern(pattern, vs, alphabet):
    while pattern and pattern[0] != '*':
        if not vs:
            # output string shorter than pattern
            yield False
            return

        if pattern[0] != '?':
            yield vs[0] == alphabet.index(pattern[0])
        pattern = pattern[1:]
        vs = vs[1:]

    if pattern == '*':
        pass
    elif pattern:
        # pattern starts with '*' but != '*'
        p = pattern[1:]
        yield Or(*[
            And(*constraints_for_pattern(p, vs[i:], alphabet))
            for i in range(len(vs) - min_length(p) + 1)
        ])
    elif vs:
        # output string longer than pattern
        yield False

Examples:

>>> solve(['a?', '?b'], 2)
'ab'
>>> solve(['a*', '*b'], 2)
'ab'
>>> solve(['a*', '?b*', '*c'], 3)
'abc'
>>> solve(['a*a*a*', '*b*b*b'], 5) is None
True
>>> solve(['a*a*a*', '*b*b*b'], 6)
'ababab'
>>> solve(['*a*b*', '*c*d*', '*e*?*'], 10)
'eabcd'
>>> solve(['?a*b', 'a*b*', '*a??a*'], 10)
'aaaab'
>>> mini_5 = [
...     "*i*o*?*?*u?*??*e?o*e*a*a*i*ee*",
...     "*e*?ue*o*i*?*e*u*i*?*oa?*??*",
...     "*?oi*i??uu*a*iu*",
...     "*o*e*ea?*eu*?e*",
...     "*u*?oe*u*e?*e?*",
... ]
>>> solve(mini_5, 33) # ~20 seconds on my machine
'ioiiueuueaoeiueuiaoaiee'
kaya3
  • 47,440
  • 4
  • 68
  • 97
  • One question - how to conclude that 5 is min. length here `?a*b`, `a*b*`, `*a??a*`, since if I'm counting non stars it is `3`, `2` and `4` - where the min. length of intersected problem is `5`? I guess if I knew minimal length I could do DFS with backtrack in my examples? – Bojan Vukasovic Feb 10 '20 at 08:39
  • 1
    `min_length` will give 4; the 5 here is the `max_length` argument. The solver will try 4 first, find no solution, then try 5. `min_length` is just a weak lower bound. – kaya3 Feb 10 '20 at 08:40
  • ok. so it is actually where to start looking from param. for algorithm. – Bojan Vukasovic Feb 10 '20 at 08:42
0

There's a formal algorithm to get the optimal answer.

First convert each regex to a deterministic finite automaton (DFA). For example, you can use Thompson's construction to get an NFA and then the subset construction to get the DFA. But there are other, more direct methods.

Actually these aren't full regexes with alternation, so building DFAs is especially simple.

For this collection of DFAs, construct the "intersection" version of the cross-product DFA. This is a well-known algorithm. You'll end up with a DFA that accepts exactly the strings accepted by all the originals.

Then do breadth first search from the start state of that DFA to find the shortest accepted string (or verify that there are none).

Run time will be proportional to the product of sizes of DFAs constructed from original regexes. Those are linear in the regex size for regexes that appear in practice nearly all the time. But in weird pathological cases, they can be exponential in the regex size. Such is the price of optimality.

While this may all sound hairy, these algorithms are not complicated. If the number of regexes isn't too big, and they're not weird pathalogical cases, this method will be very practical. Otoh it's not something you'll gin up in an afternoon.

Gene
  • 46,253
  • 4
  • 58
  • 96
0

I also managed to find the library that does this stuff automatically. But even this one has a problem with some inputs (e.g. the input in snippet)! There must be some way to optimize for these kind of inputs?

<dependency>
  <groupId>dk.brics</groupId>
  <artifactId>automaton</artifactId>
  <version>1.12-1</version>
</dependency>

??????????????????????????
*a*n*
*i*x*
*q*v*
*u*l*
*p*n*
*j*c*
*j*h*
*q*h*
*p*w*
*p*v*
*r*s*
*j*x*
*i*j*
*t*h*
*u*x*
*i*l*
*k*s*
*u*j*
*p*o*
*p*r*
*a*z*
*t*f*
*b*c*
*e*g*
*l*z*
*d*c*
*u*g*
*l*g*
*r*z*
*y*b*
*g*c*
*t*h*
*w*f*
*i*e*
*d*g*
*r*m*
*e*y*
*q*n*
*p*n*
*y*a*
*q*n*
*l*j*
*n*v*
*p*b*
*h*m*
*r*b*
*p*i*
*u*b*
*e*z*
*d*b*
*p*a*
*s*d*
*d*z*
*k*x*
*o*f*
*s*g*
*o*l*
*t*g*
*p*v*
*j*z*
*y*n*
*o*b*
*k*g*
*i*d*
*c*v*
*q*m*
*e*k*
*w*j*
*i*f*
*i*t*
*i*b*
*i*k*
*p*w*
*a*x*
*p*z*
*r*v*
*y*c*
*i*b*
*n*v*
*g*v*
*u*k*
*w*i*
*e*f*
*l*s*
*t*v*
*y*d*
*p*e*
*h*b*
*s*f*
*o*x*
*i*z*
*q*e*
*r*c*
*k*h*
import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;

public class Safe {
  public static void main(String[] args) {
    String[] patterns = new String[3];
    patterns[0] = "?a*b";
    patterns[1] = "a*b*";
    patterns[2] = "*a??a*";

    //aaaab
    System.out.println(new Safe().solution(patterns));

    patterns = new String[3];
    patterns[0] = "*a*b*";
    patterns[1] = "*c*d*";
    patterns[2] = "*e*f*";

    //abcdef
    System.out.println(new Safe().solution(patterns));

    patterns = new String[3];
    patterns[0] = "*p?qp?*bd*pd?q*";
    patterns[1] = "*qp*d?b*?p?d*";
    patterns[2] = "?*d?b???q*q?p*";

    //p?qpd?bdpdqqppd
    //paqpdabdpdqqppd
    System.out.println(new Safe().solution(patterns));

    patterns = new String[2];
    patterns[0] = "*z*y*y*z*x*x*x*z*x*z*y*z*y*x*x*y*y*y*z*x*y*z*y*x*x*x*z*x*z*z*z*y*y*z*x*y*z*";
    patterns[1] = "*y*z*z*x*x*y*z*y*z*y*x*z*y*z*y*x*z*y*x*y*x*y*y*z*x*y*z*x*x*z*y*z*y*y*x*z*y*";

    //yzyyzxxxyzyzyxzyzyxzyxyxyyzxyzyxxxzxyzzzyyxzxyz
    System.out.println(new Safe().solution(patterns));
  }

  public String solution(String[] patterns){
    Automaton a = null;
    for(String pattern : patterns){
      RegExp r = convertToRegExp(pattern);
      if(a == null) {
        a = r.toAutomaton();
      }
      else{
        a = a.intersection(r.toAutomaton());
      }
    }
    return a.getShortestExample(true);
  }

  private RegExp convertToRegExp(String pattern){
    return new RegExp(replaceQuestionMarks(replaceStars(pattern)).trim());
  }

  private String replaceStars(String pattern){
    return pattern.replaceAll("\\*+", "[a-z]*");
  }

  private String replaceQuestionMarks(String pattern){
    return pattern.replace("?", "[a-z]{1}");
  }
}
Bojan Vukasovic
  • 2,054
  • 22
  • 43
0

OK, I even managed to create NFA, then convert to DFA and then to construct intersection of all DFAs and in the same time visit them using BFS. It works fast for 3 first test code samples, but last two seems to be too complex for algorithm to process. If i use A*, I can get to some solution, but problem here is that it is not shortest one!

import java.util.*;
import java.util.stream.Collectors;

public class Safe4 {
    public static void main(String[] args) {
        String[] patterns = new String[3];
        patterns[0] = "?a*b";
        patterns[1] = "a*b*";
        patterns[2] = "*a??a*";

        //aaaab
        System.out.println(new Safe4().solution(patterns));

        patterns = new String[3];
        patterns[0] = "*p?qp?*bd*pd?q*";
        patterns[1] = "*qp*d?b*?p?d*";
        patterns[2] = "?*d?b???q*q?p*";

        //ppqpdpbdpdqqppd
        System.out.println(new Safe4().solution(patterns));


        patterns = new String[2];
        patterns[0] = "*z*y*y*z*x*x*x*z*x*z*y*z*y*x*x*y*y*y*z*x*y*z*y*x*x*x*z*x*z*z*z*y*y*z*x*y*z*";
        patterns[1] = "*y*z*z*x*x*y*z*y*z*y*x*z*y*z*y*x*z*y*x*y*x*y*y*z*x*y*z*x*x*z*y*z*y*y*x*z*y*";

        //yzyyzxxxyzyzyxzyzyxzyxyxyyzxyzyxxxzxyzzzyyxzxyz
        System.out.println(new Safe4().solution(patterns));

        patterns = new String[5];
        patterns[0] = "*i*o*?*?*u?*??*e?o*e*a*a*i*ee*";
        patterns[1] = "*e*?ue*o*i*?*e*u*i*?*oa?*??*";
        patterns[2] = "*?oi*i??uu*a*iu*?*?a*u*ia*u*";
        patterns[3] = "*o*e*ea?*eu*?e?*ea*u*u?u*iu*";
        patterns[4] = "*u*?oe*u*e?*e??ou?*oa*?*i*?a?*";

        //eoiueoeaiueuueeeaouuiueoauiaaiuee
        System.out.println(new Safe4().solution(patterns));

        patterns = new String[95];
        patterns[0] = "??????????????????????????";
        patterns[1] = "*a*n*";
        patterns[2] = "*i*x*";
        patterns[3] = "*q*v*";
        patterns[4] = "*u*l*";
        patterns[5] = "*p*n*";
        patterns[6] = "*j*c*";
        patterns[7] = "*j*h*";
        patterns[8] = "*q*h*";
        patterns[9] = "*p*w*";
        patterns[10] = "*p*v*";
        patterns[11] = "*r*s*";
        patterns[12] = "*j*x*";
        patterns[13] = "*i*j*";
        patterns[14] = "*t*h*";
        patterns[15] = "*u*x*";
        patterns[16] = "*i*l*";
        patterns[17] = "*k*s*";
        patterns[18] = "*u*j*";
        patterns[19] = "*p*o*";
        patterns[20] = "*p*r*";
        patterns[21] = "*a*z*";
        patterns[22] = "*t*f*";
        patterns[23] = "*b*c*";
        patterns[24] = "*e*g*";
        patterns[25] = "*l*z*";
        patterns[26] = "*d*c*";
        patterns[27] = "*u*g*";
        patterns[28] = "*l*g*";
        patterns[29] = "*r*z*";
        patterns[30] = "*y*b*";
        patterns[31] = "*g*c*";
        patterns[32] = "*t*h*";
        patterns[33] = "*w*f*";
        patterns[34] = "*i*e*";
        patterns[35] = "*d*g*";
        patterns[36] = "*r*m*";
        patterns[37] = "*e*y*";
        patterns[38] = "*q*n*";
        patterns[39] = "*p*n*";
        patterns[40] = "*y*a*";
        patterns[41] = "*q*n*";
        patterns[42] = "*l*j*";
        patterns[43] = "*n*v*";
        patterns[44] = "*p*b*";
        patterns[45] = "*h*m*";
        patterns[46] = "*r*b*";
        patterns[47] = "*p*i*";
        patterns[48] = "*u*b*";
        patterns[49] = "*e*z*";
        patterns[50] = "*d*b*";
        patterns[51] = "*p*a*";
        patterns[52] = "*s*d*";
        patterns[53] = "*d*z*";
        patterns[54] = "*k*x*";
        patterns[55] = "*o*f*";
        patterns[56] = "*s*g*";
        patterns[57] = "*o*l*";
        patterns[58] = "*t*g*";
        patterns[59] = "*p*v*";
        patterns[60] = "*j*z*";
        patterns[61] = "*y*n*";
        patterns[62] = "*o*b*";
        patterns[63] = "*k*g*";
        patterns[64] = "*i*d*";
        patterns[65] = "*c*v*";
        patterns[66] = "*q*m*";
        patterns[67] = "*e*k*";
        patterns[68] = "*w*j*";
        patterns[69] = "*i*f*";
        patterns[70] = "*i*t*";
        patterns[71] = "*i*b*";
        patterns[72] = "*i*k*";
        patterns[73] = "*p*w*";
        patterns[74] = "*a*x*";
        patterns[75] = "*p*z*";
        patterns[76] = "*r*v*";
        patterns[77] = "*y*c*";
        patterns[78] = "*i*b*";
        patterns[79] = "*n*v*";
        patterns[80] = "*g*v*";
        patterns[81] = "*u*k*";
        patterns[82] = "*w*i*";
        patterns[83] = "*e*f*";
        patterns[84] = "*l*s*";
        patterns[85] = "*t*v*";
        patterns[86] = "*y*d*";
        patterns[87] = "*p*e*";
        patterns[88] = "*h*b*";
        patterns[89] = "*s*f*";
        patterns[90] = "*o*x*";
        patterns[91] = "*i*z*";
        patterns[92] = "*q*e*";
        patterns[93] = "*r*c*";
        patterns[94] = "*k*h*";

        //pqowieurytlaksjdhfgmznxbcv
        System.out.println(new Safe4().solution(patterns));
    }

    class NFA {
        Map<Integer, Map<Character, Set<Integer>>> table;
        int initState;
        int lastState;

        public NFA(Map<Integer, Map<Character, Set<Integer>>> table, int initState, int lastState) {
            this.table = table;
            this.initState = initState;
            this.lastState = lastState;
        }
    }

    /**
     * Create NFA from input; e.g. if input is "?a*b" - here we have alphabet of "a", "b", * - any character zero or more times (represented
     * in function as 'X') and ? (means also any character but has to be there once at least - 'X')
     * <p>
     * so for input "?a*b" this gives table
     * _STATE_|__a___b___X______________
     * 0   |  1   1   1
     * 1   |  2
     * 2   |  2  2,3  2
     * 3   |
     * <p>
     * here, 3 is FINAL state and 0 is start state
     */
    private NFA createNFA(Set<Character> chars, String p) {
        int state = 0;
        boolean wasStar = false;
        Map<Integer, Map<Character, Set<Integer>>> nfa = new HashMap<>();
        for (char c : p.toCharArray()) {
            if (c == '*') {
                Map<Character, Set<Integer>> m1 = nfa.computeIfAbsent(state, k -> new HashMap<>());
                for (char c1 : chars) {
                    m1.computeIfAbsent(c1, k -> new TreeSet<>()).add(state);
                }
                m1.computeIfAbsent('X', k -> new TreeSet<>()).add(state);
                wasStar = true;
                state--;
            } else if (c == '?') {
                if (wasStar) {
                    Map<Character, Set<Integer>> m1 = nfa.get(state);
                    for (char c1 : chars) {
                        m1.computeIfAbsent(c1, k -> new TreeSet<>()).add(state);
                    }
                    m1.computeIfAbsent('X', k -> new TreeSet<>()).add(state);
                }
                Map<Character, Set<Integer>> m1 = nfa.computeIfAbsent(state, k -> new HashMap<>());
                for (char c1 : chars) {
                    m1.computeIfAbsent(c1, k -> new TreeSet<>()).add(state + 1);
                }
                m1.computeIfAbsent('X', k -> new TreeSet<>()).add(state + 1);
                wasStar = false;
            } else {
                if (wasStar) {
                    nfa.get(state).get(c).add(state);
                }
                Map<Character, Set<Integer>> m1 = nfa.computeIfAbsent(state, k -> new HashMap<>());
                m1.computeIfAbsent(c, k -> new TreeSet<>()).add(state + 1);
                wasStar = false;
            }
            state++;
        }
        return new NFA(nfa, 0, state);
    }

    class DFA {
        Map<Integer, Map<Character, Integer>> table;
        int initState;
        Set<Integer> lastStates;

        public DFA(Map<Integer, Map<Character, Integer>> table, int initState, Set<Integer> lastStates) {
            this.table = table;
            this.initState = initState;
            this.lastStates = lastStates;
        }
    }

    /**
     * Convert NFA to DFA (nondeterministic finite automata to deterministic)
     * <p>
     * e.g. for NFA
     * <p>
     * _STATE_|__a___b___X______________
     * 0   |  1   1   1
     * 1   |  2
     * 2   |  2  2,3  2
     * 3   |
     * <p>
     * we convert to DFA
     * <p>
     * _STATE_|__a___b___X______________
     * 0   |  1   1   1
     * 1   |  2
     * 2   |  2   4   2
     * 4   |  2   4   2
     * <p>
     * here, 4 is FINAL state and 0 is start state
     */
    private DFA convertToDFA(Set<Character> chars, NFA nfa) {
        Set<Character> charsWithX = new TreeSet<>(chars);
        charsWithX.add('X');
        Map<Integer, Map<Character, Integer>> dfa = new HashMap<>();
        Queue<Integer> sq = new LinkedList<>();
        int newUsableState = nfa.lastState + 1;
        Set<Integer> lastStates = new TreeSet<>();
        Map<Integer, Set<Integer>> newStateToStates = new HashMap<>();
        Map<Set<Integer>, Integer> statesToNewState = new HashMap<>();
        sq.offer(0);
        Integer currS;
        while ((currS = sq.poll()) != null) {
            Map<Character, Integer> m1 = dfa.computeIfAbsent(currS, k -> new HashMap<>());
            if (currS <= nfa.lastState) {
                //state exists in nfa
                if (nfa.table.get(currS) != null) {
                    for (Map.Entry<Character, Set<Integer>> charToStates : nfa.table.get(currS).entrySet()) {
                        Integer newState;
                        if (charToStates.getValue().size() == 1) {
                            newState = charToStates.getValue().stream().findAny().get();
                            if (newState == nfa.lastState) {
                                lastStates.add(newState);
                            }
                        } else {
                            newState = newUsableState++;
                            newStateToStates.putIfAbsent(newState, charToStates.getValue());
                            statesToNewState.putIfAbsent(charToStates.getValue(), newState);
                            if (charToStates.getValue().contains(nfa.lastState)) {
                                lastStates.add(newState);
                            }
                        }
                        m1.putIfAbsent(charToStates.getKey(), newState);
                        if (!dfa.containsKey(newState) && !sq.contains(newState)) {
                            sq.offer(newState);
                        }
                    }
                } else {
                    //this is final state
                }
            } else {
                //new state
                Set<Integer> li = newStateToStates.get(currS);
                for (Character c : charsWithX) {
                    Set<Integer> states = new TreeSet<>();
                    for (Integer i : li) {
                        Map<Character, Set<Integer>> maybeFinal = nfa.table.get(i);
                        if (maybeFinal != null) {
                            Set<Integer> exisStat = maybeFinal.get(c);
                            if (exisStat != null) {
                                states.addAll(exisStat);
                            }
                        }
                    }
                    if (!states.isEmpty()) {
                        Integer newState;
                        if (states.size() == 1) {
                            newState = states.stream().findAny().get();
                            if (newState == nfa.lastState) {
                                lastStates.add(newState);
                            }
                        } else {
                            newState = statesToNewState.get(states);
                            if (newState == null) {
                                newState = newUsableState++;
                                newStateToStates.putIfAbsent(newState, states);
                                statesToNewState.putIfAbsent(states, newState);
                                if (states.contains(nfa.lastState)) {
                                    lastStates.add(newState);
                                }
                            }
                        }
                        m1.putIfAbsent(c, newState);
                        if (!dfa.containsKey(newState) && !sq.contains(newState)) {
                            sq.offer(newState);
                        }
                    }
                }
            }
        }
        return new DFA(dfa, 0, lastStates);
    }

    private Set<Character> chars(String p) {
        Set<Character> chars = new TreeSet<>();
        for (char c : p.toCharArray()) {
            if (c > 96 && c < 123) {
                chars.add(c);
            }
        }
        return chars;
    }

    class RecNodeMulti {
        final List<Integer> newState;
        final char rec;
        final RecNodeMulti parent;

        public RecNodeMulti(List<Integer> newState, char rec, RecNodeMulti parent) {
            this.newState = newState;
            this.rec = rec;
            this.parent = parent;
        }

        public String toRoot() {
            StringBuilder sb = new StringBuilder();
            RecNodeMulti rn = this;
            while (rn.rec != Character.MIN_VALUE) {
                sb.append(rn.rec);
                rn = rn.parent;
            }
            return sb.reverse().toString();
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (o == null || getClass() != o.getClass()) return false;
            RecNodeMulti that = (RecNodeMulti) o;
            return rec == that.rec &&
                    Objects.equals(newState, that.newState);
        }

        @Override
        public int hashCode() {
            return Objects.hash(newState, rec);
        }
    }

    RecNodeMulti root;

    /**
     * Intersects dfa1,dfa2,... dfaN and visit intersecting nodes one by one in BFS manner
     * <p>
     * e.g. for 2 DFAs like dfa1 representing "?a*b"
     * <p>
     * _STATE_|__a___b___X______________
     * 0   |  1   1   1
     * 1   |  2
     * 2   |  2   4   2
     * 4   |  2   4   2
     * <p>
     * and dfa2 representing "a*b*"
     * <p>
     * _STATE_|__a___b___X______________
     * 0   |  1
     * 1   |  1   3   1
     * 3   |  3   3   3
     * <p>
     * we FOLLOW BFS states representing representing "aa*b*"
     * <p>
     * _STATE_|__a___b___X______________
     * 5   |  6
     * 6   |  7
     * 7   |      8   7
     * 8   |      8   9
     * 9   |      8   9
     * <p>
     * where final state is 8
     */
    private String multiIntersectDFAWithBFS(Set<Character> charsSvi, List<DFA> dfaS) {
        Set<Character> charsSviWithX = new TreeSet<>(charsSvi);
        //charsSviWithX.add('X');
        Set<RecNodeMulti> visited = new HashSet<>();
        Queue<RecNodeMulti> sq = new LinkedList<>();
        root = new RecNodeMulti(dfaS.stream().map(x -> x.initState).collect(Collectors.toList()), Character.MIN_VALUE, null);
        sq.offer(root);
        RecNodeMulti currS;
        Map<List<Integer>, Integer> statesToNewState = new HashMap<>();
        int newUsableState = dfaS.stream().mapToInt(x -> x.table.keySet().stream().mapToInt(y -> y).max().getAsInt()).max().getAsInt() + 1;
        while ((currS = sq.poll()) != null) {

            List<Map<Character, Integer>> statemap = new ArrayList<>();
            for (int i = 0; i < dfaS.size(); i++) {
                statemap.add(dfaS.get(i).table.get(currS.newState.get(i)));
            }

            for (char c1 : charsSviWithX) {
                List<Integer> sIns = statemap.stream().map(x -> {
                    Integer in = x.get(c1);
                    if (in == null) {
                        return x.get('X');
                    }
                    return in;
                }).collect(Collectors.toList());

                if (sIns.stream().allMatch(Objects::nonNull)) {
                    Integer exisP = statesToNewState.get(sIns);
                    if (exisP == null) {
                        exisP = newUsableState++;
                        statesToNewState.put(sIns, exisP);
                        boolean allLast = true;
                        for (int i = 0; i < dfaS.size(); i++) {
                            if (!dfaS.get(i).lastStates.contains(sIns.get(i))) {
                                allLast = false;
                                break;
                            }
                        }
                        if (allLast) {
                            //newS3 is LAST STATE!!!!!!
                            return currS.toRoot() + c1;
                        }
                    } else {
                        //already known state; we do not want to go back
                        continue;
                    }
                    RecNodeMulti rnm = new RecNodeMulti(sIns, c1, currS);
                    if (!sq.contains(rnm) && !visited.contains(rnm)) {
                        visited.add(rnm);
                        sq.offer(rnm);
                    }
                }
            }
        }

        return null;
    }

    private String solution(String[] patterns) {

        List<DFA> dfaS = new ArrayList<>();
        Set<Character> charsSvi = new TreeSet<>();

        for (String p : patterns) {
            Set<Character> chars = chars(p);
            charsSvi.addAll(chars);
            NFA nfa = createNFA(chars, p);
            DFA dfa = convertToDFA(chars, nfa);
            dfaS.add(dfa);
        }

        return multiIntersectDFAWithBFS(charsSvi, dfaS);
    }
}
Bojan Vukasovic
  • 2,054
  • 22
  • 43
  • That 95 pattern example is going to kill a DFA construction because the number of possible subsets of partial matches is exponential in the number of patterns... – btilly Feb 13 '20 at 20:41