
I have sequences built from 0s and 1s. I want to measure their distance from a target string, but the target string is incomplete.

Here is an example of the data I have, where x is the target string and `[0]` means one or more occurrences of '0':

x = `11[0]1111[0]1111111[0]1[0]`; the length of x is fixed and equal to the length of y.

y1=11110111111000000101010110101010111

y2=01101000011100001101010101101010010

All y's have the same length.

It's easy to see that x can be interpreted as a set of strings, but this set could be very large. Maybe I simply need to sample from that set and take the average of the minimum edit distances, but that again is too big a computational problem.

I've tried to figure out an algorithm, but I'm stuck. Its steps look like this:

x - target string (the fuzzy one)

y - second string (fixed)

Cx1, Cy1 - numbers of ones in x and y

Gx1, Gy1 - lists of vectors; the length of each list is equal to the number of groups of ones in the given sequence

Gx1[i] - the i-th vector, Gx1[i] = (position of the first one in the i-th group of ones, length of the i-th group of ones)

If the lengths of Gx1 and Gy1 are the same, then we know how many ones to add to or remove from each group. But there's a problem: I don't know whether simply adding and removing ones gives the minimum distance.
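
The group lists described above could be computed along these lines. This is only a sketch: the function name `groups_of_ones` and the choice of 0-based start positions are assumptions made for illustration, not part of the original description.

```cpp
#include <string>
#include <utility>
#include <vector>

// Returns one (start index, length) pair for every maximal run of '1' in s.
// Start indices are 0-based positions in s (an assumption of this sketch).
std::vector<std::pair<int, int>> groups_of_ones(const std::string& s) {
  std::vector<std::pair<int, int>> groups;
  const int n = static_cast<int>(s.size());
  int i = 0;
  while (i < n) {
    if (s[i] != '1') {
      ++i;  // skip anything that is not a '1'
      continue;
    }
    const int start = i;
    while (i < n && s[i] == '1') {
      ++i;  // walk to the end of the current run
    }
    groups.emplace_back(start, i - start);
  }
  return groups;
}
```

Applied to a y string this gives Gy1 directly. Applied to x, the start positions refer to the pattern text including the `[0]` markers, so only the group lengths are directly comparable.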

Qbik
  • Two questions: (1) Do the 0s in x _always_ appear as `[0]`, or can it happen that a single `0` appears? (2) If, for example, x=`1[0]11`, and y=`100011`, would that be an exact match, i.e. edit distance zero? – jogojapan Apr 21 '12 at 11:47
  • yes that would be the exact match – Qbik Apr 21 '12 at 12:44
  • You've only stated that you want a measure of their distance. I take this to mean that you might be happy with any one of several kinds of edit distances, and you mention that the average minimum edit distance would be useful, but also would you be happy if an algorithm only told you the minimum minimum edit distance, or the maximum minimum edit distance? – Running Wild Apr 21 '12 at 15:44

2 Answers


Let (Q, Σ, δ, q0, F) be the target automaton, which accepts a regular language L ⊆ Σ*, and let w ∈ Σ* be the source string. You want to compute min_{x ∈ L} d(x, w), where d denotes Levenshtein distance.
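
For the pattern in the question, one possible way to obtain such an automaton is sketched here. The `Dfa` struct, the function name `build_dfa`, and the table layout (`delta[q][a]` with -1 for a missing transition) are assumptions of this sketch. It also assumes that a `[0]` is never followed directly by another zero, since that would make the chain nondeterministic.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical representation: states are 0..n-1, letters '0' and '1'
// map to alphabet indices 0 and 1, and -1 marks a missing transition.
struct Dfa {
  std::vector<std::vector<int>> delta;
  int start = 0;
  std::vector<bool> accepting;
};

// Builds a chain automaton for a pattern made of '0', '1' and "[0]",
// where "[0]" stands for one or more zeros.
Dfa build_dfa(const std::string& pattern) {
  Dfa dfa;
  dfa.delta.push_back({-1, -1});
  bool last_was_run = false;
  for (std::size_t i = 0; i < pattern.size(); ++i) {
    const bool run = pattern.compare(i, 3, "[0]") == 0;
    const int letter = run ? 0 : pattern[i] - '0';
    assert(letter == 0 || letter == 1);
    // A zero right after "[0]" would need two transitions on '0'.
    assert(!(last_was_run && letter == 0));
    const int from = static_cast<int>(dfa.delta.size()) - 1;
    dfa.delta.push_back({-1, -1});
    dfa.delta[from][letter] = from + 1;
    if (run) {
      dfa.delta[from + 1][0] = from + 1;  // self-loop for the extra zeros
      i += 2;                             // skip the rest of "[0]"
    }
    last_was_run = run;
  }
  dfa.accepting.assign(dfa.delta.size(), false);
  dfa.accepting.back() = true;  // only the end of the chain accepts
  return dfa;
}
```

Each pattern token adds one state, so the automaton has one more state than the pattern has tokens.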

My approach is to generalize the usual dynamic program. Let D be a table indexed by Q × {0, …, |w|}. At the end of the computation, D(q, i) will be

min_{x : δ(q0, x) = q} d(x, w[1…i]),

where w[1…i] denotes the length-i prefix of w (so w[1…0] is the empty string ε and w[i] is the i-th letter of w). In other words, D(q, i) is the distance between w[1…i] and the set of strings that leave the automaton in state q. The overall answer is

min_{q ∈ F} D(q, |w|),

or the distance between w and the set of strings that leave the automaton in one of the final states, i.e., the language L.


The first column of D consists of the entries D(q, 0) for every state q ∈ Q. Since for every string x ∈ Σ* it holds that d(x, ε) = |x|, the entry D(q, 0) is the length of the shortest path from q0 to q in the graph defined by the transition function δ. Compute these entries by running "Dijkstra's algorithm" from q0 (actually just breadth-first search because the edge-lengths are all 1).
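
A minimal sketch of this first-column computation, assuming the automaton is stored as a table `delta[q][a]` with -1 for a missing transition (the table layout and the name `first_column` are assumptions of this example):

```cpp
#include <limits>
#include <queue>
#include <vector>

constexpr int kInf = std::numeric_limits<int>::max() / 2;

// Breadth-first search from q0 over the transition graph.
// The result holds D(q, 0) for every state q, or kInf if q is unreachable.
std::vector<int> first_column(const std::vector<std::vector<int>>& delta,
                              int q0) {
  std::vector<int> dist(delta.size(), kInf);
  std::queue<int> frontier;
  dist[q0] = 0;
  frontier.push(q0);
  while (!frontier.empty()) {
    const int q = frontier.front();
    frontier.pop();
    for (int next : delta[q]) {
      if (next >= 0 && dist[next] == kInf) {
        dist[next] = dist[q] + 1;  // every edge has length 1
        frontier.push(next);
      }
    }
  }
  return dist;
}
```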

Subsequent columns of D are computed from the preceding column. First compute an auxiliary quantity D'(q, i) by minimizing over several possibilities.

Exact match: For every state r ∈ Q such that δ(r, w[i]) = q, include D(r, i - 1).

Deletion: Include D(q, i - 1) + 1.

Substitution: For every state r ∈ Q and every letter a ∈ Σ ∖ {w[i]} such that δ(r, a) = q, include D(r, i - 1) + 1.

Note that I have left out Insertion. As with the first column, this is because it may be necessary to insert many letters here. To compute the D(q, i)s from the D'(q, i)s, run Dijkstra on an implicit graph with vertices Q ∪ {s} and, for every q ∈ Q, an edge of length D'(q, i) from the super-source s to q and, for every q ∈ Q and a ∈ Σ, an edge of length 1 from q to δ(q, a). Let the D(q, i)s be the final distances.


I believe that this algorithm, if implemented well (with a heap specialized to support Dijkstra with unit lengths), has running time O(|Q| |w| |Σ|), which, for small alphabets Σ, is comparable to the usual Levenshtein DP.
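
The whole procedure might be sketched as follows. The table layout (`delta[q][a]`, -1 for a missing transition, letters '0' and '1' mapped to indices 0 and 1) and the function names are assumptions of this sketch. It uses a plain `std::priority_queue` instead of the specialized heap mentioned above, so it carries an extra logarithmic factor.

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

constexpr int kInf = 1 << 29;

// Insertion step: Dijkstra from a super-source whose edge to state q has
// length dist[q]; every automaton transition is an edge of length 1.
void relax_insertions(const std::vector<std::vector<int>>& delta,
                      std::vector<int>& dist) {
  using Item = std::pair<int, int>;  // (distance, state)
  std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
  for (int q = 0; q < static_cast<int>(dist.size()); ++q) {
    if (dist[q] < kInf) heap.push({dist[q], q});
  }
  while (!heap.empty()) {
    const Item top = heap.top();
    heap.pop();
    if (top.first > dist[top.second]) continue;  // stale heap entry
    for (int next : delta[top.second]) {
      if (next >= 0 && top.first + 1 < dist[next]) {
        dist[next] = top.first + 1;
        heap.push({dist[next], next});
      }
    }
  }
}

// Levenshtein distance from w to the language of the automaton.
int distance_to_language(const std::vector<std::vector<int>>& delta, int q0,
                         const std::vector<bool>& accepting,
                         const std::string& w) {
  const int n = static_cast<int>(delta.size());
  std::vector<int> prev(n, kInf);
  prev[q0] = 0;
  relax_insertions(delta, prev);  // first column: path lengths from q0
  for (char c : w) {
    const int letter = c - '0';
    std::vector<int> cur(n, kInf);  // plays the role of D'(., i)
    for (int r = 0; r < n; ++r) {
      if (prev[r] >= kInf) continue;
      cur[r] = std::min(cur[r], prev[r] + 1);  // deletion of c
      for (int a = 0; a < static_cast<int>(delta[r].size()); ++a) {
        const int q = delta[r][a];
        if (q < 0) continue;
        const int cost = (a == letter) ? 0 : 1;  // match or substitution
        cur[q] = std::min(cur[q], prev[r] + cost);
      }
    }
    relax_insertions(delta, cur);  // turns D' into D
    prev = std::move(cur);
  }
  int best = kInf;
  for (int q = 0; q < n; ++q) {
    if (accepting[q]) best = std::min(best, prev[q]);
  }
  return best;
}
```

Only two columns are kept at a time, since each column depends on the preceding one alone.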

zxc
  • Isn't Dijkstra's with unit lengths just BFS, and don't you use a vector rather than a heap? – Running Wild Apr 21 '12 at 18:24
  • The super-source version has non-unit length edges from the super-source, so it's not really BFS, but the underlying collection can be made similarly efficient. I say Dijkstra when describing the computation of D(., 0) because, for whatever reason, some people don't associate BFS with shortest paths. – zxc Apr 21 '12 at 18:32

I would propose that you use dynamic programming for this one. The DP is two-dimensional: xi is the current index in the pattern string x, yi is the current index in the string y, and the value of each subproblem is the minimum edit distance between the substrings x[xi..x.size-1] and y[yi..y.size-1].

Here is how you can find the minimum edit distance between an x pattern given as you explain and a fixed y string. I will assume that the symbol @ in the x-pattern means any positive number of zeros. Also, I will use some global variables to make the code easier to read.

#include <cstring>  // memset
#include <iostream>
#include <string>
using namespace std;


const int max_len = 1000;
const int NO_SOLUTION = -2;
int dp[max_len][max_len];

string x; // pattern;
string y; // to compute minimum edit distance to
int solve(int xi /* index in x */, int yi /* index in y */) {
  if (yi + 1 == y.size()) {
    if (xi + 1 != x.size()) {
      return dp[xi][yi] = NO_SOLUTION;
    } else {
      if (x[xi] == y[yi] || (y[yi] == '0' && x[xi] == '@')) {
        return dp[xi][yi] = 0;
      } else {
        return dp[xi][yi] = 1; //  need to change the character 
      }
    }
  }
  if (xi + 1 == x.size()) {
    if (x[xi] != '@') {
      return dp[xi][yi] = NO_SOLUTION;
    }
    int number_of_ones = 0;
    for (int j = yi; j < y.size(); ++j) {
      if (y[j] == '1') {
        number_of_ones++;
      }
    }
    return dp[xi][yi] = number_of_ones;
  }
  int best = NO_SOLUTION;
  if (x[xi] != '@') {
    int temp = ((dp[xi + 1][yi + 1] == -1)?solve(xi + 1, yi +1):dp[xi + 1][yi +1]);
    if (temp != NO_SOLUTION && x[xi] != y[yi]) {
      temp++;
    }
    best = temp;
  } else {
    int temp = ((dp[xi + 1][yi + 1] == -1)?solve(xi + 1, yi +1):dp[xi + 1][yi +1]);
    if (temp != NO_SOLUTION) {
      if (y[yi] != '0') {
        temp++;
      }
      best = temp;
    }

    int edit_distance = 0; // number of '1' covered by the '@'

    // Here i represents the number of chars covered by the '@'
    for (int i = 1; i < y.size(); ++i) {
      if (yi + i >= y.size()) {
        break;
      }
      int temp = ((dp[xi][yi + i] == -1)?solve(xi, yi + i):dp[xi][yi + i]);
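      // y[yi + i - 1] is the i-th character covered by the '@';
      // count it before any early exit so the running total stays correct
      if (y[yi + i - 1] != '0') {
        edit_distance++;
      }
      if (temp == NO_SOLUTION) {
        continue;
      }
      temp += edit_distance;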
      if (best == NO_SOLUTION || temp < best) {
        best = temp;
      }
    }
  }
  return dp[xi][yi] = best;  // memoize so each subproblem is solved once
}

int main() {
  memset(dp, -1, sizeof(dp));
  cin >> x >> y;
  cout << "Minimum possible edit distance is: " << solve(0,0) << endl;
  return 0;
}
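
Since the program above expects '@' where the question writes `[0]`, a small conversion step could be placed in front of it. The helper name `to_at_notation` is made up for this sketch.

```cpp
#include <cstddef>
#include <string>

// Replaces every "[0]" in the pattern with a single '@'
// and copies all other characters unchanged.
std::string to_at_notation(const std::string& pattern) {
  std::string out;
  for (std::size_t i = 0; i < pattern.size(); ++i) {
    if (pattern.compare(i, 3, "[0]") == 0) {
      out += '@';
      i += 2;  // skip the remaining "0]"
    } else {
      out += pattern[i];
    }
  }
  return out;
}
```

For example, `11[0]1111[0]` becomes `11@1111@`, which can then be read into x.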

Hope this helps.

Ivaylo Strandjev