Understanding the Baeza-Yates Régnier algorithm (multiple string matching, extended from Boyer-Moore)

Question

First of all, sorry for the length; I tried to summarize my research so that everyone can follow it.

R. Baeza-Yates and M. Régnier published in 1990 a new algorithm for searching for a two-dimensional m × m pattern in a two-dimensional n × n text. The publication is very well written and quite understandable for a novice like me; the algorithm is described in pseudocode and I was able to implement it successfully.

One part of the BYR algorithm requires the Aho-Corasick algorithm, which searches for occurrences of multiple keywords in a text string. However, they also say that this part of their algorithm can be greatly improved by replacing Aho-Corasick with the Commentz-Walter algorithm (based on Boyer-Moore rather than on Knuth-Morris-Pratt). They mention an alternative to the Commentz-Walter algorithm that they developed themselves. It is described and explained in their previous publication (see the 4th chapter).

This is where my problem lies. As I said, the algorithm goes through the text and checks whether it contains a word from the set of keywords. The words are reversed and stored in a trie. To be efficient, the algorithm sometimes skips a number of letters when it knows that no match can be found there.

Trie

To determine the number of characters that can be skipped, two tables d and dd have to be computed. Then, the algorithm is very simple:

The algorithm works as follows:

We align the root of the trie with position m in the text, and we start matching the text from right to left following the corresponding path in the trie.

If a match is found (final node), we output the index of the corresponding string.

After a match or mismatch, we move the trie further in the text using the maximum of the shift associated to the current node (means dd), and the value of d[x], where x is the character in the text corresponding to the root of the trie.

Start matching the trie again from right to left in the new position.
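
The four steps above can be sketched in C++ roughly as follows. This is only a minimal sketch under assumptions that do not come from the paper: the names `Node`, `build_trie`, `build_d` and `search` are made up for illustration, position m is taken to be the length of the shortest keyword, `d[x]` is given a Horspool-style definition capped at that length, and every `dd` value is left at the placeholder 1 (always safe, never fast), because computing the real `dd` is exactly the open part of this question.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One trie node. The trie stores the keywords reversed, so a path from the
// root spells a keyword suffix read from right to left.
struct Node {
    std::map<char, int> next;  // child node index per character
    int keyword = -1;          // index of the keyword ending here, or -1
    int dd = 1;                // placeholder shift (assumption): 1 is always safe
};

// Build the trie of reversed keywords (hypothetical helper).
std::vector<Node> build_trie(const std::vector<std::string>& keywords) {
    std::vector<Node> trie(1);  // node 0 is the root
    for (int i = 0; i < (int)keywords.size(); ++i) {
        int cur = 0;
        for (auto it = keywords[i].rbegin(); it != keywords[i].rend(); ++it) {
            auto found = trie[cur].next.find(*it);
            if (found != trie[cur].next.end()) {
                cur = found->second;
            } else {
                int child = (int)trie.size();
                trie[cur].next[*it] = child;
                trie.emplace_back();
                cur = child;
            }
        }
        trie[cur].keyword = i;
    }
    return trie;
}

// Assumed definition of d[x]: distance from a keyword's last character back
// to the closest earlier occurrence of x in any keyword, capped at min_len.
std::array<int, 256> build_d(const std::vector<std::string>& keywords, int min_len) {
    std::array<int, 256> d;
    d.fill(min_len);
    for (const std::string& p : keywords) {
        for (int k = 1; k < (int)p.size(); ++k) {
            unsigned char x = (unsigned char)p[p.size() - 1 - k];
            d[x] = std::min(d[x], k);
        }
    }
    return d;
}

// Returns (0-based end index in text, keyword index) for every occurrence.
std::vector<std::pair<int, int>> search(const std::string& text,
                                        const std::vector<std::string>& keywords) {
    assert(!keywords.empty());
    int min_len = (int)keywords[0].size();
    for (const std::string& p : keywords) {
        assert(!p.empty());
        min_len = std::min(min_len, (int)p.size());
    }
    std::vector<Node> trie = build_trie(keywords);
    std::array<int, 256> d = build_d(keywords, min_len);

    std::vector<std::pair<int, int>> hits;
    int pos = min_len - 1;  // text index currently aligned with the root
    while (pos < (int)text.size()) {
        int node = 0;
        int i = pos;
        while (true) {  // follow the trie from right to left
            if (trie[node].keyword >= 0) hits.emplace_back(pos, trie[node].keyword);
            if (i < 0) break;
            auto it = trie[node].next.find(text[i]);
            if (it == trie[node].next.end()) break;
            node = it->second;
            --i;
        }
        // Shift by the larger of the node's dd and d[x] for the root character.
        pos += std::max(trie[node].dd, d[(unsigned char)text[pos]]);
    }
    return hits;
}
```

With the keywords "abca" and "bc" and the text "abcabcaxbc", this sketch reports "bc" ending at indices 2, 5 and 9 and "abca" ending at indices 3 and 6 (0-based). Storing real dd values in `Node::dd` would leave the loop unchanged and only make some shifts longer.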

My problem is that I do not know how to compute the dd function. In their publication, R. Baeza-Yates and M. Regnier propose a formal definition of it:

dd function

p_i is a word from the set of keywords and j is the index of a letter in this word, so p_i[j] corresponds to a node in the trie shown above. The number in each node is dd(node). L is the number of words, and m_i is the number of letters in the word p_i.

They give no indication of how to construct this function. They only recommend consulting the work of W. Rytter. That paper builds a function similar to the one needed, the difference being that it handles a single keyword rather than a set.

The definition of dd (called D there) is as follows:

D function

There are clear similarities with the previous definition, but I do not understand everything.

The pseudocode for the construction of this function is given in the paper; I have implemented it here in C++:

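#include <algorithm>
#include <iostream>

using std::cout;
using std::min;

int main() {
    int pattern[] = { 1, 2, 3, 1 };  /* I use int instead of char, simpler */
    const int n = sizeof(pattern) / sizeof(pattern[0]);  /* element count, independent of sizeof(int) */
    int D[n];
    int f[n];

    /* The pseudocode is 1-based, hence the "- 1" on every array access. */
    int j = n;
    int t = n + 1;

    /* Initialise every entry with its largest possible value. */
    for (int k = 1; k <= n; k++) {
        D[k - 1] = 2 * n - k;
    }

    /* First pass: scan the pattern from right to left, lowering D on mismatches. */
    while (j > 0) {
        f[j - 1] = t;
        while (t <= n && pattern[j - 1] != pattern[t - 1]) {
            D[t - 1] = min(D[t - 1], n - j);
            t = f[t - 1];
        }
        t = t - 1;
        j = j - 1;
    }

    int f1[n];
    int q = t;
    t = n + 1 - q;
    int q1 = 1;
    int j1 = 1;
    int t1 = 0;

    /* Failure function f1 over the first t characters of the pattern. */
    while (j1 <= t) {
        f1[j1 - 1] = t1;
        while (t1 >= 1 && pattern[j1 - 1] != pattern[t1 - 1]) {
            t1 = f1[t1 - 1];
        }
        t1 = t1 + 1;
        j1 = j1 + 1;
    }

    /* Second pass: lower the remaining entries of D using f1. */
    while (q < n) {
        for (int k = q1; k <= q; k++) {
            D[k - 1] = min(D[k - 1], n + q - k);
        }
        q1 = q + 1;
        q = q + t - f1[t - 1];
        t = f1[t - 1];
    }

    for (int i = 0; i < n; i++) {
        cout << D[i] << " ";
    }
    return 0;
}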

It works, but I do not know how to extend it to several words, or how to make it match the formal definition of dd given by Baeza-Yates and Régnier. I said that the two definitions were similar, but I do not know to what extent.

I did not find any other information about their algorithm, so I cannot work out how to implement the construction of dd. I am looking for someone who understands it and could show me how to get there by explaining the link between the definitions of D and dd.

I don't have anything to contribute, but it looks like you have done a very nice job of posing your question. +1. — Ira Baxter, Jun 27 '14 at 21:19
You might want to also post this over on [Computer Science](http://cs.stackexchange.com/). — 500 - Internal Server Error, Jun 27 '14 at 22:44

score 0 · Answer 1 · answered Jun 28 '14 at 05:32

I think d[x] corresponds to the bad character rule in http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm and D corresponds to the Good Suffix rule in the same article. This would mean that x in d[x] is not the character in the root of the tree, but the value of the first character in the text being searched that fails to match a child of the current node.

I think the idea is the same as Boyer-Moore. You move along the tree as long as you have a match, and when you have a mismatch you know two things: the character causing the mismatch, and the substring you have matched so far. Taking each of these things independently, you may be able to work out that if you shifted along the text being searched 1, 2, ..., k positions you still wouldn't have a match, because at these offsets the character that caused a mismatch would still cause a mismatch, or the portion of the text that previously matched would not match at this shifted offset. So you can skip on to the first offset not ruled out by either value.
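
To make the "portion of the text that previously matched" part concrete, here is a brute-force sketch of that reasoning for a set of keywords. It is not the construction from either paper. It assumes that the shift for a matched suffix `u` is the smallest shift of at least 1 after which some keyword still agrees with `u` wherever they overlap, and the names `compatible` and `safe_shift` are invented for this illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// True if keyword p, placed so that it ends `shift` positions to the right of
// the current alignment, agrees with the already matched text u (which ends
// at the current alignment) on every character where the two overlap.
bool compatible(const std::string& p, const std::string& u, int shift) {
    int m = (int)p.size();
    int len = (int)u.size();
    // k counts characters leftwards from the current alignment (k = 0 is the
    // character under the root of the trie).
    for (int k = 0; k < len; ++k) {
        int idx = m - 1 - shift - k;  // index in p that covers this character
        if (idx < 0) break;           // p does not reach this far to the left
        if (p[idx] != u[len - 1 - k]) return false;
    }
    return true;
}

// Smallest shift >= 1 that is not ruled out by the matched text u.
// Terminates because a shift equal to a keyword's length is always compatible.
int safe_shift(const std::vector<std::string>& keywords, const std::string& u) {
    assert(!keywords.empty());
    for (int shift = 1;; ++shift) {
        for (const std::string& p : keywords) {
            if (compatible(p, u, shift)) return shift;
        }
    }
}
```

For the single keyword "abca" (the question's pattern 1, 2, 3, 1 written with letters), `safe_shift` gives 1, 3, 3, 3 when 0, 1, 2, 3 characters have been matched. Adding the number of matched characters gives 1, 4, 5, 6, which is the question's D read from right to left, so at least for this example D looks like this shift plus the matched length. With the two keywords "abca" and "bc", the shift after matching "a" drops to 2, because "bc" could start right after it.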

Actually, this suggests a variant scheme in which d and dd provide not numbers but bit-masks: you AND the two bit-masks together and shift according to the position of the first bit that is still set. Presumably this doesn't save you enough to be worth the extra set-up time.
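
A minimal sketch of that bit-mask step, assuming bit k-1 of each mask is set when a shift by k is still possible. The function name `first_allowed_shift` and the 32-bit mask width are choices made only for this illustration, and how the masks would be precomputed is left open.

```cpp
#include <cassert>
#include <cstdint>

// by_char:   shifts not ruled out by the mismatching character
// by_suffix: shifts not ruled out by the text matched so far
// max_shift: shift to use when every shorter shift is ruled out
int first_allowed_shift(std::uint32_t by_char, std::uint32_t by_suffix, int max_shift) {
    assert(max_shift >= 1 && max_shift <= 32);
    std::uint32_t both = by_char & by_suffix;  // allowed by both rules
    for (int k = 1; k < max_shift; ++k) {
        if (both & (std::uint32_t{1} << (k - 1))) return k;
    }
    return max_shift;
}
```

For example, if the character allows shifts 2 and 3 (mask 0x6) and the matched suffix allows shifts 3 and 4 (mask 0xC), the combined mask is 0x4 and the shift is 3.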
