First of all, excuse me if I write a lot; I tried to summarize my research so that everyone can understand.
In 1990, R. Baeza-Yates and M. Régnier published a new algorithm for searching for a two-dimensional m × m pattern in a two-dimensional n × n text. The publication is very well written and quite understandable for a novice like me. The algorithm is described in pseudocode, and I was able to implement it successfully.
One part of the BYR algorithm requires the Aho-Corasick algorithm, which searches for occurrences of multiple keywords in a text string. However, the authors also say that this part of their algorithm can be greatly improved by using not Aho-Corasick but the Commentz-Walter algorithm (based on Boyer-Moore rather than Knuth-Morris-Pratt). They also mention an alternative to the Commentz-Walter algorithm that they developed themselves. It is described and explained in their previous publication (see chapter 4).
This is where my problem lies. As I said, the algorithm goes through the text and checks whether it contains a word from the set of keywords. The words are reversed and placed in a trie. To be efficient, the algorithm sometimes needs to skip a number of letters when it knows that no match can be found there.
To determine the number of characters that can be skipped, two tables, d and dd, have to be computed. Once they are available, the algorithm is very simple. It works as follows:
- We align the root of the trie with position m in the text, and we start matching the text from right to left following the corresponding path in the trie.
- If a match is found (final node), we output the index of the corresponding string.
- After a match or mismatch, we move the trie further in the text using the maximum of the shift associated with the current node (that is, dd) and the value of d[x], where x is the character in the text corresponding to the root of the trie.
- Start matching the trie again from right to left in the new position.
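The steps above can be sketched in C++ roughly as follows. This is only a minimal sketch under assumptions of mine, not the authors' implementation: since dd is the missing piece, it is replaced here by the trivial shift 1, and d is computed in the Set Horspool manner (distance from the end of a keyword to the nearest earlier occurrence of the character, capped at the shortest keyword length m), which may differ from the exact definition in the paper. The names Node and searchReversedTrie are hypothetical, positions are 0-based, and keywords are assumed to be non-empty.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Node {
    std::map<char, int> next;  // child node index for each character
    int word = -1;             // index of the keyword ending here, or -1
};

// Hypothetical helper: returns (end position in text, keyword index) pairs.
std::vector<std::pair<int, int>> searchReversedTrie(
        const std::string& text, const std::vector<std::string>& words) {
    // Build the trie of reversed keywords; node 0 is the root.
    std::vector<Node> trie(1);
    std::size_t m = words.at(0).size();  // becomes the shortest keyword length
    for (std::size_t i = 0; i < words.size(); ++i) {
        int cur = 0;
        for (auto it = words[i].rbegin(); it != words[i].rend(); ++it) {
            auto found = trie[cur].next.find(*it);
            if (found != trie[cur].next.end()) {
                cur = found->second;
            } else {
                trie.push_back(Node{});
                int id = static_cast<int>(trie.size()) - 1;
                trie[cur].next[*it] = id;
                cur = id;
            }
        }
        trie[cur].word = static_cast<int>(i);
        m = std::min(m, words[i].size());
    }

    // Assumed d table (Set Horspool style): default shift m, reduced when the
    // character occurs in a keyword before its last position.
    std::array<int, 256> d;
    d.fill(static_cast<int>(m));
    for (const std::string& w : words) {
        for (std::size_t k = 0; k + 1 < w.size(); ++k) {
            int dist = static_cast<int>(w.size() - 1 - k);
            int& slot = d[static_cast<unsigned char>(w[k])];
            slot = std::min(slot, dist);
        }
    }

    std::vector<std::pair<int, int>> hits;
    std::size_t pos = m - 1;  // text position aligned with the trie root
    while (pos < text.size()) {
        int cur = 0;
        // Match from right to left, following the path in the trie.
        for (std::size_t k = pos + 1; k-- > 0; ) {
            auto found = trie[cur].next.find(text[k]);
            if (found == trie[cur].next.end()) break;
            cur = found->second;
            if (trie[cur].word >= 0)
                hits.emplace_back(static_cast<int>(pos), trie[cur].word);
        }
        int dd = 1;  // placeholder: the real dd(node) is what is missing
        pos += std::max(dd, d[static_cast<unsigned char>(text[pos])]);
    }
    return hits;
}
```

With the placeholder dd = 1 the search stays correct but only skips as far as d allows; plugging in a real dd(node) would change nothing except the line that computes the shift.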
My problem is that I do not know how to compute the dd function. In their publication, R. Baeza-Yates and M. Régnier propose a formal definition of it:
Here p_i is a word in the set of keywords and j is the index of a letter in this word, so p_i[j] corresponds to a node in the trie I showed earlier. The number in each node is dd(node). L is the number of words, and m_i is the number of letters in the word p_i.
They give no indication of how to construct this function. They only refer the reader to the work of W. Rytter. That paper builds a similar function, the difference being that it handles a single keyword rather than a set.
The definition of dd (called D in that paper) is as follows:
There are similarities with the previous definition, but I do not understand everything.
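For reference, if D is the usual Boyer-Moore shift table as formulated by Knuth, Morris and Pratt (this is my assumption about the definition), it can be evaluated directly by brute force, which is handy for checking a faster construction on small patterns. The name bruteForceD is hypothetical; the comments use 1-based positions as in the papers, and D[j] is stored at index j - 1.

```cpp
#include <vector>

// Assumed definition, for a pattern p[1..n]:
//   D[j] = min { s + n - j : s >= 1,
//                (s >= j or p[j-s] != p[j]),
//                (s >= k or p[k-s] == p[k]) for all j < k <= n }
std::vector<int> bruteForceD(const std::vector<int>& p) {
    const int n = static_cast<int>(p.size());
    std::vector<int> D(n);
    for (int j = 1; j <= n; ++j) {
        // Try shifts in increasing order; s = n always qualifies, so the
        // loop terminates.
        for (int s = 1; ; ++s) {
            // The character that caused the mismatch must differ after the shift.
            bool ok = (s >= j) || (p[j - s - 1] != p[j - 1]);
            // The already matched suffix p[j+1..n] must still match.
            for (int k = j + 1; ok && k <= n; ++k) {
                ok = (s >= k) || (p[k - s - 1] == p[k - 1]);
            }
            if (ok) {
                D[j - 1] = s + n - j;
                break;
            }
        }
    }
    return D;
}
```

For the pattern { 1, 2, 3, 1 } this yields 6 5 4 1. The multi-keyword dd would presumably take the minimum of such shifts over all words in the set, but I cannot confirm that from the definition alone.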
The pseudocode for constructing this function is given in the paper, and I have implemented it in C++:
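#include <algorithm>
#include <iostream>

int main() {
    int pattern[] = { 1, 2, 3, 1 }; /* I use int instead of char, simpler */
    const int n = sizeof(pattern) / sizeof(pattern[0]);
    int D[n];
    int f[n];

    // Start from the largest possible value, 2n - k, for every position k.
    for (int k = 1; k <= n; k++) {
        D[k - 1] = 2 * n - k;
    }

    // First phase: scan the pattern from right to left, building the
    // failure table f and lowering D where a mismatch is detected.
    int j = n;
    int t = n + 1;
    while (j > 0) {
        f[j - 1] = t;
        while (t <= n && pattern[j - 1] != pattern[t - 1]) {
            D[t - 1] = std::min(D[t - 1], n - j);
            t = f[t - 1];
        }
        t = t - 1;
        j = j - 1;
    }

    // Second phase: build a left-to-right failure table f1 on the
    // first n + 1 - q characters of the pattern.
    int f1[n];
    int q = t;
    t = n + 1 - q;
    int q1 = 1;
    int j1 = 1;
    int t1 = 0;
    while (j1 <= t) {
        f1[j1 - 1] = t1;
        while (t1 >= 1 && pattern[j1 - 1] != pattern[t1 - 1]) {
            t1 = f1[t1 - 1];
        }
        t1 = t1 + 1;
        j1 = j1 + 1;
    }

    // Final correction of D for the positions q1..q, repeated while q < n.
    while (q < n) {
        for (int k = q1; k <= q; k++) {
            D[k - 1] = std::min(D[k - 1], n + q - k);
        }
        q1 = q + 1;
        q = q + t - f1[t - 1];
        t = f1[t - 1];
    }

    for (int i = 0; i < n; i++) {
        std::cout << D[i] << " ";
    }
    return 0;
}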
It works, but I do not know how to extend it to several words, nor how to make it agree with the formal definition of dd given by Baeza-Yates and Régnier. I said that the two definitions were similar, but I do not know to what extent.
I did not find any other information about their algorithm, so I cannot work out how to implement the construction of dd. I am looking for someone who could understand it and show me how to get there, explaining the link between the definitions of D and dd.