An answer to this question cites page 340 of "Text algorithms" by Crochemore and Rytter for a linear-time algorithm to compute the period of a string. However, it's quite complex, and the following, adapted from the maximal suffix algorithm used in the Two Way algorithm (by Crochemore and Perrin), seems correct for computing the period:

#include <stddef.h>

size_t period_of(const char *x)
{
    size_t j=1, k=0, p=1;
    while (x[j+k]) {
        if (x[j+k] != x[k]) {
            j += k?k:1;        // Previously: j += k+1;
            k = 0;
            p = j;
        } else if (k != p) {
            k++;
        } else {
            j += p;
            k = 0;
        }
    }
    return p;
}

Their version in Two Way, from which this is adapted, computes the period of the maximal suffix as a side effect of computing the maximal suffix. However, unless I'm missing something, the validity of the logic does not seem to depend on the maximal suffix property.

Is the above correct? If not, can you provide a counterexample that shows where it fails?

R.. GitHub STOP HELPING ICE
  • I'm not sure if this is the best site for this question. Would computerscience.stackexchange.com be better? – Daniel H Nov 13 '18 at 03:54
  • @DanielH while computerscience might be better, this is still an ontopic question here. Imho – bolov Nov 13 '18 at 03:55
  • FWIW, cs.SE is virtually useless for getting help because it's so low-traffic. With the question implementation written in C, I think it also makes more sense to ask here where I can tag it C and expect people who read it to know the language it's written in, rather than worry about language barriers. – R.. GitHub STOP HELPING ICE Nov 13 '18 at 04:00
  • As a side note, the behavior of the function is undefined if `x` points to an empty string. Not sure if this is expected. – DYZ Nov 13 '18 at 04:15
  • @DYZ: Indeed, I'll fix that as soon as I can test a fix, but it's not important to high level correctness questions. – R.. GitHub STOP HELPING ICE Nov 13 '18 at 04:18
  • what is considered the period of `AEITQAVAA`? `AEITQAVA` or `AEITQAVAA`? Because I think it should be `AEITQAVA` but your algorithm says `AEITQAVAA`. – bolov Nov 13 '18 at 06:11
  • @bolov: Indeed, the intent is the former and that's what I expected the algorithm to yield, so I need to check and see whether this is just a stupid bug or something fundamental. – R.. GitHub STOP HELPING ICE Nov 13 '18 at 15:47
  • @bolov: I think it's a fixable error - rather than `j += k+1`, it should be `j += k?k:1`. Otherwise, start of a new period in the middle of a false repetition is missed. This creates another case where `j+k` does not increment on loop iteration, but it can only happen once without an intervening increment, so it doesn't affect linear time. Updating source with a comment on the fix. – R.. GitHub STOP HELPING ICE Nov 14 '18 at 01:49
  • @bolov: In some sense your comment did lead to an answer in the negative, I think. – R.. GitHub STOP HELPING ICE Nov 14 '18 at 04:05
  • @R.. found the example by brute force. If you want I can provide the code and you can test modifications of the algorithm yourself. It's in C++ though. – bolov Nov 14 '18 at 04:09
  • @bolov: See the answer I posted, which I think settles it... – R.. GitHub STOP HELPING ICE Nov 14 '18 at 04:18

1 Answer

A counterexample even after the fix is:

aabaaaba

At position 3 (counting from zero), the algorithm starts matching the prefix against a candidate period of 3. But when it gets an `a` instead of a `b` at position 5, it wrongly jumps the candidate period up to 5, missing the actual period of 4.
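To follow the failure step by step, here is a minimal sketch that instruments the loop from the question (with the `j += k?k:1` fix). The name `period_of_traced` and the `verbose` flag are hypothetical additions for illustration; the control flow is meant to be unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Same loop as the question's period_of (with the `j += k?k:1` fix).
   When `verbose` is nonzero, every mismatch prints how the candidate
   period p changes. Hypothetical helper, for illustration only. */
size_t period_of_traced(const char *x, int verbose)
{
    assert(x[0] != '\0');   /* the loop reads x[1], so x must be non-empty */
    size_t j = 1, k = 0, p = 1;
    while (x[j+k]) {
        if (x[j+k] != x[k]) {
            if (verbose)
                printf("mismatch at %zu ('%c' vs '%c'): p %zu -> ",
                       j+k, x[j+k], x[k], p);
            j += k ? k : 1;
            k = 0;
            p = j;
            if (verbose)
                printf("%zu\n", p);
        } else if (k != p) {
            k++;
        } else {
            j += p;
            k = 0;
        }
    }
    return p;
}
```

Calling `period_of_traced("aabaaaba", 1)` should show the candidate period going from 3 straight to 5 at the mismatch on position 5. By my hand trace the function then ends up returning 7, whereas the true period is 4.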

The algorithm in Two Way from which this was adapted computes the period of the maximal suffix as a side effect of finding the maximal suffix, and it does rely on the maximal suffix property. Instead of the `!=` condition, it has separate `>` and `<` conditions, where one of the two will replace the start of the candidate suffix and the other will extend the running period. Stated non-rigorously, the above kind of situation cannot arise, because either `b > a`, in which case the suffix would start at a `b`, or `a > b`, in which case `aaa > aab`. I suspect reading the paper in more detail (which requires deciphering its awful pseudocode notation with one-based indexing) will clarify the rest.
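For comparison, here is a rough zero-based sketch of a maximal-suffix loop with the separate `<` and `>` branches described above. It follows the shape in which the computation is commonly rendered in C, not the paper's own notation; the name `maximal_suffix`, its signature, and the variable names are assumptions made for this illustration. Note that the period it reports is that of the maximal suffix, not of the whole string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the start index of the lexicographically
   maximal suffix of x[0..n-1] and stores that suffix's period in *period. */
size_t maximal_suffix(const unsigned char *x, size_t n, size_t *period)
{
    assert(period != NULL);
    size_t i = 0;   /* start of the current candidate suffix */
    size_t j = 1;   /* start of the rival suffix compared against it */
    size_t k = 0;   /* offset of the characters currently compared */
    size_t p = 1;   /* period of the candidate over the part scanned so far */

    while (j + k < n) {
        unsigned char a = x[j + k], b = x[i + k];
        if (a == b) {
            /* Match: advance within the period, or wrap to the next one. */
            if (k + 1 == p) {
                j += p;
                k = 0;
            } else {
                k++;
            }
        } else if (a < b) {
            /* Rival is smaller: candidate stays, running period grows. */
            j += k + 1;
            k = 0;
            p = j - i;
        } else {
            /* Rival is larger: it becomes the new candidate suffix. */
            i = j;
            j = i + 1;
            k = 0;
            p = 1;
        }
    }
    *period = p;
    return i;
}
```

On `aabaaaba`, my hand trace of this sketch gives start index 2 (the suffix `baaaba`) with period 4. The `!=` version has no counterpart to the `>` branch, which is where the candidate start gets replaced rather than the period extended.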

Unfortunately I'm fairly convinced that the algorithm I asked about is not recoverable.

Further note that in the example, `a` need not be a single character. It can be an arbitrarily long pattern. This seems to preclude any trivial linear-time fixups.
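One way to probe this is to expand the template `aabaaaba` with multi-character blocks and compare the function from the question against a naive quadratic reference. The sketch below does that; `naive_period` and `expand_template` are hypothetical helpers introduced here, and whether a particular pair of blocks reproduces the failure is something to check, not assume.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The function from the question, with the `j += k?k:1` fix applied. */
size_t period_of(const char *x)
{
    size_t j = 1, k = 0, p = 1;
    while (x[j+k]) {
        if (x[j+k] != x[k]) {
            j += k ? k : 1;
            k = 0;
            p = j;
        } else if (k != p) {
            k++;
        } else {
            j += p;
            k = 0;
        }
    }
    return p;
}

/* Reference: smallest p >= 1 with x[i] == x[i+p] for all valid i.
   Quadratic time, intended only for checking small inputs. */
size_t naive_period(const char *x)
{
    size_t n = strlen(x);
    for (size_t p = 1; p < n; p++) {
        size_t i = 0;
        while (i + p < n && x[i] == x[i + p])
            i++;
        if (i + p == n)
            return p;
    }
    return n;
}

/* Builds the template "aabaaaba" with each 'a' replaced by block_a and
   each 'b' replaced by block_b. The result must fit in out_size bytes. */
void expand_template(const char *block_a, const char *block_b,
                     char *out, size_t out_size)
{
    const char *tmpl = "aabaaaba";
    size_t used = 0;
    assert(out_size > 0);
    out[0] = '\0';
    for (; *tmpl; tmpl++) {
        const char *blk = (*tmpl == 'a') ? block_a : block_b;
        size_t len = strlen(blk);
        assert(used + len < out_size);
        memcpy(out + used, blk, len + 1);   /* copies the terminator too */
        used += len;
    }
}
```

With the blocks `ab` and `c`, the expanded string is `ababcabababcab`, whose true period is 7. By my hand trace, `period_of` jumps its candidate from 5 to 9 on that input and never lands on 7.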

R.. GitHub STOP HELPING ICE