
I'm working on an implementation of Ukkonen's linear-time suffix tree construction algorithm, and I plan to implement improvements suggested by, e.g., Kurtz and N. J. Larsson (for example, edge links instead of suffix links).

While testing, I got a mix of passing and failing results depending on the specific strings I used, and I had similar experiences with a few implementations I found online. This made me wonder:

  • Are there any known, specifically built (preferably simple/short) strings for unit-testing suffix trees to ensure the algorithm works precisely in all branching scenarios?

  • Furthermore, are there any good methods to separate the testing of the tree building algorithm from the testing of the traversal/lookup algorithm?

I know this question doesn't have a single specific correct answer, but I think it could serve as a good reference point for people working on similar algorithms.

My current unit-testing approach is quite primitive (C# with NUnit):

[Test]
public void Contains_Simple_ShouldReturnTrue()
{
    var s = "bananasbanananananananananabananas";
    var st = SuffixTree.Build(s);

    var t1 = s.Substring(0, 10);
    Assert.IsTrue(st.Contains(t1));
}

// ... Other simple test cases


// This test fails, but it's not particularly helpful for bugfixing
[Test]
public void Contains_DynamicBarrage_OnLongString_ShouldReturnTrue()
{
    const int   CYCLES = 200,
                MAXLEN = 200;

    var s = "olbafuynhfcxzqhnebecxjrfwfttw"; // Shortened for sanity
    var st = SuffixTree.Build(s);
    var r = new Random();

    for (int i = 0; i < CYCLES; i++)
    {
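        // Random.Next's upper bound is exclusive: these bounds let pos reach the
        // last index and let len run to the end of the string.
        var pos = r.Next(0, s.Length);
        var len = r.Next(1, Math.Min(s.Length - pos, MAXLEN) + 1);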

        Assert.IsTrue(st.Contains(s.Substring(pos, len)));
    }
}
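
One way to make the barrage test reproducible is to drop the randomness and compare the lookup against a brute-force search for every substring, plus every substring extended by one character (which also produces patterns that must not be found). The following is only a minimal sketch: `SubstringOracle` and `FindMismatches` are hypothetical names, and it assumes the tree's lookup can be passed as a `Func<string, bool>`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of any suffix tree library.
public static class SubstringOracle
{
    // Returns every pattern on which `contains` disagrees with a brute-force
    // ordinal search of `text`. An empty list means the lookup agreed everywhere.
    public static List<string> FindMismatches(
        string text, Func<string, bool> contains, string alphabet)
    {
        // Candidates: every non-empty substring, and each of them extended
        // by one symbol of the alphabet (many of these are NOT substrings).
        var candidates = new HashSet<string>();
        for (int pos = 0; pos < text.Length; pos++)
        {
            for (int len = 1; len <= text.Length - pos; len++)
            {
                var sub = text.Substring(pos, len);
                candidates.Add(sub);
                foreach (char c in alphabet)
                    candidates.Add(sub + c);
            }
        }

        var mismatches = new List<string>();
        foreach (var pattern in candidates)
        {
            // Brute-force oracle: ordinal search in the original text.
            bool expected = text.IndexOf(pattern, StringComparison.Ordinal) >= 0;
            if (contains(pattern) != expected)
                mismatches.Add(pattern);
        }

        // Sort so a failing run always reports the same list in the same order.
        mismatches.Sort(StringComparer.Ordinal);
        return mismatches;
    }
}
```

In an NUnit test this could be used as `Assert.IsEmpty(SubstringOracle.FindMismatches(s, st.Contains, "abn"))`, assuming `st.Contains` takes a `string` and returns a `bool`. The number of candidates grows quadratically with the text length, so this is meant for short inputs. Short, highly repetitive strings such as "aaaaa", "abababab", "mississippi" and "abcabxabcd" are common informal choices, but they are not a proven covering set for all branching cases.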
Leaky
  • To verify that all branching conditions are properly tested, I would use a test coverage tool in combination with my unit test suite, expanding the test suite until (close to) 100 percent code coverage is achieved. -- As for that last test case you show, instead of doing CYCLES random samples, why not test for every possible sub-string explicitly? Personally, I'm not in favor of using Random in unit tests - how to reproduce a failure? – 500 - Internal Server Error Apr 17 '19 at 11:29
  • +1 for the test coverage tool, especially in the case of Ukkonen's algorithm, because despite its apparent simplicity, finding strings that will cover all the different cases (especially with the behavior of suffix links) is going to be laborious. I'd also go for tests with randomly built strings (on top of carefully selected ones of course), saving the seed when a test fails or the coverage grows. – m.raynal Apr 17 '19 at 12:04
  • Thanks, the test coverage tool sounds like a great idea. Will need to look into this, especially to find some free to use coverage tools (the one I heard about so far is not free). I'll also try to build short strings randomly - great tip, thanks. – Leaky Apr 17 '19 at 12:08
  • Another test I thought about: a good thing to check with Ukkonen's algorithm is that your actual implementation is indeed in `O(n)`, by benchmarking its execution time and verifying that it does indeed grow linearly w.r.t. the input size. It would be a pity to go through the trouble of implementing it for its linearity and the fact that it's online, only to realize that in the end it behaves like a suffix trie... – m.raynal Apr 17 '19 at 14:24
  • @m.raynal that's a good point too, I'll surely include a few test cases for that as well. But it's sure as hell better than *my* suffix trie; that takes 3 GB memory and forever to build with a 10K char string. :D With the suffix tree, build for 10K chars is basically instant, and memory use is low, so that's a good indication so far. Fortunately I'll need this to perform only up to around 20k characters. – Leaky Apr 17 '19 at 14:44
  • You probably can't test this using the public interface, but I would perform an in-order traversal, and at every leaf encountered, check that the corresponding suffix is lexicographically larger than the previous one (akin to how you would check correctness of an algorithm for sorting an array of integers by comparing each to the previous). At the same time you could also check that each suffix is encountered exactly once (by incrementing a counter for that suffix; at the end, every counter should be exactly 1). – j_random_hacker Apr 17 '19 at 18:24
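
The traversal check described in the last comment could be sketched roughly as follows. It tests the built structure directly, independent of any lookup code. Everything about the node layout here is an assumption: `Node`, its fields, and `SuffixTreeChecker` are hypothetical and would need to be mapped onto the real implementation's internals. The sketch also assumes a non-empty text ending in a unique terminator (e.g. `$`), so that every suffix ends at its own leaf.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical node layout; adapt to the fields of the real implementation.
public sealed class Node
{
    public int Start;             // index in the text where the incoming edge label begins
    public int Length;            // length of the incoming edge label (unused for the root)
    public int SuffixIndex = -1;  // start of the suffix for a leaf, -1 for internal nodes
    public Dictionary<char, Node> Children = new Dictionary<char, Node>();
}

public static class SuffixTreeChecker
{
    // Returns a list of problems; an empty list means every check passed.
    public static List<string> Check(Node root, string text)
    {
        var problems = new List<string>();
        var seen = new int[text.Length];  // how often each suffix index was met
        string previous = null;           // path label of the previously visited leaf

        void Visit(Node node, string path, bool isRoot)
        {
            if (node.Children.Count == 0)
            {
                int i = node.SuffixIndex;
                if (i < 0 || i >= text.Length)
                {
                    problems.Add($"leaf '{path}' has invalid suffix index {i}");
                    return;
                }
                seen[i]++;
                if (path != text.Substring(i))
                    problems.Add($"leaf {i} spells '{path}'");
                if (previous != null && string.CompareOrdinal(previous, path) >= 0)
                    problems.Add($"leaf '{path}' is not larger than '{previous}'");
                previous = path;
                return;
            }

            // A compact suffix tree has no unary internal nodes (root excepted).
            if (!isRoot && node.Children.Count < 2)
                problems.Add($"internal node '{path}' has fewer than 2 children");

            // Visit children in ordinal order of their first edge character.
            foreach (var kv in node.Children.OrderBy(kv => kv.Key))
            {
                var child = kv.Value;
                if (child.Length <= 0 || child.Start < 0 ||
                    child.Start + child.Length > text.Length)
                {
                    problems.Add($"edge below '{path}' is out of range");
                    continue;
                }
                var label = text.Substring(child.Start, child.Length);
                if (label[0] != kv.Key)
                    problems.Add($"edge '{label}' is stored under key '{kv.Key}'");
                Visit(child, path + label, false);
            }
        }

        Visit(root, "", true);

        // Every suffix must end at exactly one leaf.
        for (int i = 0; i < text.Length; i++)
            if (seen[i] != 1)
                problems.Add($"suffix {i} was seen {seen[i]} times");
        return problems;
    }
}
```

Because the path label is rebuilt as a string at every node, the check itself takes quadratic time and memory in the worst case, which should be acceptable for short test inputs but is not meant for large ones.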

0 Answers