8

I'm looking for a short, simple suffix tree building/usage algorithm in Java. The best I've found so far lies within the Semantic Discovery Toolkit, but the implementation is several thousand lines long and spans several classes. Ideally, the implementation would be as short as possible and span no more than a few hundred lines.

Does anyone have such an implementation?

Stefan Kendall
  • No, but I wrote one in Ruby a while back. You should probably just write it yourself if you want a short implementation... char[] c = string.toCharArray(); for(int i=c.length-1; i>=0; i--) recurse(c[i])... – twolfe18 Jan 11 '10 at 15:47
  • Post it as an answer so I can upvote it. I just need something that fits on a sheet of paper that I can reference easily. Shortly, I will need to be able to produce a number of algorithms with minimal documentation, so short implementations are good implementations. – Stefan Kendall Jan 11 '10 at 22:36

3 Answers

5

I just finished a Java implementation of a suffix tree. In my blog entry you can find out more about suffix trees, see how to use my library, as well as download and build the library using Subversion and Maven. Yes, it's longer than just a few lines in a single class file, but it is highly documented and is created for use in the real world for practical purposes. In addition, it uses the Ukkonen approach for linear time construction. (Most of the implementations noted here have at least O(n^2) running time.)

Garret Wilson
  • +1 Although the OP did not specify scalability/performance as criteria, those nearly always matter to me; therefore, it is important to get linear time, and thus Ukkonen's approach. When including those criteria, this is a quality answer. – WestCoastProjects Sep 08 '13 at 18:54
1

The article "Simple Linear Work Suffix Array Construction", by Kärkkäinen and Sanders, ends with 50 lines of C++. You will probably also want something to produce the LCP array. Googling for "Computing the LCP array in linear time, given S and the suffix array POS." should find you that.
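To give a feel for the suffix array plus LCP array combination, here is a minimal Java sketch. It is not the linear-work construction from the article: the suffix array is built by naively sorting suffixes (roughly O(n^2 log n) in the worst case), and the LCP array uses the linear-time method usually attributed to Kasai et al., which may or may not be the exact reference the search phrase above leads to. The class and method names (`SuffixArrayLcp`, `suffixArray`, `lcpArray`) are made up for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

class SuffixArrayLcp {
    /** Naive suffix array: sort start positions by comparing the suffixes themselves. */
    static int[] suffixArray(String s) {
        Integer[] order = new Integer[s.length()];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Assumption: simplicity over speed; this is NOT the linear-work algorithm.
        Arrays.sort(order, Comparator.comparing((Integer i) -> s.substring(i)));
        int[] sa = new int[order.length];
        for (int i = 0; i < sa.length; i++) sa[i] = order[i];
        return sa;
    }

    /**
     * Kasai-style LCP: lcp[k] is the length of the longest common prefix of the
     * suffixes starting at sa[k - 1] and sa[k]; lcp[0] is 0 by convention.
     */
    static int[] lcpArray(String s, int[] sa) {
        int n = s.length();
        int[] rank = new int[n];            // rank[i] = position of suffix i in sa
        for (int k = 0; k < n; k++) rank[sa[k]] = k;
        int[] lcp = new int[n];
        int h = 0;                          // current match length, carried between iterations
        for (int i = 0; i < n; i++) {
            if (rank[i] == 0) { h = 0; continue; }
            int j = sa[rank[i] - 1];        // suffix that precedes suffix i in sorted order
            while (i + h < n && j + h < n && s.charAt(i + h) == s.charAt(j + h)) h++;
            lcp[rank[i]] = h;
            if (h > 0) h--;                 // the next suffix shares at least h - 1 characters
        }
        return lcp;
    }
}
```

For "banana", the suffix array is [5, 3, 1, 0, 4, 2] and the LCP array is [0, 1, 3, 0, 0, 2]. Swapping in a faster suffix array construction later would leave `lcpArray` unchanged, since it only needs the text and the finished array.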

mcdowella
0

You can also take mine, but it is not Ukkonen's algorithm: like all the other simple approaches, it runs in quadratic time. I agree that a naive algorithm (which may work fine for shorter sequences) is easy to write in half a day at most.
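For reference, a naive quadratic approach can fit on a single page. The sketch below is not the implementation mentioned above, and strictly speaking it builds an uncompressed suffix trie (one node per character) rather than a true suffix tree with compressed edges. The class name `NaiveSuffixTrie` and its `contains` method are hypothetical names chosen for this example. Construction takes O(n^2) time and space, so it is only reasonable for short inputs.

```java
import java.util.HashMap;
import java.util.Map;

class NaiveSuffixTrie {
    /** One trie node; each outgoing edge is labelled with a single character. */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
    }

    private final Node root = new Node();

    /** Inserts every suffix of the text, character by character: O(n^2) time and space. */
    NaiveSuffixTrie(String text) {
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.children.computeIfAbsent(text.charAt(i), c -> new Node());
            }
        }
    }

    /** Returns true if the pattern occurs anywhere in the text (every substring is a prefix of some suffix). */
    boolean contains(String pattern) {
        Node node = root;
        for (int i = 0; i < pattern.length(); i++) {
            node = node.children.get(pattern.charAt(i));
            if (node == null) return false;
        }
        return true;
    }
}
```

With `new NaiveSuffixTrie("banana")`, `contains("nan")` returns true and `contains("nab")` returns false. Collapsing single-child chains into edges labelled with substrings would turn this into a proper suffix tree, at the cost of more bookkeeping.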