
I tried to construct a suffix tree in Java based on Mark Nelson’s implementation of Ukkonen’s algorithm, using a variant of the code at: http://www.sanfoundry.com/java-program-implement-suffix-tree/

The following code constructs a compact suffix tree (compressed suffix trie) by scanning a text file that contains the word "minimize" split across three lines like this:

min
im
ize

The suffix tree uses edge-label compression, as in Ukkonen's algorithm: each edge stores a pair of indices into a single character array instead of a copy of its label, so every suffix can be referenced by index into that one array.

The code also prints out all the contents and details of the suffix tree as follows:

Start  End  Suf  First Last  String

  0    10  -1     7      7   e
  0     4   0     1      1   i
  0     6   4     0      1   mi
  0     3  -1     2      7   nimize
  0     9  -1     6      7   ze
  4     5  -1     4      7   mize
  4     2  -1     2      7   nimize
  4     8  -1     6      7   ze
  6     1  -1     2      7   nimize
  6     7  -1     6      7   ze
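To make the First and Last columns concrete, here is a minimal sketch of how an edge label can be recovered from the text. It assumes First and Last are inclusive indices into the concatenated text "minimize", which is what the rows above suggest; `EdgeLabelDemo` and `label` are hypothetical names for illustration and are not part of Nelson's code.

```java
final class EdgeLabelDemo {
    // Hypothetical helper: decode an edge label from inclusive indices into the text.
    static String label(String text, int first, int last) {
        return text.substring(first, last + 1);
    }

    public static void main(String[] args) {
        String text = "minimize"; // the three file lines joined together
        System.out.println(label(text, 0, 1)); // row with First=0, Last=1
        System.out.println(label(text, 2, 7)); // row with First=2, Last=7
        System.out.println(label(text, 6, 7)); // row with First=6, Last=7
    }
}
```

Under that assumption the three calls give "mi", "nimize" and "ze", matching the String column of the corresponding rows.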

The constructor I used is shown below. It replaces the constructor in the Java version of Mark Nelson's implementation of Ukkonen's algorithm linked above; the rest of that code is unchanged:

    // Needs java.io.File, java.io.FileNotFoundException and java.util.Scanner.
    // Create a compact (compressed) suffix trie from the text in file f + ".txt"
    public CompressedSuffixTrie(String f)
    {
        Edges = new Edge[ HASH_TABLE_SIZE ];
        for (int i = 0; i < HASH_TABLE_SIZE; i++)
            Edges[i] = new Edge();
        Nodes = new Node[ MAX_LENGTH * 2 ];
        for (int i = 0; i < MAX_LENGTH * 2; i++)
            Nodes[i] = new Node();
        active = new Suffix( 0, 0, -1 );

        // Read the file line by line and join the lines, dropping the line breaks.
        // try-with-resources closes the Scanner even if reading fails.
        StringBuilder sb = new StringBuilder();
        try (Scanner s = new Scanner(new File(f + ".txt"))) {
            while (s.hasNextLine()) {
                sb.append(s.nextLine());
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
            return; // no input, so no tree is built (same as before)
        }

        /** Construct Suffix Tree **/
        this.T = sb.toString().toCharArray();
        this.N = this.T.length - 1; // index of the last character

        // Add one prefix at a time, then print all edges
        for (int i = 0; i <= this.N; i++)
            this.AddPrefix( this.active, i );
        this.dump_edges( this.N );
    }

The code seems to work correctly, but I want to find the first occurrence of a pattern s in the text. That is, if s appears in the suffix tree, findString(s) should return the starting index of the first occurrence of s; otherwise it should return -1.

Is there a way to do this so that findString(s) runs in O(|s|) time, where |s| is the length of s?

iteong

1 Answer


If you mean parsing the printed output of the suffix tree, then this regex should match each row that doesn't have -1 under the Suf column:

(\n?)\s+\d+\s+\d+\s+(?!-1)([\d-]+)\s+\d+\s+\d+\s+(\w+)(\n?)

Group 2 captures the value in the Suf column and group 3 captures the string.

If you don't mean parsing the printed output, then the question is not really about regex or pattern matching; it is about how to walk your trie.
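A minimal sketch of applying that regex in Java with `java.util.regex`, in case it helps. The class name `DumpRegexDemo`, the method `matches`, and the "suf:string" result format are made up for illustration; the sample rows are copied from the printed output in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DumpRegexDemo {
    // The regex from above, with backslashes escaped for a Java string literal.
    private static final Pattern ROW = Pattern.compile(
        "(\\n?)\\s+\\d+\\s+\\d+\\s+(?!-1)([\\d-]+)\\s+\\d+\\s+\\d+\\s+(\\w+)(\\n?)");

    // Returns "suf:string" for every row whose Suf column is not -1.
    static List<String> matches(String dump) {
        List<String> out = new ArrayList<>();
        Matcher m = ROW.matcher(dump);
        while (m.find()) {
            out.add(m.group(2) + ":" + m.group(3));
        }
        return out;
    }

    public static void main(String[] args) {
        String dump =
            "  0    10  -1     7      7   e\n" +
            "  0     4   0     1      1   i\n" +
            "  0     6   4     0      1   mi\n" +
            "  0     3  -1     2      7   nimize\n";
        System.out.println(matches(dump));
    }
}
```

Note that each row has to start with whitespace for the pattern to match, as the printed rows do.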
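As a rough illustration of what walking the trie for findString(s) could look like, here is a self-contained sketch. It is not Nelson's code and not Ukkonen's construction: it builds a naive, uncompressed suffix trie in O(n^2) time as a stand-in, and every name in it (`FirstOccurrenceTrie`, `Node`, `firstStart`) is hypothetical. The idea it shows is that each node remembers the smallest start position of any suffix passing through it, so the lookup itself only follows one child per pattern character.

```java
import java.util.HashMap;
import java.util.Map;

final class FirstOccurrenceTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        final int firstStart; // start of the first suffix that created this node

        Node(int firstStart) { this.firstStart = firstStart; }
    }

    private final Node root = new Node(0);

    // Naive construction: insert every suffix, left to right. Because suffixes
    // are inserted in increasing start order, the suffix that creates a node
    // is the earliest one whose path goes through it.
    FirstOccurrenceTrie(String text) {
        for (int start = 0; start < text.length(); start++) {
            final int suffixStart = start;
            Node node = root;
            for (int j = start; j < text.length(); j++) {
                node = node.children.computeIfAbsent(
                    text.charAt(j), c -> new Node(suffixStart));
            }
        }
    }

    // Walks one child per pattern character: O(|s|) expected time with hashing.
    // Returns the start index of the first occurrence of s, or -1 if absent.
    int findString(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.get(s.charAt(i));
            if (node == null) return -1;
        }
        return node.firstStart;
    }

    public static void main(String[] args) {
        FirstOccurrenceTrie trie = new FirstOccurrenceTrie("minimize");
        System.out.println(trie.findString("mi"));  // first "mi" starts at 0
        System.out.println(trie.findString("im"));  // "im" starts at 3
        System.out.println(trie.findString("xyz")); // not present: -1
    }
}
```

In a compressed tree the same walk would compare the pattern against edge labels rather than single-character children, and each node would need an equivalent of `firstStart` recorded during or after construction; how best to do that with the Edge and Node arrays in your code is something I have not worked out here.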

VolatileRig