I'm working with suffix trees. As far as I can tell, I have Ukkonen's algorithm running correctly to build a generalised suffix tree from an arbitrary number of strings. I'm now trying to implement a find_longest_common_substring()
method to do exactly that. For this to work, I understand that I need to find the deepest shared edge (with depth in terms of characters, rather than edges) between all strings in the tree, and I've been struggling for a few days to get the traversal right.
Right now I have the following in C++. I'll spare you all my code, but for context, I'm keeping the edges of each node in an unordered_map called outgoing_edges
, and each edge has a vector of ints recorded_strings
containing integers identifying the added strings. The child
field of an edge is the node it is going to, and l
and r
identify its left and rightmost indices, respectively. Finally, current_string_number
is the current number of strings in the tree.
SuffixTree::Edge * SuffixTree::find_deepest_shared_edge(SuffixTree::Node * start, int current_length, int &longest) {
Edge * deepest_shared_edge = new Edge;
auto it = start->outgoing_edges.begin();
while (it != start->outgoing_edges.end()) {
if (it->second->recorded_strings.size() == current_string_number + 1) {
int edge_length = it->second->r - it->second->l + 1;
int path_length = current_length + edge_length;
find_deepest_shared_edge(it->second->child, path_length, longest);
if (path_length > longest) {
longest = path_length;
deepest_shared_edge = it->second;
}
}
it++;
}
return deepest_shared_edge;
}
When trying to debug, as best I can tell, the traversal runs mostly fine, and correctly records the path length and sets longest. However, for reasons I don't quite understand, in the innermost conditional, deepest_shared_edge
sometimes seems to get updated to a mistaken edge. I suspect I maybe don't quite understand how it->second
is updated throughout the recursion. Yet I'm not quite sure how to go about fixing this.
I'm aware of this similar question, but the approach seems sufficiently different that I'm not quite sure how it applies here.
I'm mainly during this for fun and learning, so I don't necessarily need working code to replace the above - pseudocode or just any explanation of where I'm confused would be just as well.