
Does anyone know the reason for the statement below? Or is there a better website to ask this type of question? Any pointer would be appreciated.

If a pattern occurs k times in a text of length n, then finding all k occurrences using the suffix tree of that text costs O(n+k).

jogojapan
user685275

2 Answers


The time for a suffix tree search is proportional to the length of the pattern you are searching for. If you build a suffix tree for Mississippi and search for ssi, three character lookups have to be performed. The time is O(n), where n is the length of the pattern.

Justin Thomas
  • I know that. But if I want to look for all occurrences of `ssi`, the time becomes O(n+2), since there are k=2 occurrences of `ssi` in `Mississippi`. Does anyone know the reason? By the way, n here is the length of the text, not the pattern. – user685275 Apr 28 '11 at 15:01
  • I don't think so; all you are doing is trying to find out whether ssi exists. If you want to know how many times ssi occurs, it sounds more like an inverted index problem. You don't have to store two separate branches for ssi in a suffix tree. You could keep a list of indexes off the nodes of the branch. Maybe that is where the +2 comes in. – Justin Thomas Apr 28 '11 at 15:12

Depending on where you found this statement, there may be specific reasons why it is true in the context.

However, the usual reason for the '+k' is simply that it takes O(k) extra operations to insert each of the matches you found into the result list returned to the user. This is not necessarily the case when an inverted file is used instead of a suffix tree, because then the inverted list (aka postings list) found in the index is already the final results list (at least if we assume that (a) the query consists of a single token only, and (b) the inverted list is stored uncompressed).
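To make the contrast concrete, here is a minimal sketch of an uncompressed inverted file for single-token queries. The names `build_inverted_index` and `lookup` are made up for illustration and do not come from any particular library. The point is only that the stored postings list is handed back as it is, with no per-match work at query time.

```python
from collections import defaultdict


def build_inverted_index(tokens):
    """Map each token to the list of positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token].append(position)
    return index


def lookup(index, token):
    """Return the stored postings list directly (empty list if absent)."""
    return index.get(token, [])


index = build_inverted_index("to be or not to be".split())
print(lookup(index, "be"))
```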

But a suffix tree usually (unless it is specially prepared) does not contain such match lists. Hence during matching you identify a path through the tree, ending at some internal node. From there, you must follow all paths in the subtree of that internal node to identify the leaf nodes that tell you the actual positions of the matches (one leaf node per match), and insert the match positions in the results list that you return to the user. This final step is what takes O(k) time.
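The two phases described above can be sketched as follows. This is only an illustration under simplifying assumptions: it uses a naive, uncompressed suffix trie built in O(n^2) time as a stand-in for a real suffix tree, it assumes the terminator `$` does not occur in the text, and the names `Node`, `build_suffix_trie` and `find_all` are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}   # character -> child Node
        self.start = None    # set only on leaves: start position of the suffix


def build_suffix_trie(text, terminator="$"):
    """Insert every suffix of text + terminator, one character per edge."""
    text += terminator
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
        node.start = start   # the unique terminator guarantees this is a leaf
    return root


def find_all(root, pattern):
    # Phase 1: follow the pattern from the root, one step per pattern character.
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []        # the pattern does not occur in the text
    # Phase 2: visit the subtree below the match and report one position per leaf.
    matches = []
    stack = [node]
    while stack:
        current = stack.pop()
        if current.start is not None:
            matches.append(current.start)
        stack.extend(current.children.values())
    return matches


trie = build_suffix_trie("Mississippi")
print(sorted(find_all(trie, "ssi")))  # [2, 5]
```

In this uncompressed trie the second phase visits every node below the match, which can be many more than k. In a compressed suffix tree every internal node has at least two children, so a subtree with k leaves has fewer than k internal nodes.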

Also note that following all paths in the subtree of the internal node you found can take significant extra time, in which case the total complexity is even higher than O(n+k). That depends on whether or not there are any direct pointers from internal nodes to the leaf nodes in their subtrees.
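One possible way to give internal nodes direct access to their leaves is to store all leaf positions in a single array, ordered by a depth-first walk, and let every node remember the slice of that array that belongs to its subtree. The sketch below assumes this layout. It again uses a naive uncompressed trie for brevity, and the names `build_index`, `lo` and `hi` are made up for this example rather than taken from any specific suffix tree implementation.

```python
class Node:
    def __init__(self):
        self.children = {}   # character -> child Node
        self.start = None    # set only on leaves
        self.lo = 0          # this subtree's leaves are leaves[lo:hi]
        self.hi = 0


def build_index(text, terminator="$"):
    """Build the trie, then give each node a range into a shared leaf array."""
    text += terminator
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
        node.start = start

    leaves = []
    stack = [(root, False)]
    while stack:                       # iterative depth-first walk
        node, done = stack.pop()
        if done:                       # all descendants have been handled
            node.hi = len(leaves)
            continue
        node.lo = len(leaves)
        if node.start is not None:
            leaves.append(node.start)
        stack.append((node, True))     # revisit after the children
        for ch in sorted(node.children, reverse=True):
            stack.append((node.children[ch], False))
    return root, leaves


def find_all(root, leaves, pattern):
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []
    return leaves[node.lo:node.hi]     # O(k) copy, no subtree traversal


root, leaves = build_index("Mississippi")
print(sorted(find_all(root, leaves, "ssi")))  # [2, 5]
```

With this layout, reporting the matches is a single slice of length k, regardless of how many nodes lie between the matched node and its leaves.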

jogojapan