Substring search on HashMap Values

Question

Given a HashMap, I want to retrieve all the entries e whose value contains a given substring s (case-insensitive). I am looking for substring-index ideas along the lines of suffix trees (tries), which seem to be suited only to prefix/suffix matches.

Are you using Java? Is the set of substring keys already known, or will it be dynamic? — Tim Biegeleisen, Jun 28 '16 at 00:26
@TimBiegeleisen I am using C#. Yes, the keys will be dynamic. — Aswin Siva N, Jun 28 '16 at 00:58
@AbdullahTellioglu - yes, I am currently doing a complete iteration, which is inefficient. — Aswin Siva N, Jun 28 '16 at 00:58
@AswinSivaN if you do not have to do it with a hashmap, try using a search tree. It has O(log n) complexity on average, which is okay. — Abdullah Tellioglu, Jun 28 '16 at 01:11
If you look up parts of the values very often, you might consider using the values you are looking up as keys. Maybe you could build a different hashmap with the looked-up value substrings as keys. That way, you don't need to iterate over all the values. — ckruczek, Jun 28 '16 at 04:55

Rerito · Answer 1 · 2016-07-06T22:16:41.137

A solution based on Generalized Suffix Trees

Suffix trees are not only suited for suffix matching. Here is what you can do:

1. Build a generalized suffix tree from every value in your hashtable. To ignore case, convert all the strings to one arbitrary case first. During construction, label each leaf with the set of strings that share it (for example, the strings hazelnut and coconut share the leaves representing nut, ut and t).
2. Starting from the root:
  - Walk down the tree with the substring s (converted to the case chosen in the first step). You end up either in an implicit state (i.e. in the middle of an edge) or in an explicit state (at a node N). If the walk cannot be completed, no string contains s.
  - If you are in an implicit state, take the destination node of the edge you are on and call that node N.
3. Compute the union of the string sets of all the leaves reachable from N: this gives a set of strings S.
4. S is the set of all the strings in your hashtable that contain the substring s.
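
As a rough illustration of the steps above, here is a minimal Python sketch (Python rather than C#, purely for brevity). It is not a true generalized suffix tree: it swaps in an uncompressed suffix trie, which is far simpler to write but costs O(∑ L_i²) time and space to build. Because the trie is uncompressed, every state is explicit, so the implicit-state step disappears. The names Node, build_suffix_trie and find_keys, and the sample table, are invented for this example.

```python
class Node:
    """One trie node: outgoing edges plus the keys whose value has a suffix ending here."""
    __slots__ = ("children", "owners")

    def __init__(self):
        self.children = {}   # char -> Node
        self.owners = set()  # keys of the entries with a suffix ending at this node


def build_suffix_trie(table):
    """Insert every suffix of every lower-cased value (assumption: naive quadratic build)."""
    root = Node()
    for key, value in table.items():
        text = value.lower()  # one arbitrary case, so that matching ignores case
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node.children.setdefault(ch, Node())
            node.owners.add(key)  # label the end of this suffix with its owner
    return root


def find_keys(root, s):
    """Return the keys whose value contains s, ignoring case."""
    node = root
    for ch in s.lower():  # walk down the trie with s
        node = node.children.get(ch)
        if node is None:
            return set()  # the walk fell off the trie: no value contains s
    found = set()
    stack = [node]
    while stack:  # union of the labels in the subtree rooted at N
        current = stack.pop()
        found |= current.owners
        stack.extend(current.children.values())
    return found


table = {"k1": "Hazelnut", "k2": "Coconut", "k3": "Almond"}
root = build_suffix_trie(table)
print(sorted(find_keys(root, "NUT")))  # ['k1', 'k2']
```

A production version would use a linear-time suffix tree construction and compressed edges; the query logic (walk down, then take the union over the subtree) stays the same.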

Complexity analysis

Let K be the number of strings in your table. Let L_i be the length of the string S_i, and let L = ∑ L_i. With a linear-time construction (Ukkonen's algorithm, for instance, assuming a constant-size alphabet), building the tree is O(L).

Walking down to the node N is O(length(s)).

Now the trickier part begins. Listing all the leaves reachable from a node is not linear, but it is not too much of a hassle.

Let L_max = max(L_i). You can reach each leaf by walking at most L_max nodes; more precisely, if you start from the node N defined above, you reach each leaf below N in at most L(s) = L_max - length(s) steps.

The subtree rooted at N also has the structure of a generalized suffix tree. It represents at most K strings of length at most L(s). Each of these strings contributes at most L(s) leaves, so the subtree has at most K.L(s) leaves, each reached in at most L(s) steps. Iterating over them is therefore at most O(K.L(s)²).

Computing the union of the sets of strings stored in these leaves is then at most O([K.L(s)]²). (In reality it will be much closer to O(K.L(s)²), because if each leaf has all the original K strings in its set, then there are only L(s) leaves in the subtree rooted at N.)

This leads to a total worst case complexity of:

O(L + length(s) + [K.L(s)]²)

But real usage complexity will be much closer to:

O(L + length(s) + K.[L(s)]²)

The standard method (iterating over each string and searching for s in it with a linear-time matcher such as KMP) is:

O(∑ (L_i + length(s))) = O(L + K.length(s))

But wait... We always look for the same substring s! So the preprocessing step of the KMP algorithm only needs to be done once. This reduces the complexity of this approach to:

O(L + length(s))

However, to benefit from this optimization, you would have to write the search yourself instead of using a standard library implementation.
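
A minimal Python sketch of that idea, assuming a hand-written KMP whose failure table is built once and then reused for every value. The function names build_failure, kmp_contains and entries_containing are invented for this example.

```python
def build_failure(pattern):
    """KMP preprocessing: fail[i] is the length of the longest proper prefix
    of pattern[:i + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_contains(text, pattern, fail):
    """Return True if pattern occurs in text, using a precomputed failure table."""
    if not pattern:
        return True  # the empty string is a substring of everything
    k = 0
    for ch in text:
        while k and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
            if k == len(pattern):
                return True
    return False


def entries_containing(table, s):
    """Scan every value once; the O(length(s)) preprocessing is paid a single time."""
    pattern = s.lower()            # one arbitrary case, so that matching ignores case
    fail = build_failure(pattern)  # done once, shared by all K scans
    return {key: value for key, value in table.items()
            if kmp_contains(value.lower(), pattern, fail)}


table = {"k1": "Hazelnut", "k2": "Coconut", "k3": "Almond"}
print(entries_containing(table, "NUT"))  # {'k1': 'Hazelnut', 'k2': 'Coconut'}
```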

Conclusion

If you only need to test a single string s against your map, the naive solution is simple to implement and simple to understand, and its overall complexity is not only better than that of the suffix tree based approach, it is optimal. So you can confidently stick to it.
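
For reference, the naive scan fits in a few lines. This Python sketch relies on the built-in `in` operator for the substring test, whose internal algorithm is an implementation detail and not necessarily KMP. The function name scan_entries is invented for this example.

```python
def scan_entries(table, s):
    """Return the entries whose value contains s, ignoring case (full scan)."""
    needle = s.lower()  # one arbitrary case, so that matching ignores case
    return {key: value for key, value in table.items()
            if needle in value.lower()}


table = {"k1": "Hazelnut", "k2": "Coconut", "k3": "Almond"}
print(scan_entries(table, "NUT"))  # {'k1': 'Hazelnut', 'k2': 'Coconut'}
```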

However, if you have to test a large number K_s of strings s_j, then the suffix tree approach can be better because its overall complexity will be at most:

O(L + K_s.(max(length(s_j)) + [K.max(L(s_j))]²))

Whereas the KMP approach will lead to a total complexity of:

O(K_s.L + ∑ (length(s_j))) = O(K_s.[L + max(length(s_j))])

Please also note that a suffix tree is a tree-like structure: if it is not cleverly designed, memory accesses and allocations will come into play and can seriously harm the run time.

I can work up an example (in C++ but still it would make the point) if you wish.
