Given a string I have to test whether that ends with a known set of suffixes. Now as the the of suffixes are not very small and every word in the document has to be checked against that list of known suffixes. Every character in the word and suffix is char32_t
. As a naive iterative matching will be expensive. Though most of the suffixes are not sub suffix or prefix of another suffix, most of them are constructed with a small set of characters. Most of the checks will be a miss rather than being a hit.
So I want to build a DFA
of the suffixes to minimize the cost of miss. I can manually parse the unicode codepoints and create a DFA using boost-graph
. But is there any existing library that will build that for me ?
Will a huge regex containing all the suffixes be less expensive than the DFA, because regex search also builds a DFA for matching in the similar way ? But I want to know which suffix was matched when there is a hit. In regex case I'll need to perform another linear search to get that (I cannot label the vertices of the regex's internal DFA). Also I need unicode
regex. Just putting all the suffixes with a |
will be as expensive as a linear search I guess. I think I need to check the common characters and create the regex accordingly with lookahed and lookbacks. Isn't it the same difficulty that I need to face to construct the DFA manually ?
I am using utf-32
for random access. However it is not a problem to switch to utf-8 if I can solve it easily with that. I will reverse the string and the pattern right to left.