boost string matching DFA

Question

Given a string I have to test whether that ends with a known set of suffixes. Now as the the of suffixes are not very small and every word in the document has to be checked against that list of known suffixes. Every character in the word and suffix is char32_t. As a naive iterative matching will be expensive. Though most of the suffixes are not sub suffix or prefix of another suffix, most of them are constructed with a small set of characters. Most of the checks will be a miss rather than being a hit.

So I want to build a DFA of the suffixes to minimize the cost of miss. I can manually parse the unicode codepoints and create a DFA using boost-graph. But is there any existing library that will build that for me ?

Will a huge regex containing all the suffixes be less expensive than the DFA, because regex search also builds a DFA for matching in the similar way ? But I want to know which suffix was matched when there is a hit. In regex case I'll need to perform another linear search to get that (I cannot label the vertices of the regex's internal DFA). Also I need unicode regex. Just putting all the suffixes with a | will be as expensive as a linear search I guess. I think I need to check the common characters and create the regex accordingly with lookahed and lookbacks. Isn't it the same difficulty that I need to face to construct the DFA manually ?

I am using utf-32 for random access. However it is not a problem to switch to utf-8 if I can solve it easily with that. I will reverse the string and the pattern right to left.

"Just putting all the suffixes with a | will be as expensive as a linear search I guess" - nah regexes usually compile pretty optimizingly — sehe, Jan 07 '18 at 00:26

sehe · Accepted Answer · 2018-01-07T00:24:53.347

1

Have you considered Spirit? Of course you didn't specify how you detect suffixes in context (do you require them at the end, do you require some grammar preceding it etc.) but you could do something like this:

    x3::symbols<Char> sym;
    sym += "foo", "bar", "qux";

It builds a Trie, which is pretty effective. It can parse any kind of input iterator (including streams if you are so inclined). Just add a bit of magic constrain for contextual requirements, like end-of-input:

bool has_suffix(string_view sv) {
    return parse(sv.cbegin(), sv.cend(), x3::seek[suffix >> x3::eoi]);
}

If you even wish to return the text value of the string, simply do this:

string_view get_suffix(string_view sv) {
    boost::iterator_range<string_view::const_iterator> output;
    parse(sv.cbegin(), sv.cend(), x3::seek[x3::raw[suffix >> x3::eoi]], output);
    return {output.begin(), output.size()};
}

Spirit leaves you a lot of freedom to surround with smarts, dynamically add/remove symbols, e.g. use no_case with the Trie etc.

Full Demo

Using X3 (c++14)

Live On Coliru

#include <boost/spirit/home/x3.hpp>
#include <string_view>
#include <cstdint>

namespace Demo {
    using Char = char32_t;
    using string_view = std::basic_string_view<Char>;

    namespace x3 = boost::spirit::x3;

    static auto const suffix = [] {
        x3::symbols<Char> sym;
        sym += "foo", "bar", "qux";

        return sym; // x3::no_case[sym];
    }();

    bool has_suffix(string_view sv) {
        return parse(sv.cbegin(), sv.cend(), x3::seek[suffix >> x3::eoi]);
    }

    string_view get_suffix(string_view sv) {
        boost::iterator_range<string_view::const_iterator> output;
        parse(sv.cbegin(), sv.cend(), x3::seek[x3::raw[suffix >> x3::eoi]], output);
        return {output.begin(), output.size()};
    }
}

#include <iostream>
#include <iomanip>

int main() {
    using namespace Demo;

    auto widen = [](string_view sv) { return std::wstring(sv.begin(), sv.end()); };
    std::wcout << std::boolalpha;

    for (string_view testcase : { U"nope", U"lolbar you betqux" }) {
        std::wcout 
            << widen(testcase) 
            << L" -> " << has_suffix(testcase)
            << L" (" << widen(get_suffix(testcase))
            << L")\n";
    }
}

Prints

nope -> false ()
lolbar you betqux -> true (qux)

Spirit Qi Version

A literal port: Live On Coliru

A C++11 only version: Live On Coliru

And a C++03 version for the really retro programming experience: Live On Coliru

edited Jan 07 '18 at 00:24

answered Jan 06 '18 at 23:56

sehe

374,641
47
450
633

Added Spirit Qi versions in [c++17](http://coliru.stacked-crooked.com/a/e348c8f745633c49), [c++11](http://coliru.stacked-crooked.com/a/f7a8cd49c241808b) and [c++03](http://coliru.stacked-crooked.com/a/5bfda96e81efc342) – sehe Jan 07 '18 at 00:25
Hah! I read your question better and noticed _"I will reverse the string and the pattern right to left"_ - if that's the idea, you can use my approach with reverse-iterators without any seeking: **[live in c++17](http://coliru.stacked-crooked.com/a/3242bba6285de61e)** and [**live in c++03](http://coliru.stacked-crooked.com/a/d0e3f9c48d2259cb)** – sehe Jan 07 '18 at 00:37
Given 5 suffixes `"ol", "mol","xol","ptol", "ku"` and a string `"kumol"` I expect it to match `l <> {l | u}` first, If that matches with `u` then check `o <> {k}` otherwise check. `o <> {o}` like in syntax tree. But the code in your answer looks like it does a linear scan on the list of suffixes. With spirit I can create such a tree with codepoints, but that will be hectic manual work, and hard to change in future. If I have just the list of suffixes, can spirit generate the tree as mentioned ? – Neel Basu Jan 07 '18 at 08:13
@NeelBasu Google Trie. (It takes considerably less time than writing that comment, I think). It does exactly what you describe when using the reverse iterator version. – sehe Jan 07 '18 at 11:35
:yes I know trie but how to build that try automatically from the suffixes ? what library function to use ? would regex build the trie for me even if I provide suffixes ORed `ol|mol|xol|ptol|ku`like this ? – Neel Basu Jan 07 '18 at 20:13
The code is there... (6 variations, live, even)? I think I don't understand your question. You just feed your suffixes, and when you're done the `symbols` contains the Trie. – sehe Jan 07 '18 at 20:16
1

Oh! I didn't understand that the `symbols` contain the Trie. I thought `foo`, `bar`, `qux` doesn't have a common sub suffix in the answer, which is Intentional. I thought Even if it is `ol|mol` it will scan linearly and check all characters instead of constructing a trie. Thank You. – Neel Basu Jan 09 '18 at 11:14
1

Yeah, it would work fine with `afoo`, `afbar`, afbqux`, `qux`, `quxfoo` just fine – sehe Jan 09 '18 at 12:01
Hm, I didn't know that, also I didn't find that in documentation. – Neel Basu Jan 09 '18 at 12:32
1

Documentation doesn't say much about implementation details. However the functionality is promised. (The hint in http://www.boost.org/doc/libs/1_66_0/libs/spirit/doc/html/spirit/qi/reference/string/symbols.html is merely the `tst` namespace) – sehe Jan 09 '18 at 12:32

boost string matching DFA

1 Answers1

Full Demo

Spirit Qi Version

Linked