Using Boost Xpressive (static expressions), I noticed that pattern searching is much slower when the expression is built from sub-regexes.
Did I miss something, or is this inherent to the design? The Xpressive docs (https://www.boost.org/doc/libs/1_80_0/doc/html/xpressive/user_s_guide.html#boost_xpressive.user_s_guide.grammars_and_nested_matches.embedding_a_regex_by_value) say:
it is as if the regex were embedded by value; that is, a copy of the nested regex is stored by the enclosing regex. The inner regex is invoked by the outer regex during pattern matching. The inner regex participates fully in the match, back-tracking as needed to make the match succeed.
Consider these two ways of defining a regex that matches a URI (probably sub-optimal and not 100% correct, but that is not the point here).
If the expression is defined in one go, execution is around 6x faster than if the same regex is built from 3 sub-regexes.
Consider this code snippet:
#include <iostream>
#include <string>
#include <chrono>
#include <boost/xpressive/xpressive.hpp>

using namespace boost::xpressive;

// Runs the given regex over one matching and one non-matching input,
// then prints the number of matches and the elapsed time.
void bench_regex(const sregex& regex)
{
    std::string positive = "asdas http://www.foo.io/bar asdp https://www.bar.io ";
    std::string negative = "sdaoas dof jdfjo fds dsf http:/www.nonono .sa ";
    const int nb_iterations = 100'000;
    int nmatch = 0;
    smatch what;

    std::chrono::steady_clock::time_point begin0 = std::chrono::steady_clock::now();
    for (int i = 0; i < nb_iterations; ++i)
    {
        if (regex_search(positive, what, regex))
            nmatch++;
        if (regex_search(negative, what, regex))
            nmatch++;
    }
    std::chrono::steady_clock::time_point end0 = std::chrono::steady_clock::now();

    std::cout << "nb matches " << nmatch << std::endl;
    std::cout << "search time " << std::chrono::duration_cast<std::chrono::microseconds>(end0 - begin0).count() / 1000.0f << "ms" << std::endl << std::endl;
}

int main()
{
    {
        // Variant 1: the whole pattern written as a single static expression.
        std::cout << "regex in one piece" << std::endl;
        const sregex regex_uri_standalone = alpha >> *alnum >> "://" >> +~(set = ' ', '/') >> !(*('/' >> ~(set = ' ')));
        bench_regex(regex_uri_standalone);
    }
    {
        // Variant 2: the same pattern assembled from three sregex objects.
        std::cout << "regex built from parts" << std::endl;
        const sregex scheme = alpha >> *alnum;
        const sregex hostname = +~(set = ' ', '/');
        const sregex path = !(*('/' >> ~(set = ' ')));
        const sregex regex_uri_built_from_subregex = scheme >> "://" >> hostname >> path;
        bench_regex(regex_uri_built_from_subregex);
    }
}
This is particularly annoying because a main strength of Xpressive is the ability to construct complex regexes from simpler ones, something that can quickly become a nightmare with PCRE or equivalent. But if it comes with such a performance cost, the benefit is largely wiped out.
By the way, is the library still maintained? According to the Boost changelog, there has been no change since Boost 1.55 (11 Nov 2013!): https://www.boost.org/users/history/