Is it possible to serialize/deserialize and save/load regular expressions to/from a file?
We have a very time consuming process that constructs some regexes and I'm wondering if we can save some time by saving and loading them.
No, probably not in a way that saves time: the regex classes give you no access to their compiled state, so loading would mean recompiling the regex from its pattern anyway.
However, if you use boost::xpressive, you can build the regex at compile time by writing it as an expression template (a "static regex"). The pattern is then parsed by the C++ compiler, so the run-time cost of parsing the pattern string goes away.
That said, the true cause of your excess time usage is almost certainly that you are using regexes improperly, i.e., through a backtracking regex engine.
RE2 is a traditional automata-based regex engine that does not backtrack; it constructs an NFA or DFA directly. If you are not using backreferences or other "features" that go beyond regular languages, RE2 can be an order of magnitude faster in many corner cases.
If you are using those features, be aware that they will strictly dominate the speed of your matching, and they are almost certainly the primary cause of the slowdown you want to eliminate.
This would be very difficult to achieve with the Boost or standard library regex classes. The problem is the internal structure of those classes.
To illustrate the difficulty, take the sizeof of a regex instance. It is the same whatever the pattern, which tells you the compiled state is not stored inside the object itself.
regex pattern( "^Corvusoft$" );
printf( "%zu\n", sizeof( pattern ) ); // 32 (platform dependent)
regex alternative( "C" );
printf( "%zu\n", sizeof( alternative ) ); // 32
Alternative Solution 1
Create a library that contains the required regular expressions, and either link it at build time or open it dynamically via the dlopen API. You could then use a tool such as prelink to cut the load-time cost, so the expressions are ready in memory when you need them.
Alternative Solution 2
Use the C (POSIX) regex.h API.
You could walk the regex_t POD structure and write its contents to a binary or memory-mapped file. Later you could map these values back onto a new regex_t structure, avoiding recompilation of the regular expression entirely. Be aware that the layout of regex_t, and of anything it points to, is implementation-specific, so this is not portable.
#include <regex.h>
#include <stdio.h>
regex_t pattern;
int result = regcomp( &pattern, "^Corvusoft$", 0 );
if ( result )
{
    fprintf( stderr, "Failed to compile regular expression pattern\n" );
}
/* TODO: walk structure and write to binary/memory mapped file. */
regfree( &pattern ); /* release the compiled pattern when done */
Alternative Solution 3
Follow @Alice's advice and use Boost.Xpressive.
You can serialise a boost::regex:
#include <string>
#include <iostream>
#include <sstream>
#include <boost/regex.hpp>
#include <boost/serialization/serialization.hpp>
#include <boost/serialization/split_free.hpp>
#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/text_oarchive.hpp>
namespace boost
{
namespace serialization
{
template<class Archive, class charT, class traits>
inline void save(Archive & ar,
const boost::basic_regex<charT, traits> & t,
const unsigned int /* file_version */)
{
std::basic_string<charT> str = t.str();
typename boost::basic_regex<charT, traits>::flag_type flags = t.flags();
// typename boost::basic_regex<charT, traits>::locale_type loc = t.getloc();
ar & str;
ar & flags;
// ar & loc;
}
template<class Archive, class charT, class traits>
inline void load(Archive & ar,
boost::basic_regex<charT, traits> & t,
const unsigned int /* file_version */)
{
std::basic_string<charT> str;
typename boost::basic_regex<charT, traits>::flag_type flags;
// typename boost::basic_regex<charT, traits>::locale_type loc;
ar & str;
ar & flags;
// ar & loc;
t.assign(str, flags);
// t.imbue(loc);
}
template<class Archive, class charT, class traits>
inline void serialize(Archive & ar,
boost::basic_regex<charT, traits> & t,
const unsigned int file_version)
{
boost::serialization::split_free(ar, t, file_version);
}
}
}
int main(int argc, char ** argv)
{
std::stringstream os;
{
boost::regex re("<a\\s+href=\"([\\-:\\w\\d\\.\\/]+)\">");
boost::archive::text_oarchive oar(os);
oar & re;
}
os.seekg(0, std::ios_base::beg); // rewind before reading the archive back
boost::regex re;
boost::cmatch matches;
boost::archive::text_iarchive iar(os);
iar & re;
boost::regex_search("<a href=\"https://stackoverflow.com/questions/18752807/save-serialize-boost-or-std-regexes\">example</a>", matches, re);
std::cout << matches[1] << std::endl;
}
But that doesn't mean you'll see any performance gain over reconstructing the regular expression from a string: load() calls assign(), which recompiles the pattern.
Note: I left out the std::locale handling for simplicity.