7

Is it possible to serialize/deserialize and save/load regular expressions to/from a file?

We have a very time consuming process that constructs some regexes and I'm wondering if we can save some time by saving and loading them.

MBZ
  • [This question](http://stackoverflow.com/questions/4499808/how-to-save-serialize-compiled-regular-expression-stdregex-to-a-file) may have some useful pointers (I'm not marking this as a duplicate because the other question is three years old and doesn't really have a satisfactory *answer*). – us2012 Sep 11 '13 at 23:18
  • Thank you, yes. The other question is quite old. – MBZ Sep 11 '13 at 23:25
  • About how many regexes are we talking about here? If you recompile them for every operation, it is logical that this drains performance, but if you compile them once this shouldn't be a problem. (De)serialization can actually be quite a tricky thing; don't expect to implement it in a minute. – Sebastian Hoffmann Jan 15 '14 at 17:11
  • I'm talking about a very large number of them. We only compile them once, but since there are so many of them, it's still a very heavy process. – MBZ Jan 15 '14 at 17:55
  • 1
    since the regex is not meeting your requirement anymore, have you considered other approaches? like a dedicated compiler – zinking Jan 22 '14 at 07:24
  • A dedicated compiler is not a good idea for anything that's not using context-free features; a better idea would be to use a better or more fitting regex engine. – Alice Jan 22 '14 at 16:27

3 Answers

7

No, it is probably not possible because it would require you to recompile the regex anyway.

However, if you use boost::xpressive, you can compile the regex at compile time by constructing it from expression templates. This makes the run-time regex compilation cost go away entirely.

Boost Xpressive

However, the true cause of your excess time usage is almost certainly that you are using regexes improperly, i.e., through a backtracking regex engine.

RE2 is a traditional automata regex engine that does not use backtracking, but instead constructs an NFA or DFA directly. If you are not using backreferences or many non-regular expression based "features", using RE2 will show an order of magnitude increase in speed for many corner cases.

If you are using those features, you should be aware that they will strictly dominate the speed of your matching, and they are almost certainly the primary cause of the slow down you seek to eliminate.

Alice
2

This would be very difficult to achieve with boost/stl regex classes. The problem is the internal structure of said classes.

  1. How are the implementations storing class properties? by address, value?
  2. What additional padding has the compiler introduced, if any?
  3. Are there any platform alignment issues?
  4. etc...

To help illustrate the difficulty of the problem, try to find the sizeof a C++ class instance. It is the same for every pattern, so it says nothing about the compiled state the object refers to.

regex pattern( "^Corvusoft$" );

printf( "%zu\n", sizeof( pattern ) );  //32

regex alternative( "C" );

printf( "%zu\n", sizeof( alternative ) );  //32

Alternative Solution 1

Create a library that contains the required regular expressions, and either link against it at build time or open and load it dynamically via the dlopen API. You would then use tools such as prelink to ensure they are already in memory, pre-compiled.

Alternative Solution 2

Using the C regex.h.

You could walk the regex_t POD structure and write its contents to a binary or memory mapped file. At a later date you could map these data values back onto a new regex_t structure completely avoiding the recompilation of the regular expression.

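#include <regex.h>
#include <stdio.h>

int main( void )
{
   regex_t pattern;

   // regcomp returns 0 on success and a non-zero error code on failure
   int result = regcomp( &pattern, "^Corvusoft$", 0 );

   if ( result )
   {
      fprintf( stderr, "Failed to compile regular expression pattern\n" );
      return 1;
   }

   // TODO: walk structure and write to binary/memory mapped file.

   regfree( &pattern );  // release the memory owned by the compiled pattern
   return 0;
}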

Alternative Solution 3

Follow @Alice's advice and use Boost.Xpressive

Ben Crowhurst
  • Thank you for your answer but I have a hard time understanding your first Solution. What do you mean by "pre-compiled regular expressions"? afaik, both boost and std regex are run time. – MBZ Jan 21 '14 at 16:25
  • I think he's a bit confused; for that solution to work, you'd still need to generate the regular expressions at compile time, which would require either a hand-rolled regular expression or Xpressive. – Alice Jan 22 '14 at 11:15
  • I hope my edit has clarified 'Alternative Solution 1'. – Ben Crowhurst Jan 22 '14 at 16:26
0

You can serialise a boost::regex:

#include <string>
#include <iostream>
#include <sstream>

#include <boost/regex.hpp>
#include <boost/serialization/serialization.hpp>
#include <boost/serialization/split_free.hpp>
#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/text_oarchive.hpp>

namespace boost
{
  namespace serialization
  {
    template<class Archive, class charT, class traits>
    inline void save(Archive & ar, 
                     const boost::basic_regex<charT, traits> & t, 
                     const unsigned int /* file_version */)
    {
      std::basic_string<charT> str = t.str();
      typename boost::basic_regex<charT, traits>::flag_type flags = t.flags();
//      typename boost::basic_regex<charT, traits>::locale_type loc = t.getloc();
      ar & str;
      ar & flags;
//      ar & loc;
    }

    template<class Archive, class charT, class traits>
    inline void load(Archive & ar, 
                     boost::basic_regex<charT, traits> & t, 
                     const unsigned int /* file_version */)
    {
      std::basic_string<charT> str;
      typename boost::basic_regex<charT, traits>::flag_type flags;
//      typename boost::basic_regex<charT, traits>::locale_type loc;
      ar & str;
      ar & flags;
//      ar & loc;
      t.assign(str, flags);
//      t.imbue(loc);
    }

    template<class Archive, class charT, class traits>
    inline void serialize(Archive & ar, 
                          boost::basic_regex<charT, traits> & t, 
                          const unsigned int file_version)
    {
      boost::serialization::split_free(ar, t, file_version);
    }
  }
}

int main(int argc, char ** argv)
{
  std::stringstream os;

  {
    boost::regex re("<a\\s+href=\"([\\-:\\w\\d\\.\\/]+)\">");
    boost::archive::text_oarchive oar(os);
    oar & re;
  }

  os.seekg(0, std::ios_base::beg);  // rewind to the start before reading back

  boost::regex re;
  boost::cmatch matches;
  boost::archive::text_iarchive iar(os);

  iar & re;

  boost::regex_search("<a href=\"https://stackoverflow.com/questions/18752807/save-serialize-boost-or-std-regexes\">example</a>", matches, re);

  std::cout << matches[1] << std::endl;

}

But that doesn't mean you'll achieve any performance gain over reconstructing the regular expression from a string: only the pattern string and flags are stored, and load() calls assign(), which compiles the pattern again.

Note: I left out the std::locale stuff for simplicity

voodooattack