
I am using a simple regex to match against a string read from the OS that has the following format:

timestamp:{comma separated list of values}

where the timestamp is unsigned and each of the values is unsigned.

To do this I was using the following regex, built with boost::xpressive:

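// Assumed setup (not shown in the original snippet):
#include <boost/xpressive/xpressive.hpp>
#include <boost/xpressive/regex_actions.hpp>
using namespace boost::xpressive;
namespace xp = boost::xpressive;

std::vector< uint32_t > idleReports;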
uint32_t timestamp = 0;

sregex cpuIdleVal = ( +_d )[ push_back( xp::ref( idleReports ), as<unsigned>( _ ) ) ];
sregex cpuIdleData = cpuIdleVal >> "," | cpuIdleVal;
sregex fullMatch = ( +_d )[ xp::ref( timestamp ) = as<unsigned>( _ ) ] 
                                 >> ":" >> +cpuIdleData;
smatch what;
if( regex_match( test, what, fullMatch ) )
{
   // stuff
}

All works fine for the success case; benchmarking shows that the regex takes approximately 80 usec to match the following string:

"1381152543:988900,987661,990529,987440,989041,987616,988185,988346,968859,988919,859559,988967,991040,988942"

If one of the values in the input string is negative, the performance degrades significantly: if value 4 is negative, the regex takes 13 seconds to report a failure.

If value 5 is negative, the time taken is even longer.

Why is the performance so bad for failure cases?
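
One plausible explanation, offered as a sketch rather than a confirmed diagnosis: in the pattern above, cpuIdleData also matches a run of digits with no trailing comma, and it sits inside an outer +. A single six-digit value can therefore be split into pieces in many ways, and a backtracking matcher may have to try each split before it can report failure. The helper below (countSplits is a hypothetical name, not part of boost::xpressive) counts how many ways that pattern can consume a given prefix.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: counts the distinct ways the pattern
//     +( +digit >> ',' | +digit )
// can consume exactly the text `s`. Keep `s` short: the count roughly
// doubles with every digit and would overflow 64 bits on the full sample line.
std::uint64_t countSplits(const std::string& s)
{
    const std::size_t n = s.size();
    if (n == 0) return 0; // the outer '+' needs at least one piece

    // ways[i] = number of ways to consume the prefix s[0..i)
    std::vector<std::uint64_t> ways(n + 1, 0);
    ways[0] = 1;

    for (std::size_t i = 0; i < n; ++i)
    {
        if (ways[i] == 0) continue;
        // Append one more piece: a digit run s[i..j], optionally followed by ','.
        for (std::size_t j = i;
             j < n && std::isdigit(static_cast<unsigned char>(s[j]));
             ++j)
        {
            ways[j + 1] += ways[i];          // alternative: digits only
            if (j + 1 < n && s[j + 1] == ',')
                ways[j + 2] += ways[i];      // alternative: digits then ','
        }
    }
    return ways[n];
}
```

For example, countSplits("988900,") gives 32 and countSplits("988900,987661,990529,") gives 32768, so under this model each additional six-digit value in front of the bad one multiplies the number of candidate splits by 32. That would be consistent with a failure at value 5 taking longer than a failure at value 4.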

I have fixed the issue by changing the original regex to:

sregex cpuIdleData = "," >> cpuIdleVal;
sregex fullMatch =  ( +_d )[ xp::ref( timestamp ) = as<unsigned>( _ ) ] 
                           >> ":" >> cpuIdleVal >> -+ cpuIdleData ;

i.e. requiring a comma before every value after the first, and making the match against the comma-separated list non-greedy.

In the changed version, the failure scenarios perform just as well as (or slightly better than) the success scenarios.
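
For comparison, the same format can be read in a single left-to-right pass with no regex and no backtracking. This is a minimal sketch of a swapped-in technique, not the xpressive approach above: it assumes C++17 for std::from_chars, parseIdleLine is a hypothetical name, and like the changed regex it rejects a trailing comma.

```cpp
#include <charconv>
#include <cstdint>
#include <string>
#include <system_error>
#include <utility>
#include <vector>

// Hypothetical helper: parses "timestamp:v1,v2,...,vN" where every field is
// an unsigned decimal number. Returns false on malformed input (including a
// negative value) and leaves the outputs untouched in that case.
bool parseIdleLine(const std::string& line,
                   std::uint32_t& timestamp,
                   std::vector<std::uint32_t>& values)
{
    const char* p = line.data();
    const char* const end = p + line.size();

    // Reads one unsigned number at p and advances p past it.
    // std::from_chars does not accept a sign for unsigned types.
    auto readUnsigned = [&p, end](std::uint32_t& out) {
        const std::from_chars_result r = std::from_chars(p, end, out);
        if (r.ec != std::errc()) return false;
        p = r.ptr;
        return true;
    };

    std::uint32_t ts = 0;
    if (!readUnsigned(ts) || p == end || *p != ':') return false;
    ++p; // skip ':'

    std::vector<std::uint32_t> parsed;
    for (;;)
    {
        std::uint32_t v = 0;
        if (!readUnsigned(v)) return false; // fails immediately on '-'
        parsed.push_back(v);
        if (p == end) break;                // consumed the whole line
        if (*p != ',') return false;
        ++p; // skip ','
    }

    timestamp = ts;
    values = std::move(parsed);
    return true;
}
```

Because each character is examined once, a negative value is rejected as soon as it is reached, so the failure case costs no more than the success case.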

mark