
I have a list of several hundred strings and an array of 10k regular expressions.

I now have to iterate over all strings and check which of the 10k regular expressions match. What's the most performant way to do this?

Currently I'm doing this:

myRegularExpression.firstMatch(in: myString, options: myMatchingOption, range: NSRange(myString.startIndex..., in: myString)) == nil

where myRegularExpression is an NSRegularExpression stored for reuse and myMatchingOption is NSRegularExpression.MatchingOptions(rawValue: 0)

Is there a faster, more performant way to check if a string matches one of those 10k regular expressions?

EDIT:

I need to know not only IF one of my 10k regular expressions matches but also WHICH one. So currently I have a for loop inside a for loop: the outer one iterates over my several hundred strings, and for each of these strings I iterate over my 10k rules and check whether one of them fits (of course, once one fits I can stop for that string). So roughly:

outer: for string in stringsToCheck {
    for rule in myRules {
        if string.matches(rule) {
            // match found; continue with next string of stringsToCheck
            continue outer
        }
    }
}
swalkner
  • Can you exclude groups of strings and/or regexes (i.e. do you have known patterns in your data)? The regex engine is probably heavily optimised for a given regex and string, but it cannot know your data; e.g. if you have many regexes with ^ or $ you could group all strings by their first or last letter and exclude whole groups of strings on one mismatch. Also, precompile the regexes if possible? – Love Tätting Sep 01 '17 at 18:50
  • Did you try to build a mega expression by combining all the regular expressions into a single one with | ? Not sure if the parser will survive 10,000 patterns but you may get some performance improvements even by combining them 10 or 20 at a time. – Alain T. Sep 02 '17 at 01:18
  • @LoveTätting thanks for your response, please see my edit... – swalkner Sep 02 '17 at 08:18
  • @AlainT. thanks for your response, please see my edit... – swalkner Sep 02 '17 at 08:18

1 Answer


Depending on what platform you're running this on, splitting the work across multiple threads may improve response time somewhat, but I believe that really dramatic optimization would require some insight into the nature of the regular expressions.
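As a rough sketch of the multi-threading idea (not a drop-in solution), you could parallelize the outer loop over the strings with DispatchQueue.concurrentPerform. The names stringsToCheck and myRules are just the placeholders from the question, and the result handling is an assumption; NSRegularExpression objects themselves can safely be shared across threads:

import Foundation

// Sketch: run the outer loop over the strings concurrently.
// For each string, record the index of the first matching rule (or nil).
func firstMatchingRuleIndices(stringsToCheck: [String],
                              myRules: [NSRegularExpression]) -> [Int?] {
    var results = [Int?](repeating: nil, count: stringsToCheck.count)
    let lock = NSLock()   // guards concurrent writes into `results`

    DispatchQueue.concurrentPerform(iterations: stringsToCheck.count) { i in
        let string = stringsToCheck[i]
        let range = NSRange(string.startIndex..., in: string)
        let match = myRules.firstIndex { rule in
            rule.firstMatch(in: string, options: [], range: range) != nil
        }
        lock.lock()
        results[i] = match
        lock.unlock()
    }
    return results
}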

For example, if the expressions don't have a specific precedence order, you could rearrange (reorder) them so that the most likely "matches" come first in the list. This could be decided up front, either by the supplier of the expressions or by some function that estimates their complexity (e.g. length of the expression, presence of optional or combinatory symbols). Or it could be determined statistically by collecting (and persisting) hit/miss counts for each expression. Of course, such an optimization assumes that every string will match at least one expression and that the 80/20 rule applies (i.e. 20% of the expressions match 80% of the strings).
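To illustrate the statistical variant of that idea, here is a hedged sketch (the hit counters, how they are persisted, and how often you re-sort are all assumptions, not part of Foundation):

import Foundation

// Sketch: try the rules in their current order, count hits, and
// occasionally reorder so frequently matching rules are tried first.
final class AdaptiveRuleSet {
    private var rules: [NSRegularExpression]
    private var hits: [Int]

    init(rules: [NSRegularExpression]) {
        self.rules = rules
        self.hits = Array(repeating: 0, count: rules.count)
    }

    // Returns the first matching rule (if any) and bumps its hit count.
    func firstMatch(for string: String) -> NSRegularExpression? {
        let range = NSRange(string.startIndex..., in: string)
        for (index, rule) in rules.enumerated() {
            if rule.firstMatch(in: string, options: [], range: range) != nil {
                hits[index] += 1
                return rule
            }
        }
        return nil
    }

    // Call occasionally (not on every lookup) to move popular rules forward.
    func reorderByHitCount() {
        let order = hits.indices.sorted { hits[$0] > hits[$1] }
        rules = order.map { rules[$0] }
        hits = order.map { hits[$0] }
    }
}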

If the expressions are very simple and only use letter patterns, then you would get better performance from more "manual" implementations of the matching functions (instead of regex). In the best-case scenario, simple letter patterns can be converted into a character tree and yield orders-of-magnitude performance improvements.
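To make the "character tree" idea concrete, here is a minimal sketch of a prefix trie, assuming the simple rules are plain literal prefixes with no regex metacharacters. One walk over a string then replaces thousands of individual pattern tests, and the terminal nodes remember which rule they came from:

// Sketch: a character tree (trie) for literal prefix patterns.
final class PrefixTrie {
    private final class Node {
        var children: [Character: Node] = [:]
        var ruleIndex: Int?   // set on the node that ends a pattern
    }

    private let root = Node()

    // Inserts a literal pattern and remembers which rule it represents.
    func insert(_ pattern: String, ruleIndex: Int) {
        var node = root
        for char in pattern {
            let next = node.children[char] ?? Node()
            node.children[char] = next
            node = next
        }
        node.ruleIndex = ruleIndex
    }

    // Returns the index of the first rule whose literal pattern is a
    // prefix of `string`, or nil if none of them matches.
    func firstMatch(in string: String) -> Int? {
        var node = root
        for char in string {
            if let index = node.ruleIndex { return index }
            guard let next = node.children[char] else { return nil }
            node = next
        }
        return node.ruleIndex
    }
}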

Note that these solutions are not mutually exclusive. For example, if a large proportion of the expressions are simple patterns and only a few are complex, you don't have to throw the baby out with the bathwater: you can apply the simple-pattern optimization to that subset of the rules and use the "brute force" nested loop for the remaining complex ones.
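One way to split the rule set along those lines (the metacharacter test is deliberately naive, and loadPatterns() is a hypothetical stand-in for however you obtain the 10k pattern strings):

import Foundation

// Sketch: literal patterns go into the trie from the previous sketch,
// everything else stays a compiled NSRegularExpression for the fallback loop.
let metacharacters = CharacterSet(charactersIn: "\\^$.|?*+()[]{}")

let patterns: [String] = loadPatterns()   // hypothetical loader for the 10k rules
let trie = PrefixTrie()
var complexRules: [(index: Int, regex: NSRegularExpression)] = []

for (index, pattern) in patterns.enumerated() {
    if pattern.rangeOfCharacter(from: metacharacters) == nil {
        trie.insert(pattern, ruleIndex: index)          // simple literal pattern
    } else if let regex = try? NSRegularExpression(pattern: pattern) {
        complexRules.append((index, regex))             // needs the regex engine
    }
}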

I had a similar problem in the past, where thousands of rules had to be applied to hundreds of thousands of records for processing insurance claims. The traditional "expert system" approach was to create a list of rules and run every record through it. Obviously that took a ridiculous amount of time (roughly 2 months of execution time to process one month of claims). Looking at it with a less than "purist" mindset, I was able to convince my customer that the rules should be defined hierarchically. So we split them into a set of eligibility rules and a set of decision rules. Then we further refined the structure by creating eligibility groups and decision groups. What we ended up with was a coarse tree structure in which the rules let the system narrow down the number of rules that had to be applied to a given record. With this, the 6-week processing time for 250,000 records was cut down to 7 hours (this was in 1988, mind you).

All this to say that stepping back and looking at the nature of the problem may reveal optimization opportunities that are not visible when looking merely at the mechanics of one particular approach.

Alain T.