0

My iPhone app uses regular expressions (with NSRegularExpression) to perform calculations over a very large number of strings (in the thousands). This of course takes a lot of time. What are some strategies to speed up the regular expressions? I looked into using blocks, but I don't think they will do any good: they seem to mostly provide lambda functionality (i.e., equivalent to Lisp) and are used on the Mac to take advantage of multiple cores. Obviously, the current iPhone doesn't have multiple cores.

Here's my code:

NSString *replaceRegexPattern = @"([\\(|\\[].*?[\\)|\\]])|(^to )";
NSRegularExpression *replaceRegex = [[NSRegularExpression regularExpressionWithPattern:replaceRegexPattern
                                                                              options:NSRegularExpressionCaseInsensitive
                                                                               error:nil] retain];
NSArray *myArray = <some data>;
NSString *myString, *compareValue;
for (NSUInteger i = 0; i < [myArray count]; i++) {
    myString = [myArray objectAtIndex:i];
    compareValue = [replaceRegex stringByReplacingMatchesInString:myString
                                                          options:0
                                                            range:NSMakeRange(0, [myString length])
                                                     withTemplate:@""];
    // do things with compareValue

}

To answer the question below, my goal in this code is to remove any text in my string which either is enclosed in parentheses, or which begins with "to ". Here are some examples:

  • Hello (Goodbye) --> Hello
  • Hello (Goodbye [n]) --> Hello
  • To Say --> Say
  • To Say (pf) --> Say
Jason
  • Your expression removes "to ", "TO ", "tO " and "To " indifferently. If you only care about one case, you can speed it up by removing option `NSRegularExpressionCaseInsensitive`. – Cœur Oct 13 '15 at 06:02

3 Answers

1

The best way to speed up that regex would be to use possessive quantifiers:

NSString *replaceRegexPattern = 
    @"^to\\s++|\\[[^\\[\\]]*+\\]|\\([^()]*+\\)";

In cases where no match is possible because an opening bracket isn't matched by the correct closing bracket, the *+ prevents backtracking that we know would be pointless. But successful match attempts are more efficient, too, because the regex engine doesn't have to save the state information that makes backtracking possible.

As Tim pointed out, this won't match nested instances of the same kind of bracket, like ((foo)) or [[bar]]. It will match any number of square brackets inside matched parentheses, or vice-versa. It doesn't require those inner brackets to be properly paired, so it will match (foo[) or [(bar))], for example. That was true of your original regex, too.

Including the opening brackets in the character classes prevents lopsided matches like [[foo] or ((bar).

Alan Moore
1

Are you sure regular expressions are the right tool for this?

If all you're trying to do is remove the text within parentheses, a simple char-by-char loop through the string could do that very easily, and even handle nested parens correctly.

In pseudo-code:

 nesting_level = 0;
 while more_chars {
       c = next_char;
       if c == '(' or c == '['
           ++nesting_level;
       else if c == ')' or c == ']'
           --nesting_level;   // check for nesting_level < 0 here?
       else if nesting_level == 0
           result += c;
 }

Obviously, do your own benchmarks, but it's possible you'll get better performance by avoiding regexes.
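
The pseudo-code above could be written in plain C (which also compiles as Objective-C) roughly as follows. This is only a sketch: the function name `strip_bracketed` is made up for illustration, it works on a NUL-terminated byte string rather than an NSString, and it assumes that a stray closing bracket should simply be dropped rather than reported.

```c
#include <assert.h>
#include <stddef.h>

/* Copies src into dst, skipping everything enclosed in () or [],
   including nested groups. dst must have room for strlen(src) + 1 bytes.
   Hypothetical helper, not part of any library. */
static void strip_bracketed(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);

    int depth = 0;  /* how many brackets are currently open */
    for (; *src != '\0'; ++src) {
        char c = *src;
        if (c == '(' || c == '[') {
            ++depth;
        } else if (c == ')' || c == ']') {
            if (depth > 0)
                --depth;  /* assumption: an unmatched closer is dropped */
        } else if (depth == 0) {
            *dst++ = c;   /* outside all brackets: keep the character */
        }
    }
    *dst = '\0';
}
```

Note that the space before a removed group is kept, so "Hello (Goodbye [n])" comes out as "Hello " with a trailing space; trim it afterwards if that matters. The leading "to " would still have to be stripped separately.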

(and if you care about detecting ill-formed things like "(hello]", you could add simple recursive descent to this)
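
For the ill-formed case, one option that is a little simpler than a recursive-descent parser is an explicit stack of open brackets. The sketch below uses that swapped-in technique; `brackets_balanced` and `MAX_DEPTH` are hypothetical names, and treating input nested deeper than `MAX_DEPTH` as ill-formed is an arbitrary assumption.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPTH 64  /* hypothetical nesting limit for this sketch */

/* Returns true if every ( and [ in s is closed by the matching ) or ]
   in the right order. Hypothetical helper, not part of any library. */
static bool brackets_balanced(const char *s)
{
    char stack[MAX_DEPTH];  /* currently open brackets, innermost last */
    size_t top = 0;

    for (; *s != '\0'; ++s) {
        char c = *s;
        if (c == '(' || c == '[') {
            if (top == MAX_DEPTH)
                return false;  /* assumption: too deep counts as ill-formed */
            stack[top++] = c;
        } else if (c == ')' || c == ']') {
            char want = (c == ')') ? '(' : '[';
            if (top == 0 || stack[--top] != want)
                return false;  /* stray or mismatched closer */
        }
    }
    return top == 0;  /* anything still open is ill-formed */
}
```

Running this check first lets the stripping loop assume well-formed input, or lets you skip strings like "(hello]" entirely.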

David Gelhar
0

Since I don't know what exactly you're trying to do, it's hard to give well-founded advice, but it looks like your regex could be improved a little.

Are you really trying to match strings like (foo), [bar], and |baz|? Inside a character class, | is a literal pipe, not an alternation operator, so unless you want to match the third example here, drop the |s.

Then, since you're expecting strings like (foo [bar] baz), you need to separate the two kinds of parentheses, and you can also speed up your regex a bit:

@"^to |\\([^)]*\\)|\\[[^\\]]*\\]"

This checks for to at the start of the string first, then goes looking for an opening paren/bracket, anything except closing parens/brackets, and a closing paren/bracket. This needs less backtracking so it's probably a bit faster.

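You won't be able to handle nested parentheses/brackets of the same kind, such as (foo (bar) baz), with a single regex because that's not a regular language anymore, unless you run the regex replace operation several times, once for each level of nesting. For that to work, the character classes must also exclude the opening bracket ([^()] instead of [^)]), so that each pass matches the innermost pair; with that change, the above example is removed if you run the regex replace twice.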
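
The repeated-pass idea can be sketched as below. To keep the example self-contained it uses POSIX `regex.h` in plain C instead of NSRegularExpression, so the pattern syntax differs slightly from the one above; `remove_matches_once` and `strip_nested` are hypothetical names. The character classes exclude both the opening and the closing bracket, so each pass can only match an innermost pair, and the "^to " alternative is left out to keep the focus on nesting.

```c
#include <regex.h>
#include <stdbool.h>
#include <string.h>

/* One left-to-right pass: deletes every match of re from buf in place.
   Returns true if anything was removed. Hypothetical helper. */
static bool remove_matches_once(const regex_t *re, char *buf)
{
    bool changed = false;
    char *p = buf;
    regmatch_t m;

    while (*p != '\0' && regexec(re, p, 1, &m, 0) == 0) {
        /* Close the gap left by the match (the +1 copies the NUL too). */
        memmove(p + m.rm_so, p + m.rm_eo, strlen(p + m.rm_eo) + 1);
        p += m.rm_so;  /* resume scanning after the removed text */
        changed = true;
    }
    return changed;
}

/* Repeats the pass until nothing matches; returns the number of passes
   that removed something, or -1 if the pattern fails to compile. */
static int strip_nested(char *buf)
{
    regex_t re;
    /* (...) without inner parens, or [...] without inner square brackets */
    if (regcomp(&re, "\\([^()]*\\)|\\[[^][]*]", REG_EXTENDED) != 0)
        return -1;

    int passes = 0;
    while (remove_matches_once(&re, buf))
        ++passes;

    regfree(&re);
    return passes;
}
```

With the input "(foo (bar) baz)", the first pass removes "(bar)" and the second removes what is left of the outer group, so two passes are needed for two levels of nesting.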

Tim Pietzcker
  • Thanks for the suggestion. I would like to be able to match nested parentheticals, but it's not the most important aspect. – Jason Mar 01 '11 at 10:54
  • Can it also be like this? `(foo (bar) baz)`? Because this (arbitrary nesting) is not a regular language any more and can't be matched by a normal regex. – Tim Pietzcker Mar 01 '11 at 11:06
  • Right. The regex that I came up with is: `\([^\)]*(\[[^\]]*\])?[^\)]*\)|\[[^\]]*\]`, which matches `(foo)`, `[bar]`, and `(foo [bar] baz)` – Jason Mar 01 '11 at 12:08
  • What's wrong with the solution I proposed? It will match the same strings. – Tim Pietzcker Mar 01 '11 at 12:27