Is this possible in regex to speed up performance?

Question

It's a fairly general question, so the exact regex doesn't really matter, but here is my current one:

(?<name>[^ ^\n]*)[ \n]+OBJECT IDENTIFIER(?<data>([^"]*"[^"]*?")*?[^"]*?)::=[ \n]*\{[ \n]*(?<parent>[^ ^\n]*) (?<oid>\w*)

The key part of this regex is the OBJECT IDENTIFIER keyword. Can I make the regex search for that keyword first, ignoring the (?<name>[^ ^\n]*)[ \n]+ in front of it, and only apply the whole expression at that location once OBJECT IDENTIFIER has been found?

It seems unclear. If you want to ignore this `(?<name>[^ ^\n]*)[ \n]+` then you could use a lookbehind. — Avinash Raj, Jul 31 '14 at 07:51
The `<name>` kind of makes me think you are using regex to parse an xml file, in which case there are better ways of doing so. General cases make it impossible to test "performance" since they are not being performed — Sayse, Jul 31 '14 at 07:51
What do you mean by "apply the whole expression to THAT location". You can search for your word, then use regex to do further search on the whole text ... — trainoasis, Jul 31 '14 at 07:52
@AvinashRaj yea i want to ignore it at first and then check it later if OBJECT IDENTIFIER is found — Vajura, Jul 31 '14 at 07:52
@AvinashRaj Hm whats the syntax to do the ? to that backtrace, still need that — Vajura, Jul 31 '14 at 07:54
@Vajura - Am I right about this being an xml file? the tags seem to suggest it is — Sayse, Jul 31 '14 at 07:57
In which case, [this question](http://stackoverflow.com/questions/848132/how-can-i-read-an-snmp-mib-file-in-c) may be a better solution. — Sayse, Jul 31 '14 at 08:00
@Sayse dont worry m8 i checked all of that out, i just need to do this fast rather than perfect. It reads and parses the files perfectly, i just wanted to optimize it a bit — Vajura, Jul 31 '14 at 08:02
In general, if you are having backtracking problems, you can (in some cases) change the expression to use atomic groups `(?>...)` to eliminate backtracking. Your problem is possibly `([^"]*"[^"]*?")*?[^"]*?`. What should that subexpression match? Should it match over several lines? — Qtax, Jul 31 '14 at 08:29

Casimir et Hippolyte · Accepted Answer · 2014-08-01T16:55:48.703

edit:

I thought that with this kind of pattern the transmission would find the position of the fixed string and that the regex engine would then search backward for the beginning of the pattern, but it can't do that. It is not as smart as I thought.

In this case, you can exploit "first string discrimination" to speed up the recognition by writing something like this:

fixed string(?<=non-fixed subpattern, fixed string)

where the fixed string comes first in the pattern and therefore allows the transmission to use the Boyer-Moore algorithm to find its position.

or you can try:

(?<=non-fixed subpattern)fixed string

I can't test it, but simply putting the "non-fixed subpattern" inside a lookbehind may let the transmission choose the better way to find the match position: testing the lookbehind or finding the fixed string. I don't know whether the transmission is smart enough to do that; it's only a supposition.

As Qtax notes in a comment, the subpattern ([^"]*"[^"]*?")*?[^"]*? is potentially slow: the group carries a lazy quantifier and is followed by [^"]*?, which can match the same thing as the beginning of the group and can also match the empty string.

Instead of this subpattern, you can rely on the ::= that comes after it and write something like [^:]+(?>:(?!:=)[^:]*)*, which uses only greedy quantifiers and keeps backtracking to a minimum. Note: if you really need to skip any ::= that sits between double quotes, you can use this: [^:"]+(?>(?::(?!:=)|"[^"]*")[^:"]*)*. The trailing class excludes the double quote so that a quoted string is always consumed as a whole by the "[^"]*" branch.
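To see what the first unrolled subpattern captures, here is a minimal sketch. The class name, the helper `DataBefore`, and the sample text are made up for illustration, and `\G` is added only to pin the match to the start of the input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DataSubpatternDemo
{
    // Unrolled, greedy-only subpattern from the answer: consume everything up to
    // the first "::=", letting a lone ':' through when it does not start "::=".
    // \G anchors the attempt at the start of the input (an addition for this demo).
    private static readonly Regex UpToAssignment =
        new Regex(@"\G(?<data>[^:]+(?>:(?!:=)[^:]*)*)::=");

    // Returns the text captured before the first "::=", or null when nothing matches.
    public static string DataBefore(string input)
    {
        Match m = UpToAssignment.Match(input);
        return m.Success ? m.Groups["data"].Value : null;
    }

    public static void Main()
    {
        // Hypothetical sample: the ':' in "a:b" is not followed by ":=",
        // so it is consumed as part of the data.
        Console.WriteLine(DataBefore(" -- see a:b here\n ::= { parent 1 }"));
    }
}
```

The atomic group means that once a `:` and the run after it have been consumed, the engine does not revisit that choice, which is the point of the rewrite.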

Another small optimisation: don't use a capturing group when you don't need to capture anything; use a non-capturing group instead, or an atomic group when that is possible and better.

In conclusion, you can test these patterns (written for free-spacing mode):

const string Pattern1 = @"\b OBJECT [ ] IDENTIFIER
    (?<= (?<name> [^^\s]+ ) \s+ OBJECT [ ] IDENTIFIER)
    (?<data> [^:]+(?> :(?!:=) [^:]* )* ) ::= \s* { \s*
    (?<parent> [^^\s]+ ) \s+
    (?<oid> \w+ )";

static Regex Reg1 = new Regex(Pattern1, RegexOptions.IgnorePatternWhitespace);

const string Pattern2 = @"(?<= (?<name> [^^\s]+ ) \s+)
    \b OBJECT [ ] IDENTIFIER
    (?<data> [^:]+(?> :(?!:=) [^:]* )* ) ::= \s* { \s*
    (?<parent> [^^\s]+ ) \s+
    (?<oid> \w+ )";

static Regex Reg2 = new Regex(Pattern2, RegexOptions.IgnorePatternWhitespace);

const string Pattern3 = @"\b OBJECT [ ] IDENTIFIER
    (?<= (?<name> [^^\s]+ ) \s+ OBJECT [ ] IDENTIFIER)
    (?<data> [^:""]+(?> (?: :(?!:=) | "" [^""]* "" ) [^:""]* )* ) ::= \s* { \s*
    (?<parent> [^^\s]+ ) \s+
    (?<oid> \w+ )";

static Regex Reg3 = new Regex(Pattern3, RegexOptions.IgnorePatternWhitespace);

const string Pattern4 = @"(?<= (?<name> [^^\s]+ ) \s+)
    \b OBJECT [ ] IDENTIFIER
    (?<data> [^:""]+ (?> (?: :(?!:=) | "" [^""]* "" ) [^:""]* )* ) ::= \s* { \s*
    (?<parent> [^^\s]+ ) \s+
    (?<oid> \w+ )";

static Regex Reg4 = new Regex(Pattern4, RegexOptions.IgnorePatternWhitespace);
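As a rough usage sketch, the following program runs a pattern of the same shape as Pattern1 over two definitions and reads the named groups. The class name, the `Parse` helper, and the simplified MIB-like sample text are assumptions made for illustration; the brace is escaped here only for readability.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ObjectIdentifierDemo
{
    // Same shape as Pattern1: the fixed string comes first, and the lookbehind
    // then recovers the name that precedes it.
    private static readonly Regex Reg = new Regex(
        @"\b OBJECT [ ] IDENTIFIER
          (?<= (?<name> [^^\s]+ ) \s+ OBJECT [ ] IDENTIFIER)
          (?<data> [^:]+ (?> :(?!:=) [^:]* )* ) ::= \s* \{ \s*
          (?<parent> [^^\s]+ ) \s+
          (?<oid> \w+ )",
        RegexOptions.IgnorePatternWhitespace);

    // Collects (name, parent, oid) for every definition found in the text.
    public static List<(string Name, string Parent, string Oid)> Parse(string text)
    {
        var found = new List<(string Name, string Parent, string Oid)>();
        foreach (Match m in Reg.Matches(text))
        {
            found.Add((m.Groups["name"].Value,
                       m.Groups["parent"].Value,
                       m.Groups["oid"].Value));
        }
        return found;
    }

    public static void Main()
    {
        // Hypothetical, simplified input; real MIB files are more varied.
        string sample =
            "internet      OBJECT IDENTIFIER ::= { iso 3 }\n" +
            "mgmt          OBJECT IDENTIFIER ::= { internet 2 }\n";

        foreach (var item in Parse(sample))
        {
            Console.WriteLine($"{item.Name}: parent={item.Parent}, oid={item.Oid}");
        }
    }
}
```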

Note: Since I don't have your data in front of me, I assumed that none of the named captures can be empty. If that is not the case, you only need to change some + quantifiers to *.

Note 2: If you use the pattern several times in your code, it may be worth trying these patterns both with and without the compiled regex option (RegexOptions.Compiled).
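One rough way to compare the two settings is a small timing harness like the sketch below. The class name, the `BuildInput` and `Run` helpers, and the generated input are all hypothetical; real timings depend on the data and the runtime, so treat the numbers only as a hint.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class CompiledOptionTiming
{
    private const string Pattern =
        @"\b OBJECT [ ] IDENTIFIER
          (?<= (?<name> [^^\s]+ ) \s+ OBJECT [ ] IDENTIFIER)
          (?<data> [^:]+ (?> :(?!:=) [^:]* )* ) ::= \s* \{ \s*
          (?<parent> [^^\s]+ ) \s+
          (?<oid> \w+ )";

    // Generates synthetic input with the requested number of definitions.
    public static string BuildInput(int definitions)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < definitions; i++)
        {
            sb.Append("node").Append(i)
              .Append("   OBJECT IDENTIFIER ::= { parent ").Append(i).Append(" }\n");
        }
        return sb.ToString();
    }

    // Times only the matching; constructing the Regex (which is where the
    // compile cost of RegexOptions.Compiled is paid) stays outside the stopwatch.
    public static (int Count, long Milliseconds) Run(string text, RegexOptions extra)
    {
        var regex = new Regex(Pattern, RegexOptions.IgnorePatternWhitespace | extra);
        var watch = Stopwatch.StartNew();
        int count = regex.Matches(text).Count;   // Count forces all matches to be found
        watch.Stop();
        return (count, watch.ElapsedMilliseconds);
    }

    public static void Main()
    {
        string text = BuildInput(20000);
        var interpreted = Run(text, RegexOptions.None);
        var compiled = Run(text, RegexOptions.Compiled);
        Console.WriteLine($"interpreted: {interpreted.Count} matches in {interpreted.Milliseconds} ms");
        Console.WriteLine($"compiled:    {compiled.Count} matches in {compiled.Milliseconds} ms");
    }
}
```

Because the construction cost is excluded here, the compiled option is more likely to pay off when the same Regex instance is reused many times than when it is built for a single pass.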

old answer:

You don't need to do that, because there is a pre-analysis phase before the regex engine itself starts working.

This phase is called the "transmission" and consists of several optimizations. One of them finds fixed strings from the pattern in the target string first, using the Boyer-Moore algorithm, to reduce the work left to the regex engine.
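The same idea can be approximated by hand, which is close to what the question asks for: locate the keyword with a plain string search, then run an anchored regex only at those positions. This is a sketch under assumptions, not a description of the engine's internals. The class and method names are hypothetical, and it relies on `Regex.Match(string, int)` letting `\G` match at the start position while the lookbehind still sees the text before it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ManualPrefilter
{
    private const string Keyword = "OBJECT IDENTIFIER";

    // \G pins the attempt to the position passed to Match(input, startat);
    // the lookbehind then picks up the name that precedes the keyword.
    private static readonly Regex AtKeyword = new Regex(
        @"\G OBJECT [ ] IDENTIFIER
          (?<= (?<name> [^^\s]+ ) \s+ OBJECT [ ] IDENTIFIER)
          (?<data> [^:]+ (?> :(?!:=) [^:]* )* ) ::= \s* \{ \s*
          (?<parent> [^^\s]+ ) \s+
          (?<oid> \w+ )",
        RegexOptions.IgnorePatternWhitespace);

    public static List<Match> FindAll(string text)
    {
        var results = new List<Match>();
        int pos = 0;
        // Ordinal IndexOf plays the role of the fixed-string search.
        while ((pos = text.IndexOf(Keyword, pos, StringComparison.Ordinal)) >= 0)
        {
            Match m = AtKeyword.Match(text, pos);
            if (m.Success)
            {
                results.Add(m);
                pos = m.Index + m.Length;   // resume after the whole match
            }
            else
            {
                pos += Keyword.Length;      // not a definition, skip this occurrence
            }
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical, simplified input.
        string sample =
            "internet      OBJECT IDENTIFIER ::= { iso 3 }\n" +
            "mgmt          OBJECT IDENTIFIER ::= { internet 2 }\n";

        foreach (Match m in FindAll(sample))
        {
            Console.WriteLine($"{m.Groups["name"].Value} -> {m.Groups["parent"].Value} {m.Groups["oid"].Value}");
        }
    }
}
```

Whether this beats a single well-ordered pattern depends on the input, so it is worth measuring both on real files.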

[`In general, this works only when the literal string is embedded a fixed distance into any match`](http://books.google.co.uk/books?id=GX3w_18-JegC&pg=PA247&lpg=PA247&dq=regex+transmission++Boyer-Moore) — Rawling, Jul 31 '14 at 08:45
Actually, in my case I managed to do it with a lookbehind before the first literal string, and now it works more than twice as fast — Vajura, Aug 01 '14 at 10:11
@Vajura: I have edited my answer, could you give a feedback on the tests. — Casimir et Hippolyte, Aug 01 '14 at 16:57
@CasimiretHippolyte yea i did something very similar and it works fine and much faster — Vajura, Aug 04 '14 at 05:08
