
It's a more general question, so it doesn't really matter which regex I use, but here is my current one:

(?<name>[^ ^\n]*)[ \n]+OBJECT IDENTIFIER(?<data>([^"]*"[^"]*?")*?[^"]*?)::=[ \n]*\{[ \n]*(?<parent>[^ ^\n]*) (?<oid>\w*)

The key element in this regex is the OBJECT IDENTIFIER keyword. Can I make the regex search for that keyword first, ignoring the (?<name>[^ ^\n]*)[ \n]+ in front of it, and only once OBJECT IDENTIFIER has been found apply the whole expression at that location?

Alan Moore
Vajura
  • It seems unclear. If you want to ignore this `(?<name>[^ ^\n]*)[ \n]+` then you could use a lookbehind. – Avinash Raj Jul 31 '14 at 07:51
  • The `<name>` kind of makes me think you are using regex to parse an XML file, in which case there are better ways of doing so. General cases make it impossible to test "performance" since they are not being performed – Sayse Jul 31 '14 at 07:51
  • What do you mean by "apply the whole expression to THAT location". You can search for your word, then use regex to do further search on the whole text ... – trainoasis Jul 31 '14 at 07:52
  • @AvinashRaj yea i want to ignore it at first and then check it later if OBJECT IDENTIFIER is found – Vajura Jul 31 '14 at 07:52
  • try `(?<=[^ ^\n]*[ \n]+)OBJECT IDENTIFIER` – Avinash Raj Jul 31 '14 at 07:53
  • @AvinashRaj Hm, what's the syntax to add the `?<name>` to that lookbehind? I still need that – Vajura Jul 31 '14 at 07:54
  • `?<name>` is a capturing group name. You don't need that. – Avinash Raj Jul 31 '14 at 07:55
  • I need it to reference to it later – Vajura Jul 31 '14 at 07:56
  • @Vajura - Am I right about this being an xml file? the tags seem to suggest it is – Sayse Jul 31 '14 at 07:57
  • @Sayse nope its a snmp MIB file – Vajura Jul 31 '14 at 07:59
  • In which case, [this question](http://stackoverflow.com/questions/848132/how-can-i-read-an-snmp-mib-file-in-c) may be a better solution. – Sayse Jul 31 '14 at 08:00
  • whoop MIB file, sounds like Cacti – trainoasis Jul 31 '14 at 08:00
  • @Sayse don't worry m8, I checked all of that out, I just need to do this fast rather than perfect. It reads and parses the files perfectly, I just wanted to optimize it a bit – Vajura Jul 31 '14 at 08:02
  • In general if you are having backtracking problems, you can (in some cases) change the expression use atomic groups `(?>...)` to eliminate backtracking. Your problem is possibly `([^"]*"[^"]*?")*?[^"]*?`. What should that subexpression match? Should it match over several lines? – Qtax Jul 31 '14 at 08:29

1 Answer


edit:

I thought that with this kind of pattern the transmission would locate the fixed string first, and that the regex engine would then search backward for the beginning of the pattern, but it can't do that. It is not as smart as I thought.

In this case, you can exploit "first string discrimination" to speed up recognition by writing something like this:

fixed string(?<=non-fixed subpattern, fixed string)

Here the fixed string comes first in the pattern, which allows the transmission to use the Boyer-Moore algorithm to find its position.

or you can try:

(?<=non-fixed subpattern)fixed string 

I can't test it, but simply putting the "non-fixed subpattern" inside a lookbehind may let the transmission choose the best way to locate the match position: testing the lookbehind or finding the fixed string. I don't know whether the transmission is smart enough to do that; it's only a supposition.

As Qtax notes in a comment, the subpattern ([^"]*"[^"]*?")*?[^"]*? is potentially slow: the lazily quantified group is followed by [^"]*?, which can match the same text as the beginning of the group and can also match the empty string.

Instead of this subpattern, you can rely on the ::= that follows and write something like: [^:]+(?>:(?!:=)[^:]*)* which uses only greedy quantifiers and, thanks to the atomic group, avoids needless backtracking. Note: if you really need to skip any ::= that sits between double quotes, you can use this: [^:"]+(?>(?::(?!:=)|"[^"]*")[^:"]*)*.
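To make the behaviour of the first subpattern concrete, here is a minimal sketch that runs it on its own. The sample string is hypothetical (the real MIB content is not shown in the question), and the variable names are only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample text, invented for illustration only.
string sample = "DESCRIPTION \"ratio 1:2\" ::= { enterprises 99 }";

// A greedy run of non-colons, then (atomically) any colon that does not
// start "::=", followed by more non-colons, repeated as often as possible.
var data = new Regex(@"[^:]+(?>:(?!:=)[^:]*)*");

Match m = data.Match(sample);

// The match should stop right in front of the first "::=".
string rest = sample.Substring(m.Index + m.Length);
Console.WriteLine($"matched: [{m.Value}]");
Console.WriteLine($"rest:    [{rest}]");
```

The single colon inside the quoted text is consumed because it is not followed by :=, while the first colon of ::= fails the negative lookahead and ends the match.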

Another small optimisation: don't use a capturing group when you don't need to capture anything; use non-capturing groups instead, or atomic groups where that is possible and beneficial.

In conclusion, you can test these patterns (written for free-spacing mode):
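As a small illustration of that point, the sketch below compares a capturing and a non-capturing inner group by counting the groups the engine has to track. The two simplified patterns and the sample text are assumptions made for this example; they are not the patterns from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Same structure twice; only the inner group differs: (...) versus (?:...).
var capturing    = new Regex(@"(?<data>([^""]*""[^""]*"")*[^""]*)::=");
var nonCapturing = new Regex(@"(?<data>(?:[^""]*""[^""]*"")*[^""]*)::=");

// Group 0 is always the whole match; every extra entry is a capture
// that the engine must record while matching.
int groupsWith    = capturing.GetGroupNumbers().Length;
int groupsWithout = nonCapturing.GetGroupNumbers().Length;
Console.WriteLine($"capturing: {groupsWith} groups, non-capturing: {groupsWithout} groups");

// Both variants still extract the same named capture (hypothetical input).
string sample = "x \"y\" ::= { parent 1 }";
string a = capturing.Match(sample).Groups["data"].Value;
string b = nonCapturing.Match(sample).Groups["data"].Value;
Console.WriteLine(a == b);
```

RegexOptions.ExplicitCapture is another way to get a similar effect: with it, unnamed parentheses no longer capture.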

const string Pattern1 = @"\b OBJECT [ ] IDENTIFIER
    (?<= (?<name> [^^\s]+ ) \s+ OBJECT [ ] IDENTIFIER)
    (?<data> [^:]+(?> :(?!:=) [^:]* )* ) ::= \s* { \s*
    (?<parent> [^^\s]+ ) \s+
    (?<oid> \w+ )";

static Regex Reg1 = new Regex(Pattern1, RegexOptions.IgnorePatternWhitespace);

const string Pattern2 = @"(?<= (?<name> [^^\s]+ ) \s+)
    \b OBJECT [ ] IDENTIFIER
    (?<data> [^:]+(?> :(?!:=) [^:]* )* ) ::= \s* { \s*
    (?<parent> [^^\s]+ ) \s+
    (?<oid> \w+ )";

static Regex Reg2 = new Regex(Pattern2, RegexOptions.IgnorePatternWhitespace);

const string Pattern3 = @"\b OBJECT [ ] IDENTIFIER
    (?<= (?<name> [^^\s]+ ) \s+ OBJECT [ ] IDENTIFIER)
    (?<data> [^:""]+(?> (?: :(?!:=) | "" [^""]* "" ) [^:""]* )* ) ::= \s* { \s*
    (?<parent> [^^\s]+ ) \s+
    (?<oid> \w+ )";

static Regex Reg3 = new Regex(Pattern3, RegexOptions.IgnorePatternWhitespace);

const string Pattern4 = @"(?<= (?<name> [^^\s]+ ) \s+)
    \b OBJECT [ ] IDENTIFIER
    (?<data> [^:""]+ (?> (?: :(?!:=) | "" [^""]* "" ) [^:""]* )* ) ::= \s* { \s*
    (?<parent> [^^\s]+ ) \s+
    (?<oid> \w+ )";

static Regex Reg4 = new Regex(Pattern4, RegexOptions.IgnorePatternWhitespace);
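A minimal usage sketch for the first variant follows, assuming a tiny MIB-like input. The sample text and the names mib and results are hypothetical, and the pattern is restated with the brace escaped so that the snippet is self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Same shape as Pattern1: fixed string first, then a lookbehind for the name.
const string pattern = @"\b OBJECT [ ] IDENTIFIER
    (?<= (?<name> [^^\s]+ ) \s+ OBJECT [ ] IDENTIFIER)
    (?<data> [^:]+(?> :(?!:=) [^:]* )* ) ::= \s* \{ \s*
    (?<parent> [^^\s]+ ) \s+
    (?<oid> \w+ )";

var regex = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);

// Hypothetical input, invented for illustration only.
string mib = "myNode OBJECT IDENTIFIER ::= { enterprises 42 }\n" +
             "childNode OBJECT IDENTIFIER ::= { myNode 1 }";

var results = new List<string>();
foreach (Match m in regex.Matches(mib))
{
    // The name is captured inside the lookbehind, so it is available
    // even though the match itself starts at OBJECT.
    results.Add($"{m.Groups["name"].Value} -> {m.Groups["parent"].Value}.{m.Groups["oid"].Value}");
}

foreach (string line in results)
    Console.WriteLine(line);
```

Note that with this layout the match value begins at OBJECT IDENTIFIER, not at the name, which matters only if you use the match position or full match text.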

Note: Since I don't have your data in front of me, I assumed that none of the named captures can be empty. If that is not the case, you only need to change some + quantifiers to *.

Note 2: If you need to use the pattern several times in your code, it may be worth trying these patterns both with and without the compiled regex option.
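One rough way to run that comparison is sketched below, assuming synthetic input generated in a loop. The helper CountMatches and the input shape are hypothetical; timings depend on the machine and runtime, so no numbers are claimed here.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

const string pattern = @"\b OBJECT [ ] IDENTIFIER
    (?<= (?<name> [^^\s]+ ) \s+ OBJECT [ ] IDENTIFIER)
    (?<data> [^:]+(?> :(?!:=) [^:]* )* ) ::= \s* \{ \s*
    (?<parent> [^^\s]+ ) \s+
    (?<oid> \w+ )";

// Synthetic input: 2000 hypothetical definitions, one per line.
var sb = new StringBuilder();
for (int i = 0; i < 2000; i++)
    sb.Append("node").Append(i).Append(" OBJECT IDENTIFIER ::= { parent ").Append(i).Append(" }\n");
string input = sb.ToString();

int CountMatches(RegexOptions extra)
{
    // Construction (including compilation) is deliberately left outside
    // the timed region; only the matching itself is measured.
    var regex = new Regex(pattern, RegexOptions.IgnorePatternWhitespace | extra);
    var watch = Stopwatch.StartNew();
    int count = regex.Matches(input).Count; // Count forces all matches to be found
    watch.Stop();
    Console.WriteLine($"{extra}: {count} matches in {watch.ElapsedMilliseconds} ms");
    return count;
}

int interpreted = CountMatches(RegexOptions.None);
int compiled = CountMatches(RegexOptions.Compiled);
```

A single run like this is only indicative; for a fair comparison, repeat the measurement several times and use your real MIB files as input.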

old answer:

You don't need to do that, because there is a pre-analysis phase before the regex engine itself starts working.

This phase is called "transmission" and consists of several optimizations. One of these optimizations consists of finding fixed strings from the pattern in the target string first, using the Boyer-Moore algorithm to reduce the regex engine work.

Casimir et Hippolyte