
I need to get the following regular expression to work, but I'm having issues. Yes, it's parsing HTML. No, there's no better option to use.

This is the regex:

test(.*)\/[^s].*(=|\/|Z)

I'm using the "U" modifier (so it's ungreedy), and "\" is my escape symbol.

Plugging in this subject string:

test.com/sch/anythingwhateverZhello

Results in a match, when I don't think it should. The captures are ".com/sch" and "Z", although I think I specifically told it to A) capture only up to the first "/", so the first capture should be ".com", and B) not match if the first letter after the "/" is an "s". Interestingly, and probably the source of my problem, when I remove the [^s], the capture works correctly. With it in, the asterisk gobbles up to the second "/", which makes no sense to me. I tried putting a question mark after the asterisk, as a double hint to the regex that it should not be greedy, but this made no difference.

OK, so instead of a negated character class (I really don't want to exclude just "s"; I really would like to exclude "sch" specifically), I next tried a negative lookahead:

test(.*)\/(?!sch).*(=|\/|Z)

Same problem! Matching, and first capture is ".com/sch".

Any ideas what my blunder is here? (I've been using RexV2 regex validator at http://www.rexv.org/, so it occurred to me that there might be a bug in that engine, but I can replicate this issue in my live environment).

Rudi Visser
FoulFoot
  • `test(.*)` that is your problem. Maybe it should be `test([^\/]*)`? The way it is, it's matching `test.com/sch` and then `/` and then there is no `s` in `anythingwhateverZhello`, so it keeps going. – Shef Feb 28 '13 at 20:06
  • You, sir, are a genius. That fixed it. I still don't understand why the ungreedy (.*) wouldn't stop at the first "/" (and indeed, it does, when there's no [^s] after it...), but I'll leave that for further scholars. Incidentally, your fix also makes the lookahead work, too. Thank you! – FoulFoot Feb 28 '13 at 20:18
  • Great, I will post that as an answer and you can mark your question as solved. – Shef Feb 28 '13 at 20:20

1 Answer


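`test(.*)` is your problem. Maybe it should be `test([^/]*)`?

The way it is, `.` means any character, including `/`. Ungreedy only means the engine tries the shortest match first; when `[^s]` fails on the `s` of `sch`, it backtracks and lets `(.*)` grow until the rest of the pattern can match. So it matches `test.com/sch`, then `/`, and then the next character is the `a` of `anythingwhateverZhello`, which is not an `s`, so it keeps going.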
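To see the difference side by side, here is a minimal sketch in Python. It rests on one assumption: Python's `re` module has no "U" modifier, so explicit lazy quantifiers (`.*?`) stand in for it. The names `SUBJECT`, `ORIGINAL`, `FIXED`, `FIXED_LOOKAHEAD` and `captures` are made up for this illustration.

```python
import re

SUBJECT = "test.com/sch/anythingwhateverZhello"

# Assumption: explicit lazy quantifiers (.*?) stand in for the "U" modifier,
# which Python's re module does not provide.
ORIGINAL = re.compile(r"test(.*?)/[^s].*?(=|/|Z)")
# The first group can no longer cross a "/", so backtracking cannot extend it.
FIXED = re.compile(r"test([^/]*)/[^s].*?(=|/|Z)")
# Same fix, combined with the negative lookahead from the question.
FIXED_LOOKAHEAD = re.compile(r"test([^/]*)/(?!sch).*?(=|/|Z)")


def captures(pattern, subject):
    """Return the captured groups, or None when the pattern does not match."""
    match = pattern.search(subject)
    return match.groups() if match else None


print(captures(ORIGINAL, SUBJECT))         # ('.com/sch', 'Z')
print(captures(FIXED, SUBJECT))            # None
print(captures(FIXED_LOOKAHEAD, SUBJECT))  # None
```

With the lazy `(.*?)`, the engine first tries ".com", fails on `[^s]`, and then lengthens the group past the first "/". With `[^/]*` there is nothing left to lengthen, so the overall match fails as intended.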

Shef