6

I am using a regex to find:

test:?

Followed by any character until it hits the next:

test:?

Now when I run this regex I made:

((?:test:\?)(.*)(?!test:\?))

On this text:

test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2

I expected to get:

test:?foo2=bar2&baz2=foo2

test:?foo=bar&baz=foo

test:?foo2=bar2&baz2=foo2

But instead it matches everything. Does anyone with more regex experience know where I have gone wrong? I've used regexes for pattern matching before but this is my first experience of lookarounds/aheads.

Thanks in advance for any help/tips/pointers :-)

tchrist
james
  • Are you just wanting to split on `test:?`? If you are, your environment will provide a way of doing that without regular expressions. – Chris Morgan Feb 25 '12 at 00:49

4 Answers

5

I guess you could explore a greedy version, shown here in expanded (free-spacing) form, so the whitespace in the pattern is not significant:

(test:\? (?: (?!test:\?)[\s\S])* )
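For anyone who wants to try this, here is a minimal Python sketch of the same idea. It assumes Python's `re` module with `re.VERBOSE` standing in for the expanded form; the variable names are only for illustration, and the outer capture group is dropped so that `findall` returns whole matches.

```python
import re

text = "test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2"

# re.VERBOSE ignores whitespace in the pattern and allows inline comments.
pattern = re.compile(r"""
    test:\?              # the literal delimiter
    (?:                  # then, repeatedly:
        (?!test:\?)      #   only if the delimiter does not start here,
        [\s\S]           #   consume one character (newlines included)
    )*
""", re.VERBOSE)

# With no capture groups, findall returns the full text of each match.
chunks = pattern.findall(text)
for chunk in chunks:
    print(chunk)
```

The quantifier stays greedy here; the negative lookahead is checked before every single character, which is what stops each match just before the next `test:?`.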

2

The Perl program below

#! /usr/bin/env perl

use strict;
use warnings;

$_ = "test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2";

while (/(test:\?  .*?) (?= test:\? | $)/gx) {
  print "[$1]\n";
}

produces the desired output from your question, plus brackets for emphasis.

[test:?foo2=bar2&baz2=foo2]
[test:?foo=bar&baz=foo]
[test:?foo2=bar2&baz2=foo2]

Remember that regex quantifiers are greedy and want to gobble up as much as they can without breaking the match. You want each subsegment to terminate as soon as possible, which means .*? semantics.

Each subsegment terminates with either another test:? or end-of-string, which we look for with (?=...) zero-width lookahead wrapped around | for alternatives.

The pattern in the code above uses Perl’s /x regex switch for readability. Depending on the language and libraries you’re using, you may need to remove the extra whitespace.
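As a rough cross-check in another engine, the same pattern with the extra whitespace removed can be sketched in Python. This assumes Python's `re` module and is only an illustration, not a port to any particular Java API; the names `text` and `segments` are made up for the example.

```python
import re

text = "test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2"

# Lazy .*? stops as soon as the lookahead sees the next delimiter or the end.
pattern = re.compile(r"(test:\?.*?)(?=test:\?|$)")

# finditer plays the role of Perl's while (/.../g) loop.
segments = [m.group(1) for m in pattern.finditer(text)]
for segment in segments:
    print(f"[{segment}]")
```

Note that `.` does not match a newline by default, so input spanning several lines would need `re.DOTALL` or a class such as `[\s\S]`.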

Greg Bacon
  • I used your syntax in a regex tester but the match still gave me a string containing two "test:?" strings. I am using Java so I assume it might be syntax related? (I removed the whitespace for testing.) Thanks for your help though, I learned a lot more. – james Feb 25 '12 at 12:33
0

Three issues:

  • (?!) is a negative lookahead assertion. You want (?=) instead, requiring that what comes next is test:?.

  • The .* is greedy; you want it non-greedy so that you grab just the first chunk.

  • You're wanting the last chunk also, so you want to match $ as well at the end.
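To see the first two issues in action, here is a small Python sketch (assuming Python's `re` module; the variable names are illustrative). It runs the pattern from the question, then a version where only the greediness is changed.

```python
import re

text = "test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2"

# Pattern from the question: greedy .* runs to the end of the string, where
# the negative lookahead trivially succeeds (nothing follows at all).
original = re.search(r"((?:test:\?)(.*)(?!test:\?))", text)
matches_everything = original.group(0) == text
print(matches_everything)

# Making .* lazy while keeping (?!...) is not enough: .*? matches nothing,
# and the lookahead is satisfied right away because "foo2..." comes next.
lazy_only = re.search(r"test:\?(.*?)(?!test:\?)", text)
print(repr(lazy_only.group(0)))
```

So the lookahead has to be positive, and the quantifier lazy, before the match stops where intended.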

End result:

(?:test:\?)(.*?)(?=test:\?|$)

I've also removed the outer group, seeing no point in it. All RE engines that I know of let you access group 0 as the full match, or some other such way (though perhaps not when finding all matches). You can put it back if you need to.

(This works in PCRE; not sure if it would work with POSIX regular expressions, as I'm not in the habit of working with them.)

If you're just wanting to split on test:?, though, regular expressions are the wrong tool. Split the strings using your language's inbuilt support for such things.
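A minimal sketch of the split approach in Python, assuming plain `str.split` and assuming that empty pieces (such as the one before the first delimiter) should simply be dropped:

```python
text = "test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2"
delimiter = "test:?"

# split() yields an empty string before the first delimiter; skip empty
# pieces and put the delimiter back on the front of each remaining one.
parts = [delimiter + piece for piece in text.split(delimiter) if piece]
for part in parts:
    print(part)
```

Dropping empty pieces also hides back-to-back delimiters (`test:?test:?`), so keep them if those matter to you.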

Python:

>>> import re
>>> # raw string, so the backslashes reach the regex engine unchanged
>>> re.findall(r'(?:test:\?)(.*?)(?=test:\?|$)',
... 'test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2')
['foo2=bar2&baz2=foo2', 'foo=bar&baz=foo', 'foo2=bar2&baz2=foo2']
Chris Morgan
-1

You probably want ((?:test:\?)(.*?)(?=test:\?)), although you haven't told us what language you're using to drive the regexes.

The .*? matches as few characters as possible without preventing the whole string from matching, where .* matches as many as possible (is greedy).

Depending, again, on what language you're using to do this, you'll probably need to match, then chop the string, then match again, or call some language-specific match_all type function.

By the way, you don't need to anchor a regex using a lookahead (you can just match the pattern to search for, instead), so this will (most likely) do in your case:

test:[?](.*?)test:[?]
Borealid
  • If you're going to take that approach, then you need to change the `?!` to `?=`. – ruakh Feb 25 '12 at 00:49
  • -1, lookahead is needed. Without it every other required match would not match because `test:` has already been consumed. – Qtax Feb 25 '12 at 03:24
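The objection in the comment above can be checked with a short Python sketch (assuming Python's `re.findall`; variable names are illustrative). It compares the consuming pattern with a lookahead version on the text from the question.

```python
import re

text = "test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2"

# Consuming the trailing delimiter: the second test:? is used up by the
# first match, so the middle chunk can no longer start a match of its own.
consuming = re.findall(r"test:[?](.*?)test:[?]", text)
print(consuming)

# A lookahead leaves the trailing delimiter in place for the next match.
# Without a $ alternative the final chunk is still missed.
lookahead = re.findall(r"test:[?](.*?)(?=test:[?])", text)
print(lookahead)
```

The first call finds only one chunk and the second finds two of the three, which is why the other answers use a lookahead and also add `|$` inside it.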