0

Suppose I have the following test string:

Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop

where _ stands for any characters, e.g. StartaGetbbGetcccGetddddStopeeeeeStart....

What I want to extract is the last occurrence of the word Get within each pair of Start and Stop delimiters. The result here would be the three bolded Get below.

Start__Get__Get__**Get**__Stop__Start__Get__**Get**__Stop__Start__**Get**__Stop

To be precise, I'd like to do this using only regex and, as far as possible, in a single pass.

Any suggestions are welcome.

Thanks,

Jerome
  • "I precise that I'd like to do this only using regex and as far as possible in a single pass." -- why? And what flavour of regex is this? (since different versions support different constructs) – Peter Boughton Jul 26 '10 at 13:41
  • Regex because I need to extend an existing generic tool developed using regex. It uses .NET Framework System.Text.RegularExpressions, but I cannot say exactly which flavour it is... Probably Microsoft's one. – Jerome Jul 26 '10 at 13:50
  • Microsoft has (at least) two different flavours, but saying it's .NET Framework should be enough to narrow it down. – Peter Boughton Jul 26 '10 at 14:00

5 Answers

1
Get(?=(?:(?!Get|Start|Stop).)*Stop)

I'm assuming your Start and Stop delimiters will always be properly balanced and they can't be nested.
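
For readers who want to try the pattern outside .NET, here is a minimal sketch in Python. The pattern is copied unchanged from above; the names `LAST_GET`, `last_get_positions` and `sample` are made up for illustration, and the sketch assumes single-line input, since `.` does not match a newline by default.

```python
import re

# Pattern copied from the answer; everything else is illustrative.
# The tempered dot (?:(?!Get|Start|Stop).)* only crosses characters that
# do not begin another Get, Start or Stop, so the lookahead can reach
# Stop only from the last Get of a block.
LAST_GET = re.compile(r"Get(?=(?:(?!Get|Start|Stop).)*Stop)")

def last_get_positions(text):
    """Return the start offset of every Get the pattern matches."""
    return [m.start() for m in LAST_GET.finditer(text)]

sample = "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop"
print(last_get_positions(sample))  # [14, 33, 48]
```

The three offsets are the last Get of each Start...Stop block, found in a single scan of the string.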

Alan Moore
  • That's exactly what I needed! Thanks, Alan Moore. – Jerome Jul 26 '10 at 20:25
  • Hi Alan, I've tried this variant of your solution: Get(?=(?:(?!Get).)*Stop) and it seems to work too. Why is the alternation (Get|Start|Stop) needed, since (assuming the delimiters are correctly balanced, as you mention) the only requirement is that no other Get appears between the matched Get and the Stop that follows it? – Jerome Jul 27 '10 at 08:04
  • `Start` is to prevent matching a `Get` that's not between delimiters, like `Get_Start_Stop`. As for `Stop`, suppose there's a whole bunch of text after the last `Stop`. You don't want the `.*` to go all the way to the end, only to have to backtrack most of that distance to match the `Stop`. Lookaheads can be slippery; it's worth a little extra care to make sure they only look as far ahead as you need them to. – Alan Moore Jul 27 '10 at 09:13
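
The difference between the two variants discussed in these comments can be checked with a small sketch. This is Python rather than .NET, the variable names are hypothetical, and the input is the `Get_Start_Stop` case mentioned above.

```python
import re

# Full pattern from the answer and the simplified variant from the comment.
full = re.compile(r"Get(?=(?:(?!Get|Start|Stop).)*Stop)")
simplified = re.compile(r"Get(?=(?:(?!Get).)*Stop)")

# A Get that sits outside any Start...Stop block.
stray = "Get_Start_Stop"

simplified_hits = [m.start() for m in simplified.finditer(stray)]
full_hits = [m.start() for m in full.finditer(stray)]

print(simplified_hits)  # [0]  the stray Get is matched
print(full_hits)        # []   the lookahead refuses to cross Start
```

On well-formed input where every Get lies between delimiters, both patterns return the same matches; they differ only on cases like this one and in how far the lookahead scans.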
0

I would have done it in two passes. The first pass finds the word "Get", and the second pass counts the number of its occurrences.

PolyThinker
  • Thanks, PolyThinker. I can handle it in two steps as you suggest, but I wonder whether it would be possible in a single pass... – Jerome Jul 26 '10 at 13:33
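
One possible reading of the two-pass idea, sketched in Python: the first pass isolates each Start...Stop block, the second pass finds the Get occurrences inside it and keeps the last one. This is an interpretation, not necessarily what the answer had in mind, and the function name is hypothetical.

```python
import re

def last_get_two_pass(text):
    """Return the offset in text of the last Get in each Start...Stop block."""
    offsets = []
    # Pass 1: isolate each block (non-greedy, so it ends at the nearest Stop).
    for block in re.finditer(r"Start(.*?)Stop", text, re.DOTALL):
        # Pass 2: list the Get occurrences inside the block, keep the last.
        gets = list(re.finditer(r"Get", block.group(1)))
        if gets:
            offsets.append(block.start(1) + gets[-1].start())
    return offsets

sample = "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop"
print(last_get_two_pass(sample))  # [14, 33, 48]
```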
0
$ echo "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get__Stop" | awk -vRS="Stop" -F"_*" '{print $(NF-1)}'
Get
Get
Get
ghostdog74
0

Something like this, maybe:

(?<=Start(?:.Get)*)Get(?=.Stop)

That requires variable-length lookbehind, which not all regex engines support.
It could be given a maximum length, which a few more engines (but still not all) support, by changing the first * to {0,99} or similar.

Also, in the lookahead, the . should possibly be a .+ or .{1,2}, depending on whether the double underscore is a typo or not.
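
As an illustration of the engine-support caveat, Python's standard `re` module is one engine that rejects both the unbounded and the bounded form, because it requires fixed-width lookbehind. A minimal sketch (the helper name `compiles` is made up for this example):

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Unbounded variable-length lookbehind, as written in the answer.
print(compiles(r"(?<=Start(?:.Get)*)Get(?=.Stop)"))        # False
# Bounded variant: still variable-length, so still rejected here.
print(compiles(r"(?<=Start(?:.Get){0,99})Get(?=.Stop)"))   # False
# A fixed-width lookbehind is accepted.
print(compiles(r"(?<=Start.)Get(?=.Stop)"))                # True
```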

Peter Boughton
  • AFAIK, the `{0,99}` trick only works in Java (i.e., it supports bounded variable-length lookbehind). But you're in luck: the OP is using .NET, one of the two flavors that support *unbounded* lookbehind (the other being JGSoft). – Alan Moore Jul 26 '10 at 20:18
0

With Perl, I'd do:

my $test = "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop";
$test =~ s#(?<=Start_)((Get_)*)(Get)(?=_Stop)#$1<FOUND>$3</FOUND>#g;
print $test;

output:

Start_Get_Get_<FOUND>Get</FOUND>_Stop_Start_Get_<FOUND>Get</FOUND>_Stop_Start_<FOUND>Get</FOUND>_Stop

You should adapt this to your regex flavour.
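
As an example of such an adaptation, here is a rough Python port of the substitution. It assumes, like the Perl version, that the separator is a literal underscore; the function name `tag_last_get` is hypothetical. The lookbehind `(?<=Start_)` is fixed-width, so Python's `re` accepts it.

```python
import re

def tag_last_get(text):
    """Wrap the last Get of each Start_..._Stop block in <FOUND> tags."""
    # Group 1 holds the leading Get_ repetitions, group 2 the final Get.
    pattern = r"(?<=Start_)((?:Get_)*)(Get)(?=_Stop)"
    return re.sub(pattern, r"\1<FOUND>\2</FOUND>", text)

test = "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop"
print(tag_last_get(test))
```

With the test string above, this prints the same tagged string as the Perl output.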

Toto