7

I've been looking around and could not make this work. I am not a total noob.

I need to get text delimited by (including) START and END that doesn't contain START. Basically I can't find a way to negate a whole word without using advanced stuff.

Example string:

abcSTARTabcSTARTabcENDabc

The expected result:

STARTabcEND

Not good:

STARTabcSTARTabcEND

I can't use backward search stuff. I am testing my regex here: www.regextester.com

Thanks for any advice.

stema
rrr

5 Answers

9

Try this

START(?!.*START).*?END

See it here online on Regexr

(?!.*START) is a negative lookahead. It ensures that "START" does not occur anywhere later in the text (on the same line, since . does not match newlines by default).

.*? is a non-greedy match of all characters up to the next "END". It is needed because the negative lookahead only looks ahead and does not consume anything (it is a zero-length assertion).

Update:

I thought about it a bit more: the solution above matches up to the first "END". If this is not wanted (because you are excluding START from the content), then use the greedy version:

START(?!.*START).*END

This will match up to the last "END".
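
A quick way to compare the two variants is to run them on the question's example and on a string with two END markers. The sketch below uses Python's re module purely for illustration (the question does not name a regex flavor), and the second sample string is made up for this example.

```python
import re

lazy = re.compile(r"START(?!.*START).*?END")    # stops at the first END
greedy = re.compile(r"START(?!.*START).*END")   # runs to the last END

question_example = "abcSTARTabcSTARTabcENDabc"
two_ends = "abcSTARTabcENDabcENDabc"  # hypothetical input with two ENDs

print(lazy.search(question_example).group(0))   # STARTabcEND
print(lazy.search(two_ends).group(0))           # STARTabcEND
print(greedy.search(two_ends).group(0))         # STARTabcENDabcEND
```

Both variants only ever match the last START in the text, because the lookahead rejects any START that has another START after it.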

stema
  • +1 for good answer with simple explanations of all the operators – shelleybutterfly Sep 07 '11 at 14:08
  • This will fail if there is more than one `START...END` pair in the string. (Or more precisely, it will only find the last `START...END` pair in the string.) – Tim Pietzcker Oct 05 '11 at 13:32
  • To clarify Tim's comment: your regexp will NOT match where you expect it to if there is *ANY* second occurrence of `START`, be it *before* or *after* `END` (e.g. `abcSTARTabcENDxyzSTART` will not match) – vladr Jan 23 '15 at 20:37
  • Yeah, it simply asks whether there is any occurrence of `START` later in the string and, if so, will not match. This is not the wanted (described) behavior. – AturSams Jun 01 '17 at 15:31
7
START(?:(?!START).)*END

will work with any number of START...END pairs. To demonstrate in Python:

>>> import re
>>> a = "abcSTARTdefENDghiSTARTjlkENDopqSTARTrstSTARTuvwENDxyz"
>>> re.findall(r"START(?:(?!START).)*END", a)
['STARTdefEND', 'STARTjlkEND', 'STARTuvwEND']

If you only care about the content between START and END, use this:

(?<=START)(?:(?!START).)*(?=END)

See it here:

>>> re.findall(r"(?<=START)(?:(?!START).)*(?=END)", a)
['def', 'jlk', 'uvw']
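
One caveat, in case the delimited text can span several lines: in Python, . does not match a newline unless re.DOTALL is set, so the pattern above would miss such a block. A minimal sketch, with a made-up multi-line sample:

```python
import re

pattern = r"START(?:(?!START).)*END"
multiline = "abcSTART\ndef\nENDghi"  # hypothetical multi-line input

# Without DOTALL the dot stops at the first newline, so nothing matches.
print(re.findall(pattern, multiline))             # []
# With DOTALL the dot also matches newlines.
print(re.findall(pattern, multiline, re.DOTALL))  # ['START\ndef\nEND']
```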
Tim Pietzcker
4

The really pedestrian solution would be START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END. Modern regex flavors have negative assertions which do this more elegantly, but I interpret your comment about "backwards search" to perhaps mean you cannot or don't want to use this feature.

Update: Just for completeness, note that the above is greedy with respect to the end delimiter. To only capture the shortest possible string, extend the negation to also cover the end delimiter -- START(([^ES]|E*E[^ENS]|EN[^DS]|S*S[^STE]|ST[^AE]|STA[^RE]|STAR[^TE])*(S(T(AR?)?)?|EN?)?)END. This risks exceeding the torture threshold in most cultures, though.

Bug fix: A previous version of this answer had a bug, in that SSTART could be part of the match (the second S would match [^T], etc). I fixed this by adding S to [^ST] and adding S* before the non-optional S to allow for arbitrary repetitions of S.
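
Hand-built patterns like this are easy to get subtly wrong, so it may be worth checking them against a lookahead version on a few inputs. The sketch below does that in Python (chosen only for illustration); the helper name first_match and the second test string are made up for this example. On the question's input the two agree, but an input containing STSTART appears to slip through the first pattern above, because ST[^A] consumes the S that begins the inner START.

```python
import re

# The first pattern from this answer, copied verbatim.
pedestrian = re.compile(
    r"START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END"
)
# Lookahead-based reference pattern, used here for comparison only.
reference = re.compile(r"START(?:(?!START).)*END")

def first_match(pattern, text):
    """Return the first match as a string, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None

for text in ["abcSTARTabcSTARTabcENDabc", "STARTxSTSTARTyEND"]:
    print(text, first_match(pedestrian, text), first_match(reference, text))
```

For the first input both patterns give STARTabcEND. For the second, the hand-built pattern returns the whole string while the reference returns STARTyEND.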

tripleee
  • Nice solution (if no lookaheads possible) +1 – stema Sep 07 '11 at 12:00
  • This is what I was looking for, thanks. Indeed ... pedestrian :) but it works. I was hoping that there might be an easier way that I am missing. Sorry for not posting back earlier. – rrr Oct 05 '11 at 11:45
  • What is the last part for? Why do you need `(S(T(AR?)?)?)?` – AturSams May 31 '17 at 07:54
  • Okay! I get it... you need `...(S(T(AR?)?)?)?...` because otherwise, you have to consume characters after `S`, `ST`, `STA` and `STAR`... This is freaking genius. – AturSams May 31 '17 at 08:18
  • Not sure what you mean by that. A substring of START is allowed before the END delimiter and up through there we have been preventing these substrings from matching. – tripleee May 31 '17 at 09:25
  • I don't understand the answer. My question was why did you need to have this part `(S(T(AR?)?)?)?` but I think the reason is that otherwise you won't match something like `STARTSTAREND`. The `(S(T(AR?)?)?)?` lets you cleanly consume any substring of `STAR` that comes directly before `END`. – AturSams Jun 01 '17 at 15:49
  • Yes, exactly. Earlier in the match, we allow `STAR` if it is followed by *something* which isn't `T`, but just before the end delimiter we also allow it to be followed by *nothing*. (Using "consume" in this context is a bit weird, IMHO, though.) – tripleee Jun 01 '17 at 16:40
  • Thanks for prodding me, I think I found a bug, though it's not directly related to this. I'll try to fix it tomorrow. – tripleee Jun 01 '17 at 16:41
3

May I suggest a possible improvement on Tim Pietzcker's solution? It seems to me that START(?:(?!START).)*?END is better if you only want to catch a START followed by the nearest END, without any START or END in between. I am using .NET, and Tim's solution would also match something like START END END. At least in my case this is not wanted.
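
The difference can be sketched as follows. Python is used here only for illustration; the greedy and lazy quantifiers should behave the same way in .NET.

```python
import re

sample = "START END END"  # the input mentioned above

greedy = re.search(r"START(?:(?!START).)*END", sample).group(0)
lazy = re.search(r"START(?:(?!START).)*?END", sample).group(0)

print(greedy)  # START END END
print(lazy)    # START END
```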

Steve Pettifer
Johannes Wentu
0

[EDIT: I have left this post up for the information on capture groups, but the main solution I originally gave was not correct. As pointed out in the comments, (?:START)((?:[^S]|S[^T]|ST[^A]|STA[^R]|STAR[^T])*)(?:END) would not work: I was forgetting that the character matched by each negated class is consumed rather than dropped, so you would need something such as ...|STA(?![^R])| to still allow that character to be part of END. As written, it fails on something such as STARTSTAEND. The lookahead approach is clearly the better choice; the following should show the proper way to use the capture groups...]

The answer given using the 'zero-width negative lookahead' operator "?!", with capture groups, is: (?:START)((?!.*START).*)(?:END) which captures the inner text using $1 for the replace. If you want to have the START and END tags captured you could do (START)((?!.*START).*)(END) which gives $1=START $2=text and $3=END or various other permutations by adding/removing ()s or ?:s.

That way, if you are using it to do search and replace, you can do something like BEGIN$1FINISH. So, if you started with:

abcSTARTdefSTARTghiENDjkl

you would get ghi as capture group 1, and replacing with BEGIN$1FINISH would give you the following:

abcSTARTdefBEGINghiFINISHjkl

which would allow you to change your START/END tokens only when paired properly.
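
For illustration, the same replacement can be sketched in Python, where the group reference in the replacement string is written \1 (or \g<1>) rather than $1. Python is used here only as an example; the approach itself does not depend on it.

```python
import re

text = "abcSTARTdefSTARTghiENDjkl"
pattern = r"(?:START)((?!.*START).*)(?:END)"

# Group 1 holds the text between the last START and the END after it.
print(re.search(pattern, text).group(1))        # ghi
# Replace the matched delimiters while keeping the captured text.
print(re.sub(pattern, r"BEGIN\1FINISH", text))  # abcSTARTdefBEGINghiFINISHjkl
```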

Each (x) is a group. I have written (?:x), which marks a non-capturing group, for every group except the middle one, so only the middle group is captured. You could also capture the START/END tokens if you wanted to move them around or what-have-you.

See the Java regex documentation for full details on Java regexes.

shelleybutterfly