4

If I have an unknown string of the structure:

"stuff I don't care about THING different stuff I don't care about THING ... THING even more stuff I don't care about THING stuff I care about"

I want to capture the "stuff I care about" which will always be after the last occurrence of THING. There is the potential for 0 occurrences of THING, or many. If there are 0 occurrences then there is no stuff I care about. The string can't start or end with THING.

Some possible strings:

"stuff I don't care about THING stuff I care about"

"stuff I don't care about"

Some impossible strings:

"THING stuff I care about"

"stuff I don't care about THING stuff I don't care about THING"


My current solution to this problem is to use a regex with two greedy quantifiers as follows:

if( /.*THING(.*)/ ) {
    $myStuff = $1;
}

It seems to be working, but my question is about how the two greedy quantifiers will interact with each other. Is the first (leftmost) greedy quantifier always "more greedy" than the second?

Basically am I guaranteed not to get a split like the following:

"stuff I don't care about THING"

$1 = "different stuff I don't care about THING even more stuff I don't care about THING stuff I care about"

Compared to the split I do want:

"stuff I don't care about THING different stuff I don't care about THING even more stuff I don't care about THING"

"stuff I care about"

noah

3 Answers

10

Perl's regex engine returns the leftmost match, and each greedy quantifier takes as much as it can while still allowing the rest of the pattern to match. The first wildcard initially matches through to the end of the line, then backtracks one character at a time until the rest of the regex yields a match, i.e. until THING lines up with its last occurrence in the string.
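As a minimal sketch of that behavior, the snippet below wraps the pattern from the question in a small helper. The name `after_last_thing` and the sample input are assumptions made up for illustration; they do not come from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text after the last THING, or undef if
# the string contains no THING.
sub after_last_thing {
    my ($string) = @_;
    # The first .* is tried at its longest first and gives back characters
    # only until THING(.*) can match, so THING lands on its last occurrence.
    return $string =~ /.*THING(.*)/ ? $1 : undef;
}

my $result = after_last_thing("junk THING more junk THING stuff I care about");
print "captured: '$result'\n";    # note the leading space after THING
```

The capture keeps whatever whitespace follows THING, so trimming may be needed depending on the input.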

tripleee
3

During the matching process, .*THING will initially match everything up to and including the last occurrence of THING.

If there is no way the rest of the pattern can match, it will backtrack by becoming shorter, match everything up to and including the last-but-one occurrence of THING, and attempt the rest of the pattern again.

However, the rest of the pattern is .*, which always matches because it can match an empty string.

Therefore, .*THING(.*) will match up to and including the last occurrence of THING, and will match and capture the rest of the string.
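One way to see this is to compare the greedy pattern with a non-greedy variant. The sketch below is for contrast only; the `.*?` version and the sample string are assumptions added for illustration, not something the question uses.

```perl
use strict;
use warnings;

my $string = "a THING b THING c";

# Greedy: the leading .* keeps as much as possible, so THING is the last one.
my ($greedy) = $string =~ /.*THING(.*)/;

# Non-greedy, for contrast only: .*? keeps as little as possible, so THING
# is the first one and the capture takes everything after it.
my ($lazy) = $string =~ /.*?THING(.*)/;

print "greedy: '$greedy'\n";
print "lazy:   '$lazy'\n";
```

The greedy form captures only the text after the final THING, while the non-greedy form produces the unwanted split described in the question.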

Note that . matches any character except a newline. If there could be newlines in your text, you will want to use the /s modifier so that . matches any character at all.
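A short sketch of why this matters, assuming a made-up two-line input in which the last THING sits on the second line:

```perl
use strict;
use warnings;

# Illustrative two-line input; the last THING is on the second line.
my $text = "x THING y\nz THING w";

# Without /s, . stops at the newline, so the match is confined to line one.
my ($without_s) = $text =~ /.*THING(.*)/;

# With /s, . also matches newlines, so the first .* can reach the last THING.
my ($with_s) = $text =~ /.*THING(.*)/s;

print "without /s: '$without_s'\n";
print "with /s:    '$with_s'\n";
```

Without /s the pattern quietly settles for the THING on the first line, which is easy to miss because the match still succeeds.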

Note also that if the pattern fails to match (because, say, there is no THING in the string) then $1 will remain unchanged. It will still contain whatever it was set to by the most recent successful pattern match. This means that you must check the status of the pattern match before using the value of $1.
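The following sketch illustrates the stale `$1` problem and one possible way around it, capturing in list context. The earlier `key=value` match and the variable names are assumptions invented for the example.

```perl
use strict;
use warnings;

# An earlier, unrelated successful match sets $1.
"key=value" =~ /(\w+)=/ or die "unexpected: setup match failed";

my $string = "stuff with no marker";

# This match fails, so $1 is not reset; it still holds "key".
$string =~ /.*THING(.*)/;
my $stale = $1;

# Safer: capture in list context, which yields an empty list on failure,
# leaving $my_stuff undefined instead of holding an old value.
my ($my_stuff) = $string =~ /.*THING(.*)/;

print "stale \$1: '$stale'\n";
print "list-context capture is ", defined $my_stuff ? "'$my_stuff'" : "undef", "\n";
```

Wrapping the assignment in an `if`, as the question does, guards against the same problem; the list-context form is just an alternative that avoids touching `$1` directly.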

Borodin
  • Thanks for the answer. There are no newline characters, so I won't need to worry about those. Also, I think the if statement should prevent the issue of `$1` being unchanged and then assigned. – noah Jul 25 '18 at 16:48
  • @noah: That depends what `$myStuff` is set to beforehand, and how you use it. – Borodin Jul 25 '18 at 17:19
0

Here is my take.

/^(?!THING).+THING((?:(?!THING).)+)$/

Accepts a string with 1 or more occurrences of THING. THING cannot be at the beginning or end of the string. It gets the text after the last time THING appears.
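To make the accepted and rejected cases concrete, here is a small sketch that runs this pattern over a few sample strings. The helper name `strict_capture` and the inputs are hypothetical, chosen only to exercise each case.

```perl
use strict;
use warnings;

# The pattern from this answer, compiled once.
my $strict_re = qr/^(?!THING).+THING((?:(?!THING).)+)$/;

# Hypothetical helper: returns the capture, or undef when the string is rejected.
sub strict_capture {
    my ($string) = @_;
    return $string =~ $strict_re ? $1 : undef;
}

for my $s ("junk THING keep", "a THING b THING c", "THING keep", "junk THING", "no marker") {
    my $got = strict_capture($s);
    printf "%-22s => %s\n", "'$s'", defined $got ? "'$got'" : "no match";
}
```

Only the first two inputs match; the strings that start with THING, end with THING, or contain no THING are rejected.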

Edit: Added check for 'THING' at the beginning of the string.

EDIT: Wow, rereading your specs (which I really misread). You said: "If there are 0 occurrences then there is no stuff I care about. The string can't start or end with THING."

Then your regex is fine. tripleee explained the situation well.

Chris Charley
  • Out of curiosity, any idea how this would compare in terms of run time to the method I used in my question? – noah Jul 24 '18 at 23:37
  • @noah no, and sorry if I failed to answer your questions (as Borodin has). Your regular expression matches a string beginning with THING and I thought you wanted to not make that match. I think the only way to measure the performance would be to construct a test. – Chris Charley Jul 25 '18 at 00:54
  • Your take was helpful. It's always good to compare methods. My input is guaranteed not to have THING at the beginning, so matching that is not an issue. Your method has some error checking, which is nice, but fortunately I have a very well-defined input, so I can use a slightly more readable expression (I think, at least) in exchange for a broader set of strings that it would match. – noah Jul 25 '18 at 16:54
  • This *looks like* what beginners erroneously end up with when they want "nothing can match `THING` before we mention it explicitly"... Maybe emphasize that this doesn't do that? – tripleee Jul 26 '18 at 05:06