6

I have two sentences as input, for example:

<span>I love my red car.</span>
<span>I love my car.</span>

Now I want to match every text part inside the span tags and, if available, the color.

If I use the following regex:

/<span>(.*?)(?P<color>red)(.*?)<\/span>/ms

Only the line with the color is matched. So I thought I'd use the ? operator (zero or one).

/<span>(.*?)(?P<color>red)?(.*?)<\/span>/ms

Now both lines/sentences are matched. Sadly, the color isn't captured anymore.

The question is: why? By using ".*?" before the color part, I thought I had made the regex non-greedy, so that the color part would match if it exists. But as described, it doesn't...

netblognet
  • Regex + markup go together like petrol and mules: though both useful, they don't work well together. Use `DOMDocument` – Elias Van Ootegem Sep 18 '13 at 07:30
  • @EliasVanOotegem DOMDocument is not the point here, since the matter is about parsing the `I love my red car` string, which is just plain text. – Alma Do Sep 18 '13 at 07:33
  • @AlmaDoMundo _"I want to match every textpart inside the span-tags"_ => Who's to say that the snippet provided isn't part of a bigger string of markup, containing div tags? – Elias Van Ootegem Sep 18 '13 at 07:35
  • @EliasVanOotegem I think it's irrelevant, since the question is the same regardless of whether this is in HTML or not, as long as it's "something between two somethings". – Nicole Sep 18 '13 at 07:39
  • What does this have to do with the 'one or zero' regex operator? – WiseOldDuck Sep 11 '14 at 21:44

2 Answers

5

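The first (.*?) starts by matching the empty string between > and I. Since it's lazy, it immediately tests the next part of the regex, (?P<color>red)?, but there's no red at that point, so the 'zero' option of ? kicks in and the regex moves on to the next part, the second (.*?). That also starts as an empty match between > and I and, being lazy, checks the next part of the regex: <\/span> (I'm taking it as a whole). That fails, so the engine backtracks to the most recent lazy group and lets it grow one character at a time. The first (.*?) never gets a chance to expand, because the overall match succeeds before the engine has to backtrack that far.

So the second (.*?) will match all the way up to <\/span>.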

Indeed, your results[1] will be empty, as will results['color'], and results[3] will contain I love my red car.
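
To see this for yourself, here is a minimal sketch that runs the pattern from the question through Python's `re` module rather than PHP (an assumption on my side; Python accepts the same `(?P<color>...)` syntax). The variable names are only for illustration. Note that Python reports a group that did not take part in the match as `None`, where PHP's `preg_match_all` gives an empty string by default.

```python
import re

text = "<span>I love my red car.</span>\n<span>I love my car.</span>"

# The pattern from the question, with the optional colour group.
pattern = re.compile(r"<span>(.*?)(?P<color>red)?(.*?)</span>", re.M | re.S)

# For every match, collect group 1, the named group and group 3.
rows = [(m.group(1), m.group("color"), m.group(3)) for m in pattern.finditer(text)]

for row in rows:
    # Group 1 stays empty, the optional group is skipped,
    # and the last lazy group absorbs the whole sentence.
    print(row)
```

Both sentences match, but the colour group never participates, which is the behaviour described above.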

Hmm, one workaround is to use an alternation (OR), as NickC mentioned in his answer. Another is to use a negative lookahead that is checked at each character:

<span>((?:(?!\bred\b).)*(?<colour>\bred\b)?.*)<\/span>

regex101 demo

As a side note, I would advise using the word boundaries so that you don't match things like reduce or jarred.
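
A quick way to check the lookahead version and the effect of the word boundaries is the sketch below. It uses Python's `re` module as a stand-in, which is an assumption: Python does not accept `(?<colour>...)`, so the named group is spelled `(?P<colour>...)`; everything else is the pattern from above. The third sample sentence is made up for the test.

```python
import re

# Same pattern as above, with the named group in Python's (?P<name>...) spelling.
pattern = re.compile(
    r"<span>((?:(?!\bred\b).)*(?P<colour>\bred\b)?.*)</span>"
)

samples = [
    "<span>I love my red car.</span>",
    "<span>I love my car.</span>",
    "<span>I reduce my jarred speed.</span>",  # "red" occurs only inside other words
]

# Every sample matches; the colour is captured only when "red" is a whole word.
colours = [pattern.search(s).group("colour") for s in samples]
print(colours)
```

Only the first sample yields a colour; `reduce` and `jarred` are rejected by the `\b` checks.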

Jerry
  • Thank you for the explanation on why it doesn't work! I do like my solution as it doesn't require double entry of the possible values :) – Nicole Sep 18 '13 at 07:37
  • @NickC I was about to post something a bit like yours and then you posted your answer before I could; I just didn't want to have the same regex ^^; – Jerry Sep 18 '13 at 08:57
2

This should work:

/<span>(.*?(?P<color>red).*?|.*?)<\/span>/ms

Your original expression was pretty good. I modified it slightly to make a new outer group match the whole sentence. I used that new outer group to create an "or" condition to match "anything", in case the color is not present.

Abbreviated output:

Array
    [0] => Array
            [0] => <span>I love my red car.</span>
            [1] => <span>I love my car.</span>

    [1] => Array
            [0] => I love my red car.
            [1] => I love my car.

    [color] => Array
            [0] => red
            [1] => 
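
For anyone who wants to reproduce this outside PHP, here is a small sketch using Python's `re` module (my assumption; the pattern itself is unchanged and the variable names are illustrative). Python reports the missing colour as `None` instead of an empty string. The second half is a caveat worth knowing: with the `s` flag, the first alternative is free to run across a closing tag when a sentence without the colour comes before one that has it.

```python
import re

text = "<span>I love my red car.</span>\n<span>I love my car.</span>"

pattern = re.compile(r"<span>(.*?(?P<color>red).*?|.*?)</span>", re.M | re.S)

# Whole sentence plus the optional colour for each match.
matches = [(m.group(1), m.group("color")) for m in pattern.finditer(text)]
print(matches)

# Caveat: same sentences in the opposite order. With re.S, the first
# alternative skips over the first closing tag to reach "red", so the
# two spans are reported as a single match.
swapped = "<span>I love my car.</span>\n<span>I love my red car.</span>"
swapped_count = len(list(pattern.finditer(swapped)))
print(swapped_count)
```

If the input order is not guaranteed, dropping the `s` flag (when each span sits on its own line) is one way to avoid the second effect.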
Nicole