
I'm trying to extract LogicalID and SupplyChain from the following text:

 <LogicalID>SupplyChain</Logical>

At first I used the following regex:

.*([A-Za-z]+)>([A-Za-z]+)<.*

This matched as follows:

["D", "SupplyChain"]

In a fit of desperation, I tried using the asterisk instead of the plus:

.*([A-Za-z]*)>([A-Za-z]+)<.*

This matched perfectly.

The documentation says * matches zero or more times and + matches one or more times. Why is * greedier than +?

EDIT: As pointed out in the answers below, this isn't the case. The order of operations explains why the first match group is actually empty.

duber
  • What do you mean by greedier? Have you tried swapping `.*` with `.+`? It seems that it is not greediness but the order in which they are placed that matters here. – Pshemo Dec 09 '13 at 17:38
  • It seemed like greediness, and it's actually order of execution. I've gathered this from @Aioros's answer below. – duber Dec 09 '13 at 17:40
  • Putting `?` after `*` in your first regex will also make this match work, i.e. `.*?([A-Za-z]+)>([A-Za-z]+)<.*`. I'm pointing that out just because it might help you see how things work, but @anubhava's answer is probably a better one, depending on your exact requirements. – ajb Dec 09 '13 at 17:42

3 Answers


It's not a difference in greediness. In your first regex:

.*([A-Za-z]+)>([A-Za-z]+)<.*

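You are asking for any number of characters (.*), then at least one letter, then a >. Because .* is greedy, it consumes as much as it can and gives back only what the rest of the pattern needs. A single letter is enough to satisfy [A-Za-z]+, so the first group has to be D: .* consumes everything before it.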

In the second one, instead:

.*([A-Za-z]*)>([A-Za-z]+)<.*

You want any number of characters, followed by any number of letters (possibly none), then the >. So the first .* consumes everything up to the >, and the first capture group matches an empty string. I don't think that it "matches perfectly" at all.
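The behaviour described above can be checked with a short sketch. This uses Python's re module, which is an assumption on my part since the question does not name a language; the variable names are made up for illustration. The third pattern is the lazy variant suggested in the comments on the question.

```python
import re

# Sample input from the question (leading space included).
text = " <LogicalID>SupplyChain</Logical>"

# The two patterns from the question, plus the lazy variant from the comments.
plus_pattern = re.compile(r".*([A-Za-z]+)>([A-Za-z]+)<.*")
star_pattern = re.compile(r".*([A-Za-z]*)>([A-Za-z]+)<.*")
lazy_pattern = re.compile(r".*?([A-Za-z]+)>([A-Za-z]+)<.*")

# fullmatch requires the whole string to match, like the surrounding .* implies.
plus_groups = plus_pattern.fullmatch(text).groups()
star_groups = star_pattern.fullmatch(text).groups()
lazy_groups = lazy_pattern.fullmatch(text).groups()

print(plus_groups)  # ('D', 'SupplyChain')
print(star_groups)  # ('', 'SupplyChain')
print(lazy_groups)  # ('LogicalID', 'SupplyChain')
```

In both greedy patterns the leading .* keeps as much as it can; the only difference is whether it must leave one letter behind for the first group.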

Aioros

You should really be using this regex:

<([A-Za-z]+)>([A-Za-z]+)<

OR

<([A-Za-z]*)>([A-Za-z]+)<

Both will capture LogicalID and SupplyChain in groups 1 and 2, respectively.

PS: Your regex .*([A-Za-z]*)>([A-Za-z]+)< captures an empty string in the first group.

Working Demo: http://ideone.com/VMsb6n
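For readers who cannot open the demo, here is a minimal sketch of the same idea in Python. The choice of Python and the variable names are my assumptions; the linked demo may use a different language. Since the pattern has no leading .*, it has to be applied with a search rather than a whole-string match.

```python
import re

# Sample input from the question.
text = " <LogicalID>SupplyChain</Logical>"

# Anchoring on the literal '<' leaves nothing for a greedy .* to swallow.
anchored = re.compile(r"<([A-Za-z]+)>([A-Za-z]+)<")

# search scans for the first position where the pattern matches.
match = anchored.search(text)
tag_name, tag_value = match.groups()

print(tag_name)   # LogicalID
print(tag_value)  # SupplyChain
```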

anubhava
  • I don't think this answers the question. – Konstantin Yovkov Dec 09 '13 at 17:33
  • @kocko: Please elaborate on why not. I wrote that OP's regex `.*([A-Za-z]*)>([A-Za-z]+)<` is matching an empty string as the first match. – anubhava Dec 09 '13 at 17:34
  • The question is "Why is `*` greedier than `+`?" – Konstantin Yovkov Dec 09 '13 at 17:34
  • @kocko: The OP's "observation" (that `*` is greedier than `+`) seems to be based on a mistake; he thought his second regex matched "perfectly" while in fact it caused the capture group to match an empty string. – ajb Dec 09 '13 at 17:35
  • @kocko: That is what I tried to focus on: OP's observation that `*` is greedier than `+` isn't right. (I also added a working demo as a code example.) – anubhava Dec 09 '13 at 17:37
Why is * greedier than +?

It is not a matter of greediness.

The first regex .*([A-Za-z]+)>([A-Za-z]+)<.* can be represented as

[image: diagram of the first regex]

Here the letters in Group 1 must be present one or more times for a match.

And the second, .*([A-Za-z]*)>([A-Za-z]+)<.*, as

[image: diagram of the second regex]

Here the letters in Group 1 may be present zero or more times for a match.
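To see the "one or more" versus "zero or more" difference in isolation, here is a small Python sketch. The input <>abc< is a made-up string (not from the question) in which no letter precedes the >, and the variable names are likewise illustrative.

```python
import re

# Hypothetical input: there is no letter directly before the '>'.
sample = "<>abc<"

plus_pattern = re.compile(r".*([A-Za-z]+)>([A-Za-z]+)<.*")
star_pattern = re.compile(r".*([A-Za-z]*)>([A-Za-z]+)<.*")

# With +, Group 1 needs at least one letter before '>', so nothing matches.
plus_match = plus_pattern.fullmatch(sample)

# With *, Group 1 is allowed to be empty, so the match succeeds.
star_match = star_pattern.fullmatch(sample)

print(plus_match)           # None
print(star_match.groups())  # ('', 'abc')
```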

Rakesh KR