How come my regex pattern isn't lazy? It should be capturing the first number, not the second.

Here is a working Bash script:

#!/bin/bash

text='here is some example text I want to match word1 and this number 3.01 GiB here is some extra text and another number 1.89 GiB'

regex='(word1|word2).*?number[[:blank:]]([0-9.]+) GiB'

if [[ "$text" =~ $regex ]]; then
    echo 'FULL MATCH:  '"${BASH_REMATCH[0]}"
    echo 'NUMBER CAPTURE:  '"${BASH_REMATCH[2]}"
fi

Here is the output:

FULL MATCH:  word1 and this number 3.01 GiB here is some extra text and another number 1.89 GiB
NUMBER CAPTURE:  1.89

In this online POSIX regex tester the pattern is lazy, as I expected, but in Bash it is greedy. The NUMBER CAPTURE should be 3.01, not 1.89.

oguz ismail
deanresin
  • How do I get it to match the shortest? – deanresin Aug 23 '19 at 05:37
  • It works now thanks. But why would `.*?` be undefined? Isn't it saying match anything but be lazy about it (shortest possible overall match)? And what if I don't know what characters can appear between first capture group and "number"? – deanresin Aug 23 '19 at 05:47
  • @deanresin Adding a question mark to a quantifier (as in `.*?`) to make it non-greedy is a Perl extension to regex syntax, not something that a POSIX-compliant regex system will support. That regex tester claims to be POSIX, but actually only has PCRE (perl) and Javascript options, not either POSIX basic or extended regex. – Gordon Davisson Aug 23 '19 at 05:52
  • @GordonDavisson thanks. That clears a ton of confusion. I'm definitely used to the one used with PHP (which is prob same as PERL). – deanresin Aug 23 '19 at 05:55
  • @oguzismail That is crazy there is no solution for "match anything but take the shortest". – deanresin Aug 23 '19 at 05:56

1 Answer

Regarding `.*?`, the POSIX standard says:

The behavior of multiple adjacent duplication symbols ( '+', '*', '?', and intervals) produces undefined results.

And concerning greedy matching, it says:

If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the longest such sequence is matched.
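
The longest-match rule is easy to observe in Bash itself. The following is only a small sketch; the sample string and variable names are made up for illustration and are not part of the question.

```shell
#!/bin/bash
# Hypothetical sample string with two possible end points (two ';' characters).
sample='x=1; y=2;'

# '.*' takes the longest sequence that still lets the rest of the pattern
# match, so the match runs to the last ';' rather than the first.
greedy_re='x.*;'

longest_match=''
if [[ "$sample" =~ $greedy_re ]]; then
    longest_match=${BASH_REMATCH[0]}
fi

echo "matched: $longest_match"   # matched: x=1; y=2;
```
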

In this particular case you can use `[^0-9]*` instead. It cannot cross a digit, so the match has to stop at the first number.

text='here is some example text I want to match word1 and this number 3.01 GiB here is some extra text and another number 1.89 GiB'
regex='(word1|word2)[^0-9]*number[[:blank:]]([0-9.]+) GiB'
if [[ "$text" =~ $regex ]]; then
    echo 'FULL MATCH:  '"${BASH_REMATCH[0]}";
    echo 'NUMBER CAPTURE:  '"${BASH_REMATCH[2]}";
fi

Outputs:

FULL MATCH:  word1 and this number 3.01 GiB
NUMBER CAPTURE:  3.01
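
For the follow-up case in the comments, where the characters between the keyword and `number` are not known in advance, one possible approach is to emulate the lazy step by hand with parameter expansion. This is a rough sketch rather than a general replacement for `.*?`: the helper variable names (`keyword_re`, `number_re`, `rest`, `first_number`) are invented here, and it only considers the first occurrence of the keyword.

```shell
#!/bin/bash
text='here is some example text I want to match word1 and this number 3.01 GiB here is some extra text and another number 1.89 GiB'

keyword_re='(word1|word2)(.*)'          # keyword, then everything after it
number_re='^[[:blank:]]([0-9.]+) GiB'   # what must directly follow "number"
first_number=''

if [[ "$text" =~ $keyword_re ]]; then
    rest=${BASH_REMATCH[2]}             # text after the first keyword
    # Walk through each "number" from left to right and stop at the
    # first one that is followed by "<blank><digits> GiB".
    while [[ "$rest" == *number* ]]; do
        rest=${rest#*number}            # drop the shortest prefix ending in "number"
        if [[ "$rest" =~ $number_re ]]; then
            first_number=${BASH_REMATCH[1]}
            break
        fi
    done
fi

echo "NUMBER CAPTURE:  $first_number"   # NUMBER CAPTURE:  3.01
```

`${rest#*number}` removes the shortest matching prefix, which is what stands in for the lazy quantifier here. The anchored `number_re` then checks only the text immediately after that occurrence.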
oguz ismail