
I have just been bitten by an unexpected difference in regex behavior between glibc and musl. Consider the script below:

#!/usr/bin/env bash

regex=" *([a-z ]+)+ [0-9]+"
line="  the answer is 42"

if [[ $line =~ $regex ]]; then
    echo "<${BASH_REMATCH[1]}>"
fi

When I run it with Bash 5.1.0(1) from a glibc-based distribution, such as Debian or Fedora, I get the following output:

<the answer is>

However, when running it with Bash 5.1.0(1) (same version) from a musl-based distribution, such as Alpine, I get:

<  the answer is>

Since both results are "valid" (my regex is ambiguous: the leading spaces can be matched either by the first * or inside the parentheses), I consider this a portability issue rather than a bug. To my knowledge, tools such as shellcheck cannot detect such problems.
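To illustrate the ambiguity, here is a minimal sketch of a variant that should leave the matcher no choice. It assumes the words are separated by exactly one space, which is stricter than my original regex. The group must start and end with a letter, so the leading spaces can only be consumed by the initial " *". The variable name captured is only for illustration.

```bash
#!/usr/bin/env bash

# Assumed variant of the original regex: group 1 must begin and end with a
# letter, and words are separated by exactly one space. The leading spaces
# can therefore only be matched by the initial " *".
regex=" *([a-z]+( [a-z]+)*) [0-9]+"
line="  the answer is 42"

if [[ $line =~ $regex ]]; then
    # Only group 1 is used; group 2 holds the last repeated " word" part.
    captured=${BASH_REMATCH[1]}
    echo "<$captured>"
fi
```

With this variant there is only one possible way to split the line, so I would expect <the answer is> regardless of the libc in use. It does not accept the same set of inputs as the original, though.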

Note that the regex differences mentioned on the musl page do not seem to apply to my case, since, as far as I can tell, I am not using any unsupported extensions:

musl’s regex implementation is based on TRE, with significant modifications. Some popular extensions are supported, but not all; in particular, up until version 1.1.13 it lacked some of the common extensions to POSIX BRE that add ERE-like capabilities to BRE.

Is there a way to detect such non-portable cases? Is there, for example, a built-in "ambiguity detector" for regexes, or some rule of thumb to avoid writing ambiguous regexes in the first place?
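The closest thing I can think of is a runtime self-check rather than a real detector: run the regex against a known sample at startup and compare the capture with what the script expects. The sketch below assumes a hypothetical helper named capture_matches (not a Bash builtin), and the expected value is simply whatever the script author intends.

```bash
#!/usr/bin/env bash

# Hypothetical helper: succeeds (returns 0) if regex $1 matches input $2 and
# capture group 1 equals the expected text $3; fails otherwise.
capture_matches() {
    local regex=$1 input=$2 expected=$3
    # The quoted right-hand side of == forces a literal string comparison.
    [[ $input =~ $regex ]] && [[ ${BASH_REMATCH[1]} == "$expected" ]]
}

# Probe the ambiguous regex from this question against a known sample.
if ! capture_matches " *([a-z ]+)+ [0-9]+" "  the answer is 42" "the answer is"; then
    echo "warning: regex captures differ from what this script expects" >&2
fi
```

This only catches the differences for the samples that are probed; it does not prove that a regex is unambiguous.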

anol
    Thanks, I used your formulation (also adding about a "rule of thumb for non-ambiguity") – anol Feb 09 '22 at 18:06
