I was just bitten by an unexpected difference in regex behavior between glibc and musl. Consider the script below:
#!/usr/bin/env bash
regex=" *([a-z ]+)+ [0-9]+"
line=" the answer is 42"
if [[ $line =~ $regex ]]; then
    echo "<${BASH_REMATCH[1]}>"
fi
When I run it with Bash 5.1.0(1) from a glibc-based distribution, such as Debian or Fedora, I get the following output:
<the answer is>
However, when running it with Bash 5.1.0(1) (same version) from a musl-based distribution, such as Alpine, I get:
< the answer is>
Since both results are "valid" (my regex is ambiguous: the leading space can be matched either by the first * or inside the parentheses), this is not a bug but a portability issue. To my knowledge, tools such as shellcheck cannot detect such problems.
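To make the ambiguity concrete, here is a minimal sketch of a variant in which the leading space has only one place to go. It is an illustrative rewrite, not a verified fix for every libc, and it assumes the words are separated by single spaces (the original pattern also accepted runs of spaces inside the group). The variable name `captured` is my own addition.

```bash
#!/usr/bin/env bash
# Hypothetical rewrite of the original regex: the capture group must start
# and end with a letter, so only the leading " *" can consume leading spaces.
# Assumption: words are separated by exactly one space.
regex=' *([a-z]+( [a-z]+)*) [0-9]+'
line=' the answer is 42'
captured=''
if [[ $line =~ $regex ]]; then
    # Group 1 is the outer group; group 2 holds the last " word" repetition.
    captured=${BASH_REMATCH[1]}
    echo "<$captured>"
fi
```

Because there is only one way to split this input across the pattern, the captured text should not depend on how the regex library resolves ties between subexpressions.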
Note that the regex differences mentioned on the musl page do not seem to apply to my case, since as far as I can tell I am not using any unsupported extensions:
musl’s regex implementation is based on TRE, with significant modifications. Some popular extensions are supported, but not all; in particular, up until version 1.1.13 it lacked some of the common extensions to POSIX BRE that add ERE-like capabilities to BRE.
Is there a way to detect such non-portable cases? Is there, for example, a built-in "ambiguity detector" for regexes, or some rule of thumb to avoid writing ambiguous regexes in the first place?