The following regex works fine for PCRE, Java and .NET but does not work with Python or Golang. Any clues on how to cover those last two would be very much appreciated. Ideally one regex would fit them all, but I suspect that two or more will be required.

https://regex101.com/r/E0iVVS/1

^(?<VersionTripple>(?<Major>0|[1-9][0-9]*)\.(?<Minor>0|[1-9][0-9]*)\.(?<Patch>0|[1-9][0-9]*)){1}(?<Tags>(?:\-(?<Prerelease>(?:(?=[0]{1}[0-9A-Za-z-]{0})(?:[0]{1})|(?=[1-9]{1}[0-9]*[A-Za-z]{0})(?:[0-9]+)|(?=[0-9]*[A-Za-z-]+[0-9A-Za-z-]*)(?:[0-9A-Za-z-]+)){1}(?:\.(?=[0]{1}[0-9A-Za-z-]{0})(?:[0]{1})|\.(?=[1-9]{1}[0-9]*[A-Za-z]{0})(?:[0-9]+)|\.(?=[0-9]*[A-Za-z-]+[0-9A-Za-z-]*)(?:[0-9A-Za-z-]+))*){1}){0,1}(?:\+(?<Meta>(?:[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*))){0,1})$

Expanded with some comments:

^
(?<VersionTripple>
    (?# Version fields are either zero or numeric with no leading zeros.)
    (?<Major>0|[1-9][0-9]*)
    \.
    (?<Minor>0|[1-9][0-9]*)
    \.
    (?<Patch>0|[1-9][0-9]*)
){1}
(?<Tags>
    (?# Either a prerelease tag or a prerelease tag followed by a build meta tag.)
    (?:
        (?# Prerelease tags are preceded by a single dash and have one or more dot separated identifiers.)
        \-
        (?<Prerelease>
            (?:
                (?# Look ahead, is it just zero? Then claim a zero.)
                (?=[0]{1}[0-9A-Za-z-]{0})(?:[0]{1})
                |
                (?# Look ahead, is it pure numeric? Then claim the digits.)
                (?=[1-9]{1}[0-9]*[A-Za-z]{0})(?:[0-9]+)
                |
                (?# Look ahead, is it alphanumeric? Take them all.)
                (?=[0-9]*[A-Za-z-]+[0-9A-Za-z-]*)(?:[0-9A-Za-z-]+)
            ){1}
            (?:
                (?# Look ahead, is it just zero? Then claim a zero.)
                \.(?=[0]{1}[0-9A-Za-z-]{0})(?:[0]{1})
                |
                (?# Look ahead, is it pure numeric? Then claim the digits.)
                \.(?=[1-9]{1}[0-9]*[A-Za-z]{0})(?:[0-9]+)
                |
                (?# Look ahead, is it alphanumeric? Take them all.)
                \.(?=[0-9]*[A-Za-z-]+[0-9A-Za-z-]*)(?:[0-9A-Za-z-]+)
            )*
        ){1}
    ){0,1}
    (?:
        (?# Build meta tag is preceded by a plus symbol and followed by one or more dot separated fields)
        \+
        (?<Meta>(?:[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*))
    ){0,1}
)
$

Test data:

0.0.4
1.2.3
10.20.30
1.1.2-prerelease+meta
1.1.2+meta
1.1.2+meta-valid
1.0.0-alpha
1.0.0-beta
1.0.0-alpha.beta
1.0.0-alpha.beta.1
1.0.0-alpha.1
1.0.0-alpha0.valid
1.0.0-alpha.0valid
1.0.0-alpha-a.b-c-somethinglong+build.1-aef.1-its-okay
1.0.0-rc.1+build.1
2.0.0-rc.1+build.123
1.2.3-beta
10.2.3-DEV-SNAPSHOT
1.2.3-SNAPSHOT-123
1.0.0
2.0.0
1.1.7
2.0.0+build.1848
2.0.1-alpha.1227
1.0.0-alpha+beta
1.2.3----RC-SNAPSHOT.12.9.1--.12+788
1.2.3----R-S.12.9.1--.12+meta
1.2.3----RC-SNAPSHOT.12.9.1--.12
1.0.0+0.build.1-rc.10000aaa-kk-0.1
99999999999999999999999.999999999999999999.99999999999999999
Begin Invalid

1
1.2
1.2.3-0123
1.2.3-0123.0123
1.1.2+.123
+invalid
-invalid
-invalid+invalid
-invalid.01
alpha
alpha.beta
alpha.beta.1
alpha.1
alpha+beta
alpha_beta
alpha.
alpha..
beta\
1.0.0-alpha_beta
-alpha.
1.0.0-alpha..
1.0.0-alpha..1
1.0.0-alpha...1
1.0.0-alpha....1
1.0.0-alpha.....1
1.0.0-alpha......1
1.0.0-alpha.......1
01.1.1
1.01.1
1.1.01
1.2
1.2.3.DEV
1.2-SNAPSHOT
1.2.31.2.3----RC-SNAPSHOT.12.09.1--..12+788
1.2-RC-SNAPSHOT
-1.0.3-gamma+b7718
+justmeta
9.8.7+meta+meta
9.8.7-whatever+meta+meta
99999999999999999999999.999999999999999999.99999999999999999----RC-SNAPSHOT.12.09.1--------------------------------..12
Rabbid76
jwdonahue
  • With Go, you need to install the [go-pcre library](https://github.com/d4l3k/go-pcre). With Python re, just [replace every `(?<name>` with `(?P<name>`](https://regex101.com/r/hBiuWD/1). – Wiktor Stribiżew May 22 '18 at 06:30
  • Amazing: This regex matches in ~3,500 steps in PCRE/PHP but it takes ~600,000 steps in Python's re. – wp78de May 22 '18 at 06:50
  • The `[0-9A-Za-z-]{0}` parts in the Prerelease lookaheads look strange: what is `{0}` for? Maybe you wanted a negative lookahead at this point. – Nahuel Fouilleul May 22 '18 at 07:50
  • Awesome, thanks so much for the help, all of you! – jwdonahue May 22 '18 at 17:34
  • @WiktorStribiżew, those are the magic bullets! Post as answer, I'll be happy to accept. – jwdonahue May 22 '18 at 17:34
  • @wp78de, you should see some of our earlier attempts at this. On .NET the backtracking was so problematic that some nearly valid strings were probably never going to stop backtracking (I let one instance run for several hours). I'm sort of relearning regex all over again after several years of not using it. – jwdonahue May 22 '18 at 17:34
  • @NahuelFouilleul, yes, the `[0-9A-Za-z-]{0}` is intended to be a negative match. The rule for numeric fields is no leading zeros, but a single zero value must be allowed. – jwdonahue May 22 '18 at 17:34
  • Glad to help. There may well be other PCRE libraries for Go, but I do not know which one is better; I rarely use regex in Go. The point here is that the Go regex package just does not support lookarounds, neither lookbehinds nor lookaheads. – Wiktor Stribiżew May 22 '18 at 20:58
  • I wonder why the downvote? – jwdonahue May 26 '18 at 17:12
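
On the `{0}` point raised in the comments above, here is a minimal Python sketch (the three pattern names are illustrative, not from the original post). It suggests that `[0-9A-Za-z-]{0}` always matches the empty string, so `(?=[0]{1}[0-9A-Za-z-]{0})` behaves like `(?=0)`, whereas a negative lookahead really does refuse a zero that is followed by another identifier character.

```python
import re

# {0} means "exactly zero repetitions", which always succeeds on the empty string,
# so this lookahead only checks that the next character is 0.
zero_reps = re.compile(r"(?=[0]{1}[0-9A-Za-z-]{0})(?:[0]{1})")

# The same check written without the no-op parts.
plain = re.compile(r"(?=0)0")

# A negative lookahead asserts that no identifier character follows the 0.
negative = re.compile(r"0(?![0-9A-Za-z-])")

for text in ("0", "0123", "0.1"):
    print(text,
          bool(zero_reps.match(text)),
          bool(plain.match(text)),
          bool(negative.match(text)))
```

For `0123` the first two patterns still match the leading zero and only the negative lookahead rejects it. The full regex nevertheless rejects `1.2.3-0123`, because whatever follows the claimed zero must be a dot, a plus sign, or the end of the string.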

1 Answer

Following advice from @WiktorStribiżew, I modified the named capture groups and removed the redundant quantifiers. It's now working at regex101.com (v3) on everything but golang. Looks like we'll just have to advise the golang users to grab the go-pcre library.

I will update this post if I become aware of any more efficient/correct way to do this.

^(?<VersionTripple>(?<Major>0|[1-9][0-9]*)\.(?<Minor>0|[1-9][0-9]*)\.(?<Patch>0|[1-9][0-9]*))(?<Tags>(?:\-(?<Prerelease>(?:(?=[0]{1}[0-9A-Za-z-]{0})(?:[0]{1})|(?=[1-9]{1}[0-9]*[A-Za-z]{0})(?:[0-9]+)|(?=[0-9]*[A-Za-z-]+[0-9A-Za-z-]*)(?:[0-9A-Za-z-]+)){1}(?:\.(?=[0]{1}[0-9A-Za-z-]{0})(?:[0]{1})|\.(?=[1-9]{1}[0-9]*[A-Za-z]{0})(?:[0-9]+)|\.(?=[0-9]*[A-Za-z-]+[0-9A-Za-z-]*)(?:[0-9A-Za-z-]+))*)){0,1}(?:\+(?<Meta>(?:[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*))){0,1})$

Expanded version with comments:

^
(?<VersionTripple>
    (?# Version fields are either zero or numeric with no leading zeros.)
    (?<Major>0|[1-9][0-9]*)
    \.
    (?<Minor>0|[1-9][0-9]*)
    \.
    (?<Patch>0|[1-9][0-9]*)
)
(?<Tags>
    (?# Either a prerelease tag or a prerelease tag followed by a build meta tag.)
    (?:
        (?# Prerelease tags are preceded by a single dash and have one or more dot separated identifiers.)
        \-
        (?<Prerelease>
            (?:
                (?# Look ahead, is it just zero? Then claim a zero.)
                (?=[0]{1}[0-9A-Za-z-]{0})(?:[0]{1})
                |
                (?# Look ahead, is it pure numeric? Then claim the digits.)
                (?=[1-9]{1}[0-9]*[A-Za-z]{0})(?:[0-9]+)
                |
                (?# Look ahead, is it alphanumeric? Take them all.)
                (?=[0-9]*[A-Za-z-]+[0-9A-Za-z-]*)(?:[0-9A-Za-z-]+)
            ){1}
            (?:
                (?# Look ahead, is it just zero? Then claim a zero.)
                \.(?=[0]{1}[0-9A-Za-z-]{0})(?:[0]{1})
                |
                (?# Look ahead, is it pure numeric? Then claim the digits.)
                \.(?=[1-9]{1}[0-9]*[A-Za-z]{0})(?:[0-9]+)
                |
                (?# Look ahead, is it alphanumeric? Take them all.)
                \.(?=[0-9]*[A-Za-z-]+[0-9A-Za-z-]*)(?:[0-9A-Za-z-]+)
            )*
        )
    ){0,1}
    (?:
        (?# Build meta tag is preceded by a plus symbol and followed by one or more dot separated fields)
        \+
        (?<Meta>(?:[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*))
    ){0,1}
)
$
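
For Python users, the named-group change can be sketched as follows. This is only an illustration under stated assumptions: the names `PATTERN`, `PY_PATTERN` and `SEMVER` are made up for the example, and the pattern text is the single-line regex above, split across string literals purely for readability. The `(?<Name>` spelling stays as it is for .NET and Java; only Python's `re` needs the `(?P<Name>` form.

```python
import re

# The single-line pattern from this answer, unchanged (.NET/PCRE style named groups).
PATTERN = (
    r"^(?<VersionTripple>(?<Major>0|[1-9][0-9]*)\.(?<Minor>0|[1-9][0-9]*)\.(?<Patch>0|[1-9][0-9]*))"
    r"(?<Tags>(?:\-(?<Prerelease>"
    r"(?:(?=[0]{1}[0-9A-Za-z-]{0})(?:[0]{1})"
    r"|(?=[1-9]{1}[0-9]*[A-Za-z]{0})(?:[0-9]+)"
    r"|(?=[0-9]*[A-Za-z-]+[0-9A-Za-z-]*)(?:[0-9A-Za-z-]+)){1}"
    r"(?:\.(?=[0]{1}[0-9A-Za-z-]{0})(?:[0]{1})"
    r"|\.(?=[1-9]{1}[0-9]*[A-Za-z]{0})(?:[0-9]+)"
    r"|\.(?=[0-9]*[A-Za-z-]+[0-9A-Za-z-]*)(?:[0-9A-Za-z-]+))*)){0,1}"
    r"(?:\+(?<Meta>(?:[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*))){0,1})$"
)

# Python's re module spells named groups (?P<Name>...), so rewrite each "(?<Name>" opener.
# Requiring a letter after "<" leaves lookbehinds such as (?<= or (?<! alone
# (this pattern has none, the guard is just defensive).
PY_PATTERN = re.sub(r"\(\?<([A-Za-z]\w*)>", r"(?P<\1>", PATTERN)
SEMVER = re.compile(PY_PATTERN)

# A few entries from the question's test data.
for text in ("1.0.0-alpha.1+build.123", "1.2.3-0123", "1.1.2+.123"):
    m = SEMVER.match(text)
    print(text, "->", m.groupdict() if m else "no match")
```

The first string matches with `Major` = `1`, `Prerelease` = `alpha.1` and `Meta` = `build.123`; the other two come from the invalid list and do not match.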
jwdonahue
  • There are other PCRE modules for go-lang. I cannot tell which one is the best, but most are forked from https://github.com/glenn-brown/golang-pkg-pcre (which is probably outdated). – wp78de May 22 '18 at 18:57
  • @jwdonahue, is it more efficient to have the lookaheads? On which platforms are you forced to have `{0,1}` instead of `?` and `{1}` instead of nothing? Or does this improve performance? Also, the build tag can exist without a pre-release tag. – gvlx May 25 '18 at 08:02
  • @gvlx, I don't think it is possible to match all the legal semver strings and fail all the strings that aren't, without the lookaheads. That said, there are some semver regexes out there that don't have them and get everything right except the rules regarding leading zeros in prerelease numeric fields. But hand them just one degenerate negative example, and the backtracking is so severe that it can take hours for them to finally give up, if they ever do. – jwdonahue May 26 '18 at 00:38
  • @gvlx, as for the {0,1} vs ? choices, it's mostly that I am lesdyxic and aphasic, and {0,1} is how I get it right when I can't decide between + and ?. As far as the regex parser is concerned, the question mark is fewer characters, but the overhead of the quantitative range syntax, really isn't too bad. – jwdonahue May 26 '18 at 00:42
  • @jwdonahue, may you please add those examples of severe backtracking in the tests? – gvlx May 29 '18 at 11:39
  • That would be the last entry for sure. There's one or two others that are a bit shorter (look for "..12" and/or lots of dashes). – jwdonahue May 29 '18 at 18:35
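
On the question of whether the lookaheads are strictly needed: one possible lookahead-free formulation, offered only as a sketch, encodes the three prerelease identifier shapes directly as alternatives. It is close in shape to the pattern suggested in the Semantic Versioning 2.0.0 FAQ. The name `SEMVER_NO_LOOKAHEAD` is hypothetical, and the `VersionTripple` and `Tags` wrapper groups are omitted for brevity. It is written for Python here and has not been run under Go, but it uses only named groups, non-capturing groups and alternation, which Go's RE2-based syntax also documents.

```python
import re

SEMVER_NO_LOOKAHEAD = re.compile(
    # Version fields: zero, or a number with no leading zero.
    r"^(?P<Major>0|[1-9][0-9]*)\.(?P<Minor>0|[1-9][0-9]*)\.(?P<Patch>0|[1-9][0-9]*)"
    # Each prerelease identifier is one of: a lone 0, a number without a leading
    # zero, or an identifier containing at least one letter or dash.
    r"(?:-(?P<Prerelease>(?:0|[1-9][0-9]*|[0-9]*[A-Za-z-][0-9A-Za-z-]*)"
    r"(?:\.(?:0|[1-9][0-9]*|[0-9]*[A-Za-z-][0-9A-Za-z-]*))*))?"
    # Build metadata: dot separated identifiers, leading zeros allowed.
    r"(?:\+(?P<Meta>[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*))?$"
)

# Samples taken from the question's valid and invalid lists.
samples = [
    "1.0.0-alpha.0valid",
    "1.2.3----RC-SNAPSHOT.12.9.1--.12+788",
    "1.2.3-0123",
    "1.0.0-alpha..1",
]
for text in samples:
    m = SEMVER_NO_LOOKAHEAD.match(text)
    print(text, "->", m.groupdict() if m else "no match")
```

Because a digits-only identifier can only satisfy the first two alternatives, `1.2.3-0123` is still rejected, and each identifier can be matched in only one way, which should keep backtracking modest.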