
Explicit rules:

  1. The string has two parts, separated using a semicolon.

  2. The first part may contain alphanumeric characters, dashes, underscores and dots.

  3. The second part contains key-value pairs. Each key is assigned its value with an equals sign, the pairs are separated by commas, and the number of pairs is not known in advance.

Examples:

  • blahblahblah;first=1,second=two
  • bl.hbl-hbl_hbl4hbl4h;first=1,second=two,third=thr33

The best I've come up with so far is ([A-Za-z1-9_\-\.]+);(((.+?)(?:,|$))+) which is obviously far from correct. I am not good at writing regexps with lookaheads, lookbehinds, and other relatively advanced stuff in regex but I hope that a regex solution exists for this problem.
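To make the shortcomings concrete, here is a small sketch that runs the pattern above against one well-formed and one malformed input. It uses Python's `re` module rather than PHP, on the assumption that this particular pattern behaves the same way in both engines; the helper name `show` is made up for illustration. Two things stand out: a repeated capturing group keeps only its last iteration, and `.+?` accepts an item with no `=` at all. The range `1-9` also leaves out the digit `0`.

```python
import re

# The pattern from the question, copied unchanged.
ATTEMPT = re.compile(r"([A-Za-z1-9_\-\.]+);(((.+?)(?:,|$))+)")

def show(text):
    """Hypothetical helper: return the captured groups, or None if nothing matches."""
    m = ATTEMPT.search(text)
    return m.groups() if m else None

# Well-formed input: the inner groups only remember the LAST pair.
good = show("blahblahblah;first=1,second=two")

# Malformed input (second item has no '='): still accepted.
bad = show("bl.hbl-hbl_hbl4hbl4h;first=1,ssssssssss")

# A first part containing the digit 0 is rejected because of the 1-9 range.
zero = show("abc0;k=v")

print(good)
print(bad)
print(zero)
```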

If the regex engine matters, I am using the Perl-compatible regex engine (PCRE) in PHP 8.1.

Wiktor Stribiżew
  • Your regex does work: https://regex101.com/r/46xDV6/1 what isn't it doing that you want? – sniperd May 25 '23 at 14:02
  • @sniperd It is capturing `second=two;third=thr33` as one group while I want them to be captured separately. Also, it can capture unwanted things as well. For example, it can capture this: `bl.hbl-hbl_hbl4hbl4h;first=1,ssssssssss` which is clearly not formatted correctly – PizzaIsLove May 25 '23 at 14:09
  • You said _"pairs are comma separated"_ but also that `second=two;third=thr33` should be two pairs. You can't have both. You also said _"has two parts, separated using a semicolon"_ but then you show us a string with three parts. – Alex Howansky May 25 '23 at 14:31
  • @AlexHowansky I'm terribly sorry. It was a typo. I fixed it. – PizzaIsLove May 25 '23 at 14:33

1 Answer


You can try with the following regex:

^([\w.-]+);([A-Za-z]\w*=\w+(?:,[A-Za-z]\w*=\w+)*)$

Regex Explanation:

  • ^: start of string
  • ([\w.-]+): first string, made of alphanumeric characters, underscores, dashes and dots
  • ;: semicolon
  • ([A-Za-z]\w*=\w+(?:,[A-Za-z]\w*=\w+)*): key-value pairs
    • [A-Za-z]: one alphabetical character (the key must start with a letter)
    • \w*: zero or more word characters (letters, digits, underscores)
    • =: equals sign
    • \w+: one or more word characters (the value)
    • (?:,[A-Za-z]\w*=\w+)*: non-capturing group, repeated zero or more times, matching any further key-value pairs
      • ,: comma
      • [A-Za-z]: one alphabetical character
      • \w*: zero or more word characters
      • =: equals sign
      • \w+: one or more word characters
  • $: end of string
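
As a quick sanity check of the explanation above, the following sketch runs the pattern against the strings from the question plus two extra cases. It is written in Python rather than PHP, assuming the pattern behaves the same in both engines for these inputs; the `re.ASCII` flag is added so that `\w` stays limited to `[A-Za-z0-9_]`, which is roughly what PHP's PCRE does without the `u` modifier. The last two sample strings are invented for illustration.

```python
import re

# The answer's pattern; re.ASCII restricts \w to [A-Za-z0-9_].
PATTERN = re.compile(
    r"^([\w.-]+);([A-Za-z]\w*=\w+(?:,[A-Za-z]\w*=\w+)*)$", re.ASCII
)

samples = [
    "blahblahblah;first=1,second=two",                      # valid
    "bl.hbl-hbl_hbl4hbl4h;first=1,second=two,third=thr33",  # valid
    "bl.hbl-hbl_hbl4hbl4h;first=1,ssssssssss",              # second item has no '='
    "name;a=1",                                             # one-letter key, allowed by \w*
    "name;1st=x",                                           # key does not start with a letter
]

# Map each sample to True (matches) or False (rejected).
results = {s: bool(PATTERN.match(s)) for s in samples}

for s, ok in results.items():
    print(f"{'match   ' if ok else 'no match'}  {s}")
```

Only the first, second and fourth samples should be accepted. Because the whole pattern is anchored with `^` and `$`, a single bad item rejects the entire string.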

Check the demo here.

lemon
  • You could write the first part as `([\w.-]+)` because `\w` also matches an underscore, and it does not have to be non greedy because the characters in the character class can not cross matching `;` Also note that `[A-Za-z]\w+` matches at least 2 characters. You could also omit the superfluous non capture group in the second part, like `^([\w.-]+);([A-Za-z]\w+=\w+(?:,[A-Za-z]\w+=\w+)*)$` – The fourth bird May 25 '23 at 14:57
  • Thank you for the thorough explanation of how the regex you provided works. Much appreciated. – PizzaIsLove May 25 '23 at 14:58
  • There's always something to learn from you @Thefourthbird. The only issue I'm having with regex101 is that it looks like dash needs to be escaped inside the characters selection, probably because it attempts to catch a range. – lemon May 25 '23 at 15:00
  • @lemon In that case you can put it at the end of the pattern. – The fourth bird May 25 '23 at 15:01
  • Looks definitely better, thanks a lot. – lemon May 25 '23 at 15:04
  • Was trying to make a [regex](https://regex101.com/r/KKs9VU/1) that matched all key-value pairs separately, exploiting `\G`. Although this one would match also key-value pairs in non-correct strings. Do you have any hint on how to prevent it? – lemon May 25 '23 at 15:12
  • @lemon Sorry, I did not see your comment. Did you mean something like this with `\G` ? https://regex101.com/r/h5rtP4/1 – The fourth bird May 25 '23 at 17:24
  • Yeah, it may only require an additional non-capturing group to make comma+potential key-value pairs optional, to match if only one key-value pair appears in the string, something like [this](https://regex101.com/r/h5rtP4/2). I've tried with lookarounds with no success for a bit of time, although looks like you nailed it. – lemon May 25 '23 at 17:33
  • Maybe if we do the positive lookahead right after the first alternative, it might be less steps when there are more key/value pairs. https://regex101.com/r/vT25lO/1 – The fourth bird May 25 '23 at 20:03
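
The `\G` patterns discussed in these comments sit behind the regex101 links and are not reproduced here. As an alternative sketch of the same goal (each pair captured separately, malformed strings rejected), one can validate the whole string first and only then split the second part. The example is in Python, whose `re` module has no `\G`; the function name `parse` is hypothetical. In PHP the same two steps would presumably be a `preg_match` call followed by `explode`.

```python
import re

# Same validation pattern as in the answer; re.ASCII restricts \w to [A-Za-z0-9_].
FULL = re.compile(
    r"^([\w.-]+);([A-Za-z]\w*=\w+(?:,[A-Za-z]\w*=\w+)*)$", re.ASCII
)

def parse(text):
    """Hypothetical helper: return (name, pairs) or None if text is malformed."""
    m = FULL.match(text)
    if m is None:
        return None
    name, raw_pairs = m.groups()
    # Validation guarantees every item looks like key=value, with no stray
    # ',' or '=' inside keys or values, so plain splitting is safe here.
    pairs = dict(item.split("=", 1) for item in raw_pairs.split(","))
    return name, pairs

print(parse("bl.hbl-hbl_hbl4hbl4h;first=1,second=two,third=thr33"))
print(parse("bl.hbl-hbl_hbl4hbl4h;first=1,ssssssssss"))
```

One design consequence of using a `dict`: if a key appears twice, the later value silently replaces the earlier one. A list of tuples would preserve duplicates if that matters.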