5

I have the following input text:

@"This is some text @foo=bar @name=""John \""The Anonymous One\"" Doe"" @age=38"

I would like to parse the values with the @name=value syntax as name/value pairs. Parsing the previous string should result in the following named captures:

name:"foo"
value:"bar"

name:"name"
value:"John \""The Anonymous One\"" Doe"

name:"age"
value:"38"

I tried the following regex, which got me almost there:

@"(?:(?<=\s)|^)@(?<name>\w+[A-Za-z0-9_-]+?)\s*=\s*(?<value>[A-Za-z0-9_-]+|(?="").+?(?=(?<!\\)""))"

The primary issue is that it captures the opening quote in "John \""The Anonymous One\"" Doe". I feel like this should be a lookbehind instead of a lookahead, but that doesn't seem to work at all.

Here are some rules for the expression:

  • Name must start with a letter and can contain any letter, number, underscore, or hyphen.

  • Unquoted value must have at least one character and can contain any letter, number, underscore, or hyphen.

  • Quoted value can contain any character including any whitespace and escaped quotes.

Edit:

Here's the result from regex101.com:

(?:(?<=\s)|^)@(?<name>\w+[A-Za-z0-9_-]+?)\s*=\s*(?<value>(?<!")[A-Za-z0-9_-]+|(?=").+?(?=(?<!\\)"))

(?:(?<=\s)|^) Non-capturing group
@ matches the character @ literally
(?<name>\w+[A-Za-z0-9_-]+?) Named capturing group name
\s* match any white space character [\r\n\t\f ]
= matches the character = literally
\s* match any white space character [\r\n\t\f ]
    Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
(?<value>(?<!")[A-Za-z0-9_-]+|(?=").+?(?=(?<!\\)")) Named capturing group value
    1st Alternative: [A-Za-z0-9_-]+
        [A-Za-z0-9_-]+ match a single character present in the list below
            Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
            A-Z a single character in the range between A and Z (case sensitive)
            a-z a single character in the range between a and z (case sensitive)
            0-9 a single character in the range between 0 and 9
            _- a single character in the list _- literally
    2nd Alternative: (?=").+?(?=(?<!\\)")
        (?=") Positive Lookahead - Assert that the regex below can be matched
            " matches the characters " literally
        .+? matches any character (except newline)
            Quantifier: +? Between one and unlimited times, as few times as possible, expanding as needed [lazy]
        (?=(?<!\\)") Positive Lookahead - Assert that the regex below can be matched
            (?<!\\) Negative Lookbehind - Assert that it is impossible to match the regex below
                \\ matches the character \ literally
            " matches the characters " literally
Wiktor Stribiżew
  • Have you considered using [JSON](https://github.com/JamesNK/Newtonsoft.Json/) instead? – yazanpro May 04 '15 at 00:34
  • Side note: consider whether there is an existing parser for whatever you are trying to parse (SQL?)... At the very least, re-format and annotate your regular expression so the average person can reason about it (an easy way is to use https://regex101.com/ and then clean up the explanation a bit)... – Alexei Levenkov May 04 '15 at 00:44
  • JSON is not an option. This is not for SQL or any existing technology which has an existing parser. This is a very specific use case. – Anthony Grescavage May 04 '15 at 01:06

2 Answers

1

You can use a very useful .NET regex feature where multiple same-named captures are allowed. Also, there is an issue with your (?<name>) capture group: it allows a digit in the first position, which does not meet your 1st requirement.

So, I suggest:

(?si)(?:(?<=\s)|^)@(?<name>\w+[a-z0-9_-]+?)\s*=\s*(?:(?<value>[a-z0-9_-]+)|(?:"")?(?<value>.+?)(?=(?<!\\)""))

See demo

Note that you cannot debug .NET-specific regexes at regex101.com; you need to test them in a .NET-compliant environment.
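To make the same-named capture idea concrete, here is a minimal C# sketch. It uses the pattern above with one change: the name group is tightened to `[a-z][a-z0-9_-]*?` so it must start with a letter, as the first rule in the question requires. The class and method names (`NameValueParser`, `Parse`) are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NameValueParser
{
    // Pattern from the answer; only the name group differs (must start with a letter).
    // Inside a verbatim string, "" stands for one literal quote character.
    // Both alternatives capture into the same group name, "value".
    private const string Pattern =
        @"(?si)(?:(?<=\s)|^)@(?<name>[a-z][a-z0-9_-]*?)\s*=\s*" +
        @"(?:(?<value>[a-z0-9_-]+)|(?:"")?(?<value>.+?)(?=(?<!\\)""))";

    // Hypothetical helper: returns the name/value pairs in order of appearance.
    public static List<KeyValuePair<string, string>> Parse(string input)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            // Whichever alternative matched, its text is available as Groups["value"].
            pairs.Add(new KeyValuePair<string, string>(
                m.Groups["name"].Value, m.Groups["value"].Value));
        }
        return pairs;
    }
}
```

Calling `NameValueParser.Parse` on the input string from the question should yield three pairs: `foo`/`bar`, `name`/`John \"The Anonymous One\" Doe`, and `age`/`38`. The backslash escapes inside the quoted value are left as they are; removing them would be a separate step.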

Wiktor Stribiżew
  • I didn't realize you could use multiple capture groups that way. That completely solves my problem. Thanks! – Anthony Grescavage May 04 '15 at 11:28
  • The only thing I don't understand at this point is why the last part of the expression didn't work in the original. If I used a literal quote, it would capture the expression (including the quote, which I didn't want). If I used a lookbehind, the expression would capture if used in isolation, but would not work when added to the full expression. I blame my lack of knowledge of how the lookbehind functions, but I'm not really certain. Any additional knowledge here would be useful for educational purposes. – Anthony Grescavage May 04 '15 at 11:31
  • I tend to use regex hero for a quick online test. It's Silverlight based, so it's fairly reliable as a means of quick and dirty testing. – Anthony Grescavage May 04 '15 at 11:35
  • Regexstorm is good, too, I tested both, and I use Expresso, too. – Wiktor Stribiżew May 04 '15 at 11:35
  • I also fixed the name capture to look like `(?<name>[a-z][a-z0-9_-]*?)`. – Anthony Grescavage May 04 '15 at 11:44
  • Thanks all around. You have been extremely helpful. – Anthony Grescavage May 04 '15 at 11:45
  • From what I can see `(?=").+?` is not what you wanted because the positive *look-ahead* was checking if we have a `"` right after a `=` or `= `. So, it was tested for, and then captured. A look-behind `(?<=")` would also just check if there is a `"` before the first character after the equal sign, and then it does not match. Removing the look-around just forces the `"` to be consumed. – Wiktor Stribiżew May 04 '15 at 11:46
  • I just want to note that I am using inline flags `(?si)` to force ignorecase and singleline modes. You can use `RegexOptions.Singleline` and `RegexOptions.IgnoreCase` if you do not want to use those inline options. – Wiktor Stribiżew May 04 '15 at 11:48
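
The look-around behaviour discussed in these comments can be checked with a small C# sketch. It runs three variants of the quoted-value part against a simplified input of the form `x="a b"`. The helper name `LookaroundDemo.ValueOf` and the `<no match>` marker are hypothetical, introduced only for this illustration.

```csharp
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Hypothetical helper: returns the text captured by the "value" group,
    // or the marker "<no match>" when the pattern does not match at all.
    public static string ValueOf(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups["value"].Value : "<no match>";
    }
}
```

With the input `x="a b"`, the three variants behave as follows:

  • `=(?<value>(?="").+?(?=(?<!\\)""))` (look-ahead, as in the question) gives `"a b`. The look-ahead only tests for the quote without consuming it, so `.+?` starts at the quote and captures it.
  • `=(?<value>(?<="").+?(?=(?<!\\)""))` (look-behind) gives `<no match>`. Right after `=`, the previous character is `=`, not a quote, so the assertion fails.
  • `=""(?<value>.+?)(?=(?<!\\)"")` (quote consumed outside the group) gives `a b`.

The patterns are written as they would appear inside a C# verbatim string, where `""` stands for one quote.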
0

Use string methods.

Split

string myLongString = @"This is some text @foo=bar @name=""John \""The Anonymous One\"" Doe"" @age=38";

string[] nameValues = myLongString.Split('@');

From there, either split each element on "=" or use IndexOf("=") to separate the name from the value.
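
A minimal sketch of this approach, using `IndexOf('=')` on each segment, might look like the following. The names `SplitParser` and `Parse` are hypothetical. The sketch assumes that `@` never appears inside a value or in the surrounding text, and that nothing follows an unquoted value before the next `@`; it does not validate names against the rules in the question.

```csharp
using System;
using System.Collections.Generic;

public static class SplitParser
{
    // Hypothetical helper: splits on '@', then separates name and value at the first '='.
    public static List<KeyValuePair<string, string>> Parse(string input)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        string[] segments = input.Split('@');

        // segments[0] is the text before the first '@', so start at index 1.
        for (int i = 1; i < segments.Length; i++)
        {
            string segment = segments[i];
            int eq = segment.IndexOf('=');
            if (eq <= 0) continue; // skip segments with no '=' or an empty name

            string name = segment.Substring(0, eq).Trim();
            string value = segment.Substring(eq + 1).Trim();

            // Strip one pair of surrounding quotes, if present.
            if (value.Length >= 2 && value[0] == '"' && value[value.Length - 1] == '"')
            {
                value = value.Substring(1, value.Length - 2);
            }

            pairs.Add(new KeyValuePair<string, string>(name, value));
        }
        return pairs;
    }
}
```

On the input string from the question this produces the same three pairs the question asks for. It is more fragile than the regex: an `@` inside a quoted value, for example, would split that value in two.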

Mukus