0

I am looking for a C# regex solution to match/capture some small but complex chunks of data. I have thousands of unstructured chunks of data in my database (they come from a third-party data store) that look similar to this:

not BATTCOMPAR{275} and FORKCARRIA{ForkSpreader} and SIDESHIFT{WithSSPassAttachCenterLine} and TILTANGLE{4up_2down} and not AUTOMATSS{true} and not FORKLASGUI{true} and not FORKCAMSYS{true} and OKED{true}

I want to be able to split that up into discrete pieces (regex match/capture) like the following:

not BATTCOMPAR{275} 
and FORKCARRIA{ForkSpreader} 
and SIDESHIFT{WithSSPassAttachCenterLine} 
and TILTANGLE{4up_2down} 
and not AUTOMATSS{true} 
and not FORKLASGUI{true} 
and not FORKCAMSYS{true} 
and OKED{true}
CONTAINER{Container}

The data will always conform to the following rules:

  • At the end of each chunk of data there will be a string enclosed by curly braces, like this: {275}
  • The "curly brace grouping" will always come at the end of a string beginning with `not`, `and`, `and not`, or nothing. "Nothing" is equivalent to `and` and will only occur when it's the first chunk in the string. For example, if my `and OKED{true}` had come at the beginning of the string, the `and` would have been omitted and `OKED{true}` would have been prefixed by nothing (empty string). But it still means `and`.
  • After the operator (`and`, `not`, `and not`, or nothing) there will always be a string designator that ends just before the curly brace grouping. Example: `BATTCOMPAR`
  • It appears that the string designator will always touch the curly brace grouping with no space in between but I'm not 100% sure. The regex should accommodate the scenario in which a space might come between the string designator and the left curly brace.
  • Summary #1 of above points: each chunk will have 3 distinct sub-groups: operator (such as and not), string designator (such as BATTCOMPAR), and curly brace grouping (such as {ForkSpreader}).
  • Summary #2 of above points: each chunk will begin with one of the 3 listed operators, or nothing, and end with a right-curly-brace. It is guaranteed that only 1 left-curly-brace and only 1 right-curly-brace will exist within the entire segment, and they will always be grouped together at the end of the segment. There is no fear of encountering additional/stray curly braces in other parts of the segment.

I have experimented with a few different regex constructions:

Match curly brace groupings:

Regex regex = new Regex(@"{(.*?)}");
return regex.Matches(str);

The above almost works, but gets only the curly brace groupings and not the operator and string designator that goes with it.

Capture chunks based on string prefix, trying to match operator strings:

var capturedWords = new List<string>();
string regex = $@"(?<!\w){prefix}\w+";

foreach ( Match match in Regex.Matches(haystack, regex) ) {
    capturedWords.Add(match.Value);
}

return capturedWords;

The above partially works, but gets only the operators, and not the entire chunk I need: (operator + string designator + curly brace grouping)

halfer
HerrimanCoder
  • Without having dug too deep into the requirements, what would be wrong with just [splitting at e.g. `(?<=})\s+`](https://regex101.com/r/LOJvCV/1) (whitespace after a closing brace) – bobble bubble Sep 27 '22 at 17:03
  • I included the example from your follow-up question in [my answer](https://stackoverflow.com/a/73945745/3832970), it works for all your inputs now. Note that `[and\s|or\s|not\s]+` matches a whitespace, or `a`, `n`, `d` etc. chars that are defined in the character class, that does not match sequences of chars. – Wiktor Stribiżew Oct 04 '22 at 12:55
  • Please let me know if my answer does not help you, I really thought it was what you needed. I will remove it if there is no value in it. – Wiktor Stribiżew Oct 05 '22 at 10:35
  • Wiktor, thanks so much for proposing a solution. What ended up solving my problem was this: `([and\\s|or\\s|not\\s|]+?.*?\\{.*?\\}|.*?\\{.*?\\})` - provided by Ryan. – HerrimanCoder Oct 06 '22 at 15:11
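
The splitting idea from the first comment can be sketched in C# as follows. This is a minimal sketch, assuming chunks are always separated by whitespace right after a closing brace; the `SplitDemo` class and `SplitChunks` method are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Hypothetical helper: splits on whitespace that directly follows a closing brace.
    // The lookbehind (?<=\}) keeps the brace in the preceding chunk.
    public static string[] SplitChunks(string input) =>
        Regex.Split(input.Trim(), @"(?<=\})\s+");

    public static void Main()
    {
        var text = "not BATTCOMPAR{275} and FORKCARRIA{ForkSpreader} and not AUTOMATSS{true}\nCONTAINER{Container}";
        foreach (var chunk in SplitChunks(text))
        {
            Console.WriteLine(chunk);
        }
    }
}
```

Splitting yields whole chunks only. If the operator, designator, and brace content are needed separately, each chunk still has to be matched against a pattern with capture groups.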

2 Answers

0

You need to use

\b((?:and|or)(?:\s+not)?|not)?\s*(\w+){([^{}]*)}

See the regex demo. Details:

  • \b - a word boundary
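  • ((?:and|or)(?:\s+not)?|not)? - an optional Group 1: either `and` or `or`, optionally followed by one or more whitespaces and `not`; or `not` on its own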
  • \s* - zero or more whitespaces (* means zero or more)
  • (\w+) - Group 2: one or more word chars
  • { - a { char
  • ([^{}]*) - Group 3: any zero or more chars other than { and }
  • } - a } char.

See the C# demo:

var text = "not BATTCOMPAR{275} and FORKCARRIA{ForkSpreader} and SIDESHIFT{WithSSPassAttachCenterLine} and TILTANGLE{4up_2down} and not AUTOMATSS{true} and not FORKLASGUI{true} and not FORKCAMSYS{true} and OKED{true}\nCONTAINER{Container}";
var results = Regex.Matches(text, @"\b((?:and|or)(?:\s+not)?|not)?\s*(\w+){([^{}]*)}");
foreach (Match m in results) 
{
    Console.WriteLine("{0} : {1} : {2}", m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value);
}

Output:

not : BATTCOMPAR : 275
and : FORKCARRIA : ForkSpreader
and : SIDESHIFT : WithSSPassAttachCenterLine
and : TILTANGLE : 4up_2down
and not : AUTOMATSS : true
and not : FORKLASGUI : true
and not : FORKCAMSYS : true
and : OKED : true
 : CONTAINER : Container
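
The question also asks for a possible space between the string designator and the left curly brace, which the pattern above does not allow. Below is a minimal sketch of one way to handle it, assuming a `\s*` before the brace is acceptable for the data. The named groups (`op`, `name`, `value`), the `ChunkDemo` class, and the `Parse` method are illustrative names, not part of the original pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ChunkDemo
{
    // Same structure as the pattern above, with two assumed tweaks:
    // named groups, and \s* between the designator and the opening brace.
    private static readonly Regex ChunkPattern = new Regex(
        @"\b(?<op>(?:and|or)(?:\s+not)?|not)?\s*(?<name>\w+)\s*\{(?<value>[^{}]*)\}");

    public static List<(string Op, string Name, string Value)> Parse(string text)
    {
        var result = new List<(string Op, string Name, string Value)>();
        foreach (Match m in ChunkPattern.Matches(text))
        {
            // Per the question's rules, a missing operator is treated as "and".
            var op = m.Groups["op"].Success ? m.Groups["op"].Value : "and";
            result.Add((op, m.Groups["name"].Value, m.Groups["value"].Value));
        }
        return result;
    }

    public static void Main()
    {
        var text = "not BATTCOMPAR {275} and not AUTOMATSS{true} OKED{true}";
        foreach (var (op, name, value) in Parse(text))
        {
            Console.WriteLine($"{op} | {name} | {value}");
        }
    }
}
```

Defaulting the empty operator to `and` is a design choice based on the rules stated in the question. Drop that line if the empty string should be preserved instead.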
Wiktor Stribiżew
-1

This works for me: /([and\s|or\s|not\s]+)?.*?(\{.*?\})/mg on Regex Tester.

On DotNet Fiddle, this worked for me:

() - Capture group

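[and\\s|or\\s|not\\s]+? - intended to match a leading and, or, not, or a combination of them, each followed by whitespace. Because square brackets define a character class, it actually matches one or more of the single characters a, n, d, o, r, t, | or whitespace, in any order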

.*? any combination of characters or none, ex. BATTCOMPAR

\\{.*?\\} the final part enclosed in curly braces which contains any combination of characters or none

string test = "not BATTCOMPAR{275} and FORKCARRIA{ForkSpreader} and SIDESHIFT{WithSSPassAttachCenterLine} and TILTANGLE{4up_2down} and not AUTOMATSS{true} and not FORKLASGUI{true} and not FORKCAMSYS{true} and OKED{true}";
Regex r = new Regex("([and\\s|or\\s|not\\s]+?.*?\\{.*?\\})", RegexOptions.Multiline);

//or if you need to account for matches where there is no
//prepending words ie. and, not and
//Regex r = new Regex("([and\\s|or\\s|not\\s|]+?.*?\\{.*?\\}|.*?\\{.*?\\})", RegexOptions.Multiline);

MatchCollection matches = r.Matches(test);
        
foreach(Match m in matches)
{
    Console.WriteLine(m.Value); 
}

Prints:

//not BATTCOMPAR{275}
//and FORKCARRIA{ForkSpreader}
//and SIDESHIFT{WithSSPassAttachCenterLine}
//and TILTANGLE{4up_2down}
//and not AUTOMATSS{true}
//and not FORKLASGUI{true}
//and not FORKCAMSYS{true}
//and OKED{true}
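
The difference between a character class and a group, which the comments below discuss, can be checked with a small sketch. The test strings and the `CharClassDemo` class are made up for illustration. The anchors `^` and `$` are added here only so that each pattern has to cover the whole input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // Square brackets define a set of single characters, not a list of words.
    public static readonly Regex CharClass = new Regex(@"^[and\s|or\s|not\s]+$");

    // A non-capturing group with | expresses alternation between whole words.
    public static readonly Regex Group = new Regex(@"^(?:and\s|or\s|not\s)+$");

    public static void Main()
    {
        foreach (var input in new[] { "and not ", "tornado", "dan|rot" })
        {
            Console.WriteLine(
                $"{input,-10} class: {CharClass.IsMatch(input),-5} group: {Group.IsMatch(input)}");
        }
    }
}
```

The character class accepts all three inputs because each consists only of the characters a, n, d, o, r, t, | and whitespace. The group accepts only "and not ".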
Ryan Wilson
  • 3
    `[` [charclasses](https://www.regular-expressions.info/charclass.html) `]` (just mentioning) – bobble bubble Sep 27 '22 at 17:14
  • @bobblebubble Thank you. I'll check it out. Your name in reference to that old Nintendo game? :P – Ryan Wilson Sep 27 '22 at 17:16
  • Haha, yes I knew it from Commodore 64 and Amiga :) Didn't know yet it was on Nintendo too – bobble bubble Sep 27 '22 at 17:17
  • 1
    @bobblebubble Yeah. I remember when my neighborhood friend got it on the NES back in 1980 something...memory is a bit fuzzy. It must have been 1986 - [Bubble_Bobble](https://en.wikipedia.org/wiki/Bubble_Bobble) – Ryan Wilson Sep 27 '22 at 17:33
  • Ryan, this is genius, and works perfectly. Thank you for sharing your knowledge. I would love it if you explained the particulars of the regex. Don't sweat it if you don't have time. I can glean some of it, but confused by some aspects. – HerrimanCoder Sep 27 '22 at 17:42
  • @HerrimanCoder Sure thing. I'll update my answer. Glad to help :) – Ryan Wilson Sep 27 '22 at 17:47
  • Besides our cool bubbling memories... that the regex works is just a coincidence :) This `[and\w|or\w|not\w]` contains the same characters as just `[\w|]` ([see demo](https://regex101.com/r/vkc03U/1))... I don't want to ruin the party :o) – bobble bubble Sep 27 '22 at 18:47
  • Whatever you put into a `[` character class `]`, only the unique characters in there count. If you put `\w`, which already contains `[a-zA-Z0-9_]` (as you noticed), into the class together with any words, these words are redundant. Please click here: [`[and\w|or\w|not\w]+`](https://regex101.com/r/0AdseQ/1)... I won't disturb the party further! – bobble bubble Sep 27 '22 at 19:09
  • 1
    @bobblebubble I completely goofed, when I was writing this and it worked anyway. Yeah, I'm tired and was thinking `\w` was whitespace where `\s` is whitespace. Now I understand why you were putting what you were putting. I fixed my answer to be what I intended. – Ryan Wilson Sep 27 '22 at 19:23
  • 1
    No worries, it's working anyway :) PS: See what [`[and\s|or\s|not\s]+`](https://regex101.com/r/XQzAtd/1) matches. – bobble bubble Sep 27 '22 at 19:29
  • 2
    I think what you meant to use is a [group](https://www.regular-expressions.info/refcapture.html), not a [character-class](https://www.regular-expressions.info/charclass.html), something like [`(?:and\s|or\s|not\s)+.*?\{.*?\}`](https://regex101.com/r/t3uRff/1) but this pattern would not work if the `and|or|not` part was missing and coincidence using the `\w` made it work, so some regex like [`(?:\w.*?)?\{.*?\}`](https://regex101.com/r/zeOAsZ/1) finally imho turns out to be sufficient, but it's just my guessing. Now I wish you sweet dreams and thank you for the talks :) – bobble bubble Sep 27 '22 at 19:46
  • 1
    @bobblebubble You too! Thanks for the memories. Those were some good days when I was young playing Nintendo with friends. – Ryan Wilson Sep 27 '22 at 20:08
  • How can I modify the regex to also find `CONTAINER{Container}` (with no operators before it)? – HerrimanCoder Oct 03 '22 at 19:57
  • @HerrimanCoder Are you asking for the actual words `CONTAINER{Container}` or is that your placeholder example? In any case, the regex in comments under the one in my answer will grab things with no pre-pended and, or not. See comment in answer "`//or if you need to account for matches where there is no //prepending words ie. and, not`" – Ryan Wilson Oct 03 '22 at 20:01
  • 1
    Ryan, you are right! Your alternate does work. I got thrown off because when I tried your alternate at `http://regexstorm.net/tester` it didn't work, so I was scurrying around looking for an alternative. THANK YOU again! – HerrimanCoder Oct 03 '22 at 21:07