I've seen several similar questions, and even posted one myself, but this one is rather specific.

A regex has a match pattern. Now say two parts of that pattern can both match the same text in a string. My luck always seems to lean towards the regex matching the wrong one. (I am using the .NET Regex class in C#.)

I have two types of strings that I need to break down:

01 - First Value|02 - Second Value|Blank - Ignore

And:

A - First ValueblankB - Second ValueC - Third Value

So my desired result is to match Code to Meaning with a single pattern string:

Code,Meaning
01,First Value
02,Second Value
Blank,Ignore
A,First Value
blank,
B,Second Value
C,Third Value

I have tried several patterns but can never quite get it right. The closest I have been able to get is:

(([A-Z0-9]{1,4})[ \-–]{1,3}|([Bb]lank)[ \-–]{0,3})(([A-Z][a-z]+[.,;| ]?)+)

My breakdown:

  • [A-Z0-9]{1,4}[ \-–]{1,3} --> this matches the code: 1 to 4 upper-case letters or digits, followed by 1 to 3 characters that are each a space, a hyphen, or the dash (–) from the HTML.

or

  • [Bb]lank[ \-–]{0,3} --> "blank" followed by 0 to 3 characters that are each a space, a hyphen, or the dash (–) from the HTML.

then

  • (([A-Z][a-z]+[.,;| ]?)+) --> should match one or more capitalized words, each optionally followed by a single separator such as a space, so that "First Value" and "Second Value" are each matched in full.

The initial problem is that the final group matches the "Valueblank" in the second input string, because its [a-z]+ happily consumes the lower-case "blank" that follows "Value". I want to somehow prioritize "[Bb]lank" so that it is matched as part of the first group and NEVER as part of the second group.
I tried putting a (?![Bb]lank) negative lookahead in the final group, but it never seems to work. Any help would be appreciated.
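The behaviour described above can be reproduced with a short script. This is only a sketch, written in Python rather than C# for brevity; the pattern is copied unchanged from the question, and the helper name `code_meaning_pairs` is made up for illustration. The pattern uses no syntax that differs between Python's `re` module and .NET, so the same result is expected in C#.

```python
import re

# The pattern from the question, unchanged (split over two lines for width).
PATTERN = re.compile(
    r"(([A-Z0-9]{1,4})[ \-–]{1,3}|([Bb]lank)[ \-–]{0,3})"
    r"(([A-Z][a-z]+[.,;| ]?)+)"
)

def code_meaning_pairs(text):
    """Return (code, meaning) tuples for every match in text (hypothetical helper)."""
    pairs = []
    for m in PATTERN.finditer(text):
        # Group 2 holds an ordinary code, group 3 holds "blank"/"Blank".
        code = m.group(2) or m.group(3)
        pairs.append((code, m.group(4)))
    return pairs

second = "A - First ValueblankB - Second ValueC - Third Value"
for code, meaning in code_meaning_pairs(second):
    print(f"{code},{meaning}")
# A,First Valueblank
# B,Second Value
# C,Third Value
```

The "blank" entry never appears as a code: it is swallowed by the `[a-z]+` of the word "Value" before the alternation gets a chance to start a new match at that position.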

Thanks

Jaeden "Sifo Dyas" al'Raec Ruiner

JaedenRuiner

2 Answers

How about the following (regex101.com example):

/((?:[A-Z0-9]{1,4}|[Bb]lank)(?=\h[-–]\h)|[Bb]lank)(?:\h[-–]\h|\|)?(.*?)(?=[Bb]lank|\||[A-Z0-9]{1,4}\h[-–]\h|$)/gm

Explanation

[Bb]lank

Every occurrence of "blank" accepts either a lowercase or an uppercase "B".

((?:[A-Z0-9]{1,4}|[Bb]lank)(?=\h[-–]\h)|[Bb]lank)

The 1st capturing group: match either an alphanumeric code or "blank" when it is followed by " - " or " – " (positive lookahead), OR a bare "blank" that has no text for the 2nd group after it.

(?:\h[-–]\h|\|)?

A separator of " - " OR " – " OR "|" which will occur zero or one times.

(.*?)

Lazily (non-greedily) match the text of the 2nd capturing group.

(?=[Bb]lank|\||[A-Z0-9]{1,4}\h[-–]\h|$)

Using a positive lookahead, find where the capture should end: at a "blank", a "|", the next alphanumeric code followed by " - " or " – ", or the end of the line (to catch the last item on the row).
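To see the pieces explained above working together, here is a hedged sketch that runs the pattern over both sample strings from the question. It is in Python for brevity, and it swaps `\h` for `[ \t]` because `\h` is a PCRE shorthand that Python's `re` module does not accept (the comments under this answer suggest the same issue arises in C#). The names `ANSWER_PATTERN` and `split_codes` are invented for the example.

```python
import re

# The answer's pattern with \h replaced by [ \t]; re.MULTILINE stands in for /m.
ANSWER_PATTERN = re.compile(
    r"((?:[A-Z0-9]{1,4}|[Bb]lank)(?=[ \t][-–][ \t])|[Bb]lank)"  # group 1: code
    r"(?:[ \t][-–][ \t]|\|)?"                                   # optional separator
    r"(.*?)"                                                    # group 2: meaning (lazy)
    r"(?=[Bb]lank|\||[A-Z0-9]{1,4}[ \t][-–][ \t]|$)",           # stop before next item
    re.MULTILINE,
)

def split_codes(text):
    """Return a list of (code, meaning) tuples (hypothetical helper)."""
    return [(m.group(1), m.group(2)) for m in ANSWER_PATTERN.finditer(text)]

first = "01 - First Value|02 - Second Value|Blank - Ignore"
second = "A - First ValueblankB - Second ValueC - Third Value"

print(split_codes(first))
# [('01', 'First Value'), ('02', 'Second Value'), ('Blank', 'Ignore')]
print(split_codes(second))
# [('A', 'First Value'), ('blank', ''), ('B', 'Second Value'), ('C', 'Third Value')]
```

The lazy `(.*?)` is what keeps "blank" out of the meaning: it grows one character at a time and stops as soon as the lookahead sees "blank", a pipe, or the next code.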

Phil Young
  • what is \h? I don't recognize that regex token. – JaedenRuiner Jan 17 '18 at 00:39
  • It's a shorthand character for a horizontal white space. You can also negate this by capitalising it like `\H`. Likewise, you can also do the same for vertical white spaces with `\v` and `\V`. – Phil Young Jan 17 '18 at 00:45
  • I'm not normally a C# regexer, but in your circumstance you could probably replace `\h` with `\s` if it's causing you problems, like https://regex101.com/r/TzjG9d/2 – Phil Young Jan 17 '18 at 00:47
  • I just used " ?" for a possible space; \s in C# also matches \r and \n, so that won't do. That being said, this was awesome. I updated it to add an additional option, [Ss]paces?, which acts like the "[Bb]lank" word match. Here is a small twist, though: `((?:[A-Z0-9]{1,4}|[Bb]lank|[Ss]paces?)(?= ?[-–] ?)|[Bb]lank|[Ss]paces?)` was my update, but I found that occasionally the strings I'm parsing contain "Blank/Spaces" as a single code. I can work that into the regex, but is there a better way to check for it than yet another alternation? i.e.: [...]`|[Bb]lank|[Ss]paces?|[Bb]lank/[Ss]paces` – JaedenRuiner Jan 17 '18 at 16:07
  • Hmm, possibly. Could you update my regex101 link with your new regex and further examples to demo these cases? – Phil Young Jan 17 '18 at 19:53

A backtracking engine such as .NET's does not look for the longest match. At each starting position it tries the alternatives from left to right and keeps the first one that lets the overall pattern succeed, so when two alternatives can both match at the same position, the earlier alternative is chosen.

For example, the following (silly) pattern will always match the first alternative in preference to the second: (.+)|foo
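A quick way to check how alternation order decides the outcome is the sketch below. It uses Python's `re` module, which is a backtracking engine like .NET's; the variable names are illustrative only.

```python
import re

# Both alternatives could match "foo"; the first one listed is the one used.
m = re.match(r"(.+)|foo", "foo")
print(m.group(1))  # foo  (group 1 participated, so the first alternative won)

# Order matters more than length: the earlier alternative is taken
# even when a later one would match more text.
short_first = re.match(r"a|ab", "ab").group()
long_first = re.match(r"ab|a", "ab").group()
print(short_first)  # a
print(long_first)   # ab
```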

In your case, if you actually want to match two kinds of item, one starting with a number and one starting with a letter, why not do: ([0-9]+...)|([A-Za-z]....)

Match the two alternates as early as possible.

SoronelHaetir
  • Well, it isn't a matter of number versus letter. It is about being able to flag that the "blank" in "Accidentblank" is not part of the group 2 word matches but part of the group 1 code match. For now I have been forced into a two-stage process, first replacing (\w+)(blank)(\w+) with $1|$2|$3 to force the separation, but I am curious whether there is a way to detect it in one pass instead of two. – JaedenRuiner Jan 16 '18 at 22:47
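For reference, the first stage of the two-stage workaround described in that comment might look roughly like this. It is a sketch in Python rather than C#, so the replacement is written `\1|\2|\3` instead of `$1|$2|$3`, and the function name `separate_blank` is made up for the example.

```python
import re

def separate_blank(text):
    """Put pipes around a lower-case "blank" that is glued between two words."""
    return re.sub(r"(\w+)(blank)(\w+)", r"\1|\2|\3", text)

second = "A - First ValueblankB - Second ValueC - Third Value"
print(separate_blank(second))
# A - First Value|blank|B - Second ValueC - Third Value
```

After this pass, "blank" is delimited by pipes, so a second pattern can treat it as a code of its own.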