Regular Expression for irregularly occurring repeating string

Question

I searched but have not found an answer to the question - maybe it is so obvious that no one else had to ask...

I am using UltraEdit 16.00 to run my Regular Expressions in PERL mode...

Situation:

I have a delimited string that can contain a variable number of repeating segments that must adhere to a very specific format. These segments occur randomly throughout the delimited string.

Example:

CLP*data*data*data~REF*data*data~N1*data*data*data~CAS*OA*29*99.99~AMT*I*99.99~SVC*data*data*data*data~CAS*PR*99.99~CAS*CO**99.99~DTM*150*date~AMT*B6*99.99~SVC*data*data*data*data~CAS*PR*N16*99.99~CAS*CO* *99.99...line continues from here.

Correct format - CAS*OA*29*99.99~
Incorrect format 1 - CAS*OA* *99.99~
Incorrect format 2 - CAS*OA**99.99~

Goal:

Identify only those strings where ALL of the CAS segments adhere to the format.

Things I've Tried:

(BTW: I know my Regular Expressions are not optimized, so please give me a break)

CAS Segment Missing value or containing one or more spaces

CAS\*(OA|PR|CR|CO)\*\*[-]?[\d]+\.?[\d]{0,2}~ matches the first instance it finds
CAS\*(OA|PR|CR|CO)\*[\s]+?\*[-]?[\d]+\.?[\d]{0,2}~ matches the first instance it finds

CAS segment NOT Missing value or containing space(s)

CAS\*(OA|PR|CR|CO)\*[^0-9A-Z]+?\*[-]?[\d]+\.?[\d]{0,2}~ Again, matches first instance

Negative Lookahead using combinations of the above (I am new to trying this approach)

^(?:(?!ab).)+$ - ab => one of the above regular expressions - never got it to work

Question:

How do I write the regular expression to enforce/validate the format of EVERY CAS instance no matter how often it occurs (there is a potential for 0 instances)?

Does ultra-edit support modifiers? Looks like the regular expression is fine, you just need it to repeat over the whole string. In Perl that's the 'g' modifier. — Cfreak, Jun 30 '11 at 19:08
Yes, the global match modifier should work. Also, since you know that the things you want to match are between `~`, you can do the following: `open(my $read,"<","data.txt") or die; while(<$read>){chomp; my @arr=grep{/^CAS/} split /~/; if(@arr){foreach my $i (@arr){my @sub_arr=split(/\*/,$i); ...}}}` (where `...` is a test to determine whether each element of `@sub_arr` is correct). In other words, split each line on `~`, and for each resulting element that starts with `CAS`, split *that* on `*`. Then it's easy to check the format. — , Jun 30 '11 at 19:17
I'm confused by "_first instance if finds_". UltraEdit highlights matches one at a time, and you use F3 for "find next". In my version there is also a checkbox to "highlight all items found". Are you saying this is not working, or are you trying to do a global replace? — cordsen, Jun 30 '11 at 19:19
@cordsen - The file has tens of thousands of lines. Manually iterating over it is not an option. Bookmarking has a similar problem. — kdroyce, Jun 30 '11 at 19:29
@Jack Maney - I'm actually not writing this in Perl (although I think I would have saved myself a ton of headache if I had). — kdroyce, Jun 30 '11 at 19:30
@cfreak - I'm still trying to figure out if UE supports modifiers. Thanks for the idea. — kdroyce, Jun 30 '11 at 19:38
@ cfreak - so what would the modifier syntax look like? REGEX/-g? — kdroyce, Jun 30 '11 at 19:42
@kdroyce - in Perl it's REGEX/g (no dash). I have no idea in UE though :) — Cfreak, Jun 30 '11 at 19:47
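
For readers who are scripting this rather than working inside UltraEdit, the split-based check suggested in the comments above can be sketched in Python. This is only an illustration: the function names `cas_segment_ok` and `line_ok` are hypothetical, and the field rules (four fields, a group code of OA/PR/CR/CO, a non-empty uppercase alphanumeric code, then an amount) are assumptions inferred from the "correct format" example and the patterns in the question.

```python
import re

# Assumed field rules, inferred from the question's examples (not a spec).
GROUP_CODES = {"OA", "PR", "CR", "CO"}
REASON = re.compile(r"[0-9A-Z]+")        # must be present, no spaces
AMOUNT = re.compile(r"-?\d+\.?\d{0,2}")  # e.g. 99.99 or -5


def cas_segment_ok(segment: str) -> bool:
    """Check one CAS segment such as 'CAS*OA*29*99.99' (no trailing '~')."""
    fields = segment.split("*")
    if len(fields) != 4:
        return False
    _, group, reason, amount = fields
    return (
        group in GROUP_CODES
        and REASON.fullmatch(reason) is not None
        and AMOUNT.fullmatch(amount) is not None
    )


def line_ok(line: str) -> bool:
    """True when every CAS segment on the line is well formed (or none exist)."""
    cas_segments = [s for s in line.split("~") if s.startswith("CAS")]
    return all(cas_segment_ok(s) for s in cas_segments)
```

Splitting first keeps each check small: a missing value shows up as an empty field and a blank value as a field containing only a space, and both fail the `REASON` test.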

score 1 · Accepted Answer · answered Jun 30 '11 at 19:17


To say that every CAS instance in your string is valid is to say that the string contains no invalid CAS sequence. The approach you were getting at with a negative lookahead is the simplest way to express this. Here's an example:

/^(?!.*CAS(?!<whatever matches a valid CAS instance>))/

Basically: "Make sure there does not exist in the string an instance of CAS that is not followed by whatever matches a valid CAS instance". Replace the contents of the second negative lookahead, and include whatever it is before 'CAS' that indicates the start of a CAS instance.

As you can see, you don't need to match the string from start to finish to do what you want.
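
To make the double negative concrete, here is a minimal Python sketch of the same idea. Python's `re` module is used only for illustration (the question is about UltraEdit's Perl-style engine), the name `all_cas_valid` is hypothetical, and the sub-pattern for a valid CAS body is an assumption adapted from the expressions in the question.

```python
import re

# Assumed shape of what must follow "CAS" in a valid segment: a group code,
# a non-empty code value, an amount, and the "~" terminator.
VALID_CAS_BODY = r"\*(?:OA|PR|CR|CO)\*[0-9A-Z]+\*-?\d+\.?\d{0,2}~"

# Succeeds at the start of the line only if there is no "CAS" anywhere
# on the line that is NOT followed by a valid body.
ALL_CAS_VALID = re.compile(r"^(?!.*CAS(?!" + VALID_CAS_BODY + r"))")


def all_cas_valid(line: str) -> bool:
    """True when every CAS segment on the line is valid, or there are none."""
    return ALL_CAS_VALID.search(line) is not None
```

With this sketch, a line such as `CLP*data~CAS*OA*29*99.99~AMT*I*99.99~` passes, while a line containing `CAS*CO**99.99~` or `CAS*CO* *99.99~` is rejected. A line with no CAS segment at all also passes, which covers the "0 instances" case.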

jaytea

1,861
1
14
19

@ jaytea - let me give that a try. thanks for the idea. I'll probably be back with some questions so please don't go away. – kdroyce Jun 30 '11 at 19:45
for some reason I have a hard time seeing this conceptually. Based on the snippet above here is what I tried (and failed): /^(?!.*CAS(?!\*(OA|PR|CR|CO)\*[^0-9A-Z]+\*[-]?[\d]+\.?[\d]{0,2})~)/ – kdroyce Jun 30 '11 at 19:53
do I need to include the forward slashes? – kdroyce Jun 30 '11 at 19:57
For anyone that is interested in the solution, jaytea's direction yielded the right answer: ^(?!.*CAS(?!\*(OA|PR|CR|CO)\*[0-9A-Z]+\*[-]?[\d]+\.?[\d]{0,2}~)) – kdroyce Jun 30 '11 at 20:54

score 0 · Answer 2 · answered Jun 30 '11 at 19:15

This idea will make sure the whole line is correct. That is, it will not match the line unless it is correct.

^(regexThatOnlyMatchesASingleCorrectInstance)*$

This starts at the beginning of the line (^), matches as many repetitions as it can (*) of regexThatOnlyMatchesASingleCorrectInstance, and ensures that the end of the string ($) is found right after the last one.

Of course, this will only work when there is a ~ at the end of the string. For the ~ part, use (?:~|$) so that it doesn't require the delimiter at the end of the string.
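
A rough Python sketch of this whole-line approach follows. It rests on assumptions not stated in the answer: the line also contains non-CAS segments, so the "single correct instance" has to accept either a valid CAS segment or any segment that does not start with `CAS*`. The CAS sub-pattern is adapted from the question, and the name `line_is_valid` is hypothetical.

```python
import re

# Assumed pattern for one valid CAS segment, without its trailing delimiter.
CAS_OK = r"CAS\*(?:OA|PR|CR|CO)\*[0-9A-Z]+\*-?\d+\.?\d{0,2}"
# Any other non-empty segment is accepted, provided it does not start with "CAS*".
OTHER = r"(?!CAS\*)[^~]+"

# ^( one acceptable segment, then "~" or end of string )*$
WHOLE_LINE = re.compile(r"^(?:(?:" + CAS_OK + "|" + OTHER + r")(?:~|$))*$")


def line_is_valid(line: str) -> bool:
    """True when the whole line consists only of acceptable segments."""
    return WHOLE_LINE.match(line) is not None
```

Because every segment has to be accounted for between `^` and `$`, a malformed CAS segment anywhere on the line makes the whole match fail, and the `(?:~|$)` alternative lets the last segment appear with or without a closing `~`.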