Error in tag separated by `|` using Regex python

Question

I want to add `|` before every tag. Here is the code I have tried:

tags = ['XYZ', 'CREF', 'BREF', 'RREF', 'REF']

string_data = 'XYZ:MUMBAI UNIVERSITYCREF:PUNE UNIVERSITYBREF:DADAR UNIVERSITYRREF:KOLHAPUR UNIVERCITY LLCREF:SOLAPUR UNIVERSITY'

for each_tag in tags:
    result = string_data.replace(each_tag, "|" + each_tag)
    print(result)

How can I do it using the Regex?

Input String:

XYZ:MUMBAI UNIVERSITYCREF:PUNE UNIVERSITYBREF:DADAR UNIVERSITYRREF:KOLHAPUR UNIVERCITY LLCREF:SOLAPUR UNIVERSITY

Actual result (wrong):

XYZ:MUMBAI UNIVERSITYC|REF:PUNE UNIVERSITYB|REF:DADAR UNIVERSITYR|REF:KOLHAPUR UNIVERCITY LLC|REF:SOLAPUR UNIVERSITY

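Expected result:

|XYZ:MUMBAI UNIVERSITY|CREF:PUNE UNIVERSITY|BREF:DADAR UNIVERSITY|RREF:KOLHAPUR UNIVERCITY LLC|REF:SOLAPUR UNIVERSITY

Is there any way to do it using regex?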

Unfortunately it's impossible to know whether "LLCREF:" should be "LLC/REF:" or "LL/CREF:". — AKX, Feb 25 '20 at 07:37
No need for regex. Use something like this: `"|" + "|".join(['XYZ', 'CREF', 'BREF', 'RREF']) ` — felipsmartins, Feb 25 '20 at 07:37
@GaganTK No, Please check the above error result and Expected Result. — Akshay Godase, Feb 25 '20 at 07:49
@AkshayGodase OK, got it. Can you please add your code that you tried into the question? — Gagan T K, Feb 25 '20 at 07:50
@GaganTK I have updated the code and added the actual (wrong) result. Please verify it. — Akshay Godase, Feb 25 '20 at 08:03

Sree Kumar · Answer 1 · 2020-02-25T10:18:39.610

Since your most important problem is to split the string correctly, I have tried to address only that. You can append and prepend the | afterwards.

This pattern seems to be working:

(XYZ|CREF|BREF|RREF|REF):[a-zA-Z\s]+?(LLC)?(?=(XYZ|CREF|BREF|RREF|REF)|$)

Explanation:

1. (XYZ|CREF|BREF|RREF|REF): : Matches the start of a tag, followed by the colon. The order is important: keep the shortest substring, REF, at the end.
2. [a-zA-Z\s]+? : Matches the letters and whitespace that follow the tag, reluctantly (lazily). Reluctant, because when the engine reaches the start of CREF we want it to stop there and NOT take more characters "greedily". Because of this reluctance, the order of tags in point (4) matters.
3. (LLC)? : An exception list of known words that end with a character sequence a tag may start with. (I could not think of any other way to handle this.) The exception list must be known in advance; it could be configured separately and appended to the pattern at runtime. If the input data structure is known beforehand and such exceptions are few, this is not a bottleneck. Otherwise, it is.
4. (?=(XYZ|CREF|BREF|RREF|REF)|$) : A lookahead that makes the engine stop when one of the tags comes next. $ lets the match stop at the end of the input when no further tag follows.

This gives the following output for the input string you have provided:

XYZ:MUMBAI UNIVERSITY
CREF:PUNE UNIVERSITY
BREF:DADAR UNIVERSITY
RREF:KOLHAPUR UNIVERCITY LLC
REF:SOLAPUR UNIVERSITY

Edit

Adding the Python 3.8.1 code that I tested:

import re

s = "XYZ:MUMBAI UNIVERSITYCREF:PUNE UNIVERSITYBREF:DADAR UNIVERSITYRREF:KOLHAPUR UNIVERCITY LLCREF:SOLAPUR UNIVERSITY"

# Raw string, so \s reaches the regex engine unchanged.
p = r"(XYZ|CREF|BREF|RREF|REF):[a-zA-Z\s]+?(LLC)?(?=(XYZ|CREF|BREF|RREF|REF)|$)"

# Each match is one "TAG:value" segment.
matches = re.finditer(p, s)

tag_list = [m.group() for m in matches]
s2 = "|" + "|".join(tag_list)
print(s2)
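
The explanation mentions that the exception list could be configured separately and appended to the pattern at runtime. A minimal sketch of that idea follows, assuming the tags and the exception words contain only plain text. The helper name `build_pattern` is made up for this illustration and is not part of the tested code above.

```python
import re

def build_pattern(tags, exceptions):
    # Hypothetical helper: assembles the same kind of pattern from two lists.
    # Longest tags first, so the shortest one (REF) ends up last.
    tag_alt = "|".join(re.escape(t) for t in sorted(tags, key=len, reverse=True))
    exc_alt = "|".join(re.escape(e) for e in exceptions)
    # The optional exception group is only emitted when exceptions exist.
    exc_part = f"(?:{exc_alt})?" if exceptions else ""
    return re.compile(rf"(?:{tag_alt}):[a-zA-Z\s]+?{exc_part}(?=(?:{tag_alt})|$)")

tags = ["XYZ", "CREF", "BREF", "RREF", "REF"]
s = "XYZ:MUMBAI UNIVERSITYCREF:PUNE UNIVERSITYBREF:DADAR UNIVERSITYRREF:KOLHAPUR UNIVERCITY LLCREF:SOLAPUR UNIVERSITY"

pattern = build_pattern(tags, ["LLC"])
parts = [m.group() for m in pattern.finditer(s)]
result = "|" + "|".join(parts)
print(result)
```

With an empty exception list the same input is split as `LL` followed by `CREF`, which is the ambiguity the exception list is there to resolve.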

@Sree_Kumar tried the above code but not getting the result. — Akshay Godase, Feb 25 '20 at 09:21
@AkshayGodase I have posted the code I used to test. Can you please check? — Sree Kumar, Feb 25 '20 at 09:28
@Sreee_Kumar I have tested your code it works properly. But I am expecting the Expected Result. Please check the below `|XYZ:MUMBAI UNIVERSITY|CREF:PUNE UNIVERSITY|BREF:DADAR UNIVERSITY|RREF:KOLHAPUR UNIVERCITY LLC|REF:SOLAPUR UNIVERSITY` — Akshay Godase, Feb 25 '20 at 09:47
OK. Isn't simply joining the parts enough after this? Or do you want a solution with a regex level replacement? — Sree Kumar, Feb 25 '20 at 10:09

score 1 · Accepted Answer · answered Feb 25 '20 at 12:15

You could match an optional B or R, or match a C when it is not preceded by an L, using a negative lookbehind.

(?:[BR]?|(?<!L)C)REF|^(?!\|)

Explanation

(?: Non capture group
- [BR]? Match an optional B or R
- | Or
- (?<!L)C Match a C and assert what is directly to the left is not L
) Close group
REF Match literally
| Or
^(?!\|) Assert the start of the string when not directly followed by a | to prevent starting with a double || if there already is one present

In the replacement use the match prepended with a pipe

|\g<0>

For example

import re

regex = r"(?:[BR]?|(?<!L)C)REF|^(?!\|)"
test_str = "XYZ:MUMBAI UNIVERSITYCREF:PUNE UNIVERSITYBREF:DADAR UNIVERSITYRREF:KOLHAPUR UNIVERCITY LLCREF:SOLAPUR UNIVERSITY"
# \g<0> refers to the whole match, so every match gets a pipe in front of it.
subst = "|\\g<0>"
result = re.sub(regex, subst, test_str)

print(result)

Output

|XYZ:MUMBAI UNIVERSITY|CREF:PUNE UNIVERSITY|BREF:DADAR UNIVERSITY|RREF:KOLHAPUR UNIVERCITY LLC|REF:SOLAPUR UNIVERSITY
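
To see what each branch of the pattern does, here is a small sketch that runs the same regex on shorter strings. The function name `add_pipes` and the sample strings are made up for illustration; only the pattern and the replacement come from the answer.

```python
import re

# Same pattern as in the answer, compiled once.
PATTERN = re.compile(r"(?:[BR]?|(?<!L)C)REF|^(?!\|)")

def add_pipes(text):
    # Hypothetical wrapper: prepend a pipe to every match.
    return PATTERN.sub(r"|\g<0>", text)

# C not preceded by L: CREF is treated as the tag.
print(add_pipes("XYZ:AYCREF:B"))    # |XYZ:AY|CREF:B

# C preceded by L: the lookbehind rejects CREF, so only REF is tagged.
print(add_pipes("XYZ:A LLCREF:B"))  # |XYZ:A LLC|REF:B

# Input that already starts with a pipe does not get a second one.
print(add_pipes("|XYZ:A"))          # |XYZ:A

# The lookbehind only guards C; a B or R is always taken as part of the tag.
print(add_pipes("XYZ:A LLBREF:B"))  # |XYZ:A LL|BREF:B
```

The last call shows the limit of this approach: whether `LLBREF` should split as `LL|BREF` or `LLB|REF` has to be decided by a rule that is not in the data itself.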

answered Feb 25 '20 at 12:15

The fourth bird


@The_fourth_bird It works successfully. But if I test it on another string like `"XYZ:MUMBAI UNIVERSITYCREF:PUNE UNIVERSITYBREF:DADAR UNIVERSITYRREF:KOLHAPUR UNIVERCITY LLBREF:SOLAPUR UNIVERSITY"`, then it does not work. It should work in any case, meaning that whichever letter A-Z comes before `REF`, it should still work. – Akshay Godase Feb 26 '20 at 10:21
@The_fourth_bird Please check the above comment highlighted string. Let me know if you have any questions. – Akshay Godase Feb 26 '20 at 10:23
You could use the character class A-Z in the lookbehind. https://regex101.com/r/KtfLAR/1 But it will not match for `LLLREF` then. What is expected in that case? Can you update the regex101 link with what should and what should not be matched? – The fourth bird Feb 26 '20 at 10:29

score 0 · Answer 3 · answered Feb 25 '20 at 08:11

Your problem is the overlap between 'CREF', 'BREF', 'RREF' and 'REF': since 'REF' is contained in the other three, you will end up with duplicate replacements if you fix your code to this:

tags = ['XYZ', 'CREF', 'BREF', 'RREF', 'REF']

string_data = 'XYZ:MUMBAI UNIVERSITYCREF:PUNE UNIVERSITYBREF:DADAR UNIVERSITYRREF:KOLHAPUR UNIVERCITY LLCREF:SOLAPUR UNIVERSITY'

for each_tag in tags:
    string_data = string_data.replace(each_tag, "|" + each_tag)
    print(string_data)

You need to make sure you only replace 'REF' if it's not preceded by a 'C', 'B' or 'R'.

Note that this would still cause issues for some cases like XYZ:CARE BEARREF. I.e. you might expect |XYZ:CARE BEAR|REF, but you'll get |XYZ:CARE BEA|RREF. If you want to avoid that, you need to be more precise about the actual rules.

This works, if you know this type of problem won't occur:

import re

string_data = 'XYZ:MUMBAI UNIVERSITYCREF:PUNE UNIVERSITYBREF:DADAR UNIVERSITYRREF:KOLHAPUR UNIVERCITY LLCREF:SOLAPUR UNIVERSITY'

result = re.sub(r"(XYZ|CREF|BREF|RREF|REF)", r"|\1", string_data)
print(result)

This avoids specific checks: the regex engine scans left to right and matches do not overlap, so once CREF, BREF or RREF has been matched, the REF inside it is not matched again.
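
As a rough comparison of the two approaches, the sketch below puts the chained `str.replace` loop next to a single `re.sub` pass. The function names and the short sample strings are made up for illustration.

```python
import re

TAG_LIST = ["XYZ", "CREF", "BREF", "RREF", "REF"]
TAGS = re.compile("|".join(TAG_LIST))

def tag_sequential(text):
    # One str.replace per tag, each working on the previous result.
    for tag in TAG_LIST:
        text = text.replace(tag, "|" + tag)
    return text

def tag_single_pass(text):
    # One regex pass; matches cannot overlap, so the REF inside CREF is skipped.
    return TAGS.sub(r"|\g<0>", text)

print(tag_sequential("XYZ:ONECREF:TWO"))       # |XYZ:ONE|C|REF:TWO
print(tag_single_pass("XYZ:ONECREF:TWO"))      # |XYZ:ONE|CREF:TWO

# The ambiguous case from above is still split at the earliest tag found.
print(tag_single_pass("XYZ:CARE BEARREF:X"))   # |XYZ:CARE BEA|RREF:X
```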

I tried your code but my expected result did not match with your result. — Akshay Godase, Feb 25 '20 at 09:20

score 0 · Answer 4 · answered Feb 27 '20 at 05:36

import re

string = "XYZ:MUMBAI UNIVERSITYCREF:PUNE UNIVERSITYBREF:DADAR UNIVERSITYRREF:KOLHAPUR UNIVERCITY LLCREF:SOLAPUR UNIVERSITY"

# [CBR] is a character class; a | inside [...] would match a literal pipe.
regx = r"(XYZ|[CBR]REF|REF):[a-zA-Z\s]+?(LLC)?(?=(XYZ|[CBR]REF)|REF|$)"

matches = re.finditer(regx, string)

tag = []

for match in matches:
    tag.append(match.group())

result = "|" + "|".join(tag)
print(result)

answered Feb 27 '20 at 05:36

Vini Patel


Welcome to StackOverflow, and congrats on your first post. You'll get a more positive response (i.e. upvotes and increased reputation) if you format your code as code, and add some text to describe why/how your answer works. – Caleb Feb 27 '20 at 05:48
