4

I want to insert a | before every tag. Here is the code I have used:

tags = ['XYZ', 'CREF', 'BREF', 'RREF', 'REF']

string_data = 'XYZ:MUMBAI UNIVERSITYCREF:PUNE UNIVERSITYBREF:DADAR UNIVERSITYRREF:KOLHAPUR UNIVERCITY LLCREF:SOLAPUR UNIVERSITY'

for each_tag in tags:
    result = string_data.replace(each_tag, "|" + each_tag)
    print(result)

How can I do this using a regex?

Input String:

XYZ:MUMBAI UNIVERSITYCREF:PUNE UNIVERSITYBREF:DADAR UNIVERSITYRREF:KOLHAPUR UNIVERCITY LLCREF:SOLAPUR UNIVERSITY

Actual result (wrong):

XYZ:MUMBAI UNIVERSITYC|REF:PUNE UNIVERSITYB|REF:DADAR UNIVERSITYR|REF:KOLHAPUR UNIVERCITY LLC|REF:SOLAPUR UNIVERSITY

Expected result:

|XYZ:MUMBAI UNIVERSITY|CREF:PUNE UNIVERSITY|BREF:DADAR UNIVERSITY|RREF:KOLHAPUR UNIVERCITY LLC|REF:SOLAPUR UNIVERSITY

Is there any way to do it using regex?

Akshay Godase

4 Answers

2

Since your most important problem is to split the string correctly, I have tried to address only that. You can append and prepend the | afterwards.

This pattern seems to be working:

(XYZ|CREF|BREF|RREF|REF):[a-zA-Z\s]+?(LLC)?(?=(XYZ|CREF|BREF|RREF|REF)|$)

Explanation:

  1. (XYZ|CREF|BREF|RREF|REF): : Matches a tag and the colon that follows it. The engine scans left to right, so in UNIVERSITYCREF it matches CREF starting at the C before it ever reaches the embedded REF. No tag is a prefix of another, so keeping the shortest tag, REF, at the end is a safe habit rather than a strict requirement.
  2. [a-zA-Z\s]+? : Match any letters and whitespace that occur after the tag, reluctantly. Reluctant, because when the engine reaches the start of the next tag, such as CREF, we want it to stop there and NOT take more characters "greedily". The match therefore ends at the first position where the lookahead in point (4) succeeds.
  3. (LLC)? : This is a kind of exception list of all known words that end with a character sequence that a tag may start with. (I could not think of any other way to handle this.) The exception list must be known; it could be configured separately and appended to the pattern at runtime. If the input data structure is known beforehand and such exceptions are limited and known, this is not a bottleneck. Otherwise, it is.
  4. (?=(XYZ|CREF|BREF|RREF|REF)|$) : A lookahead to ensure that the engine stops when it finds one of the tags coming up. $ allows it to stop at the end of the input if there are no more tags.
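Point (3) mentions configuring the exception list separately and appending it to the pattern at runtime. A minimal sketch of that idea follows, assuming the tags and exceptions arrive as plain lists; the helper name `build_pattern` and its parameters are hypothetical and not part of the answer.

```python
import re

def build_pattern(tags, exceptions):
    """Hypothetical helper: assemble the pattern from configurable lists."""
    # Escape every entry so it is always treated as literal text
    tag_alt = "|".join(re.escape(t) for t in tags)
    pattern = rf"(?:{tag_alt}):[a-zA-Z\s]+?"
    if exceptions:
        # Optional group for known words that end like the start of a tag
        exc_alt = "|".join(re.escape(e) for e in exceptions)
        pattern += rf"(?:{exc_alt})?"
    # Stop right before the next tag, or at the end of the input
    pattern += rf"(?=(?:{tag_alt})|$)"
    return re.compile(pattern)

tags = ["XYZ", "CREF", "BREF", "RREF", "REF"]
s = "XYZ:MUMBAI UNIVERSITYCREF:PUNE UNIVERSITYBREF:DADAR UNIVERSITYRREF:KOLHAPUR UNIVERCITY LLCREF:SOLAPUR UNIVERSITY"

pattern = build_pattern(tags, ["LLC"])
result = "|" + "|".join(m.group() for m in pattern.finditer(s))
print(result)
```

With this layout, a newly discovered exception word only needs to be added to the list passed in; the pattern text itself stays untouched.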

This gives the following output for the input string you have provided:

XYZ:MUMBAI UNIVERSITY
CREF:PUNE UNIVERSITY
BREF:DADAR UNIVERSITY
RREF:KOLHAPUR UNIVERCITY LLC
REF:SOLAPUR UNIVERSITY

Edit

Adding the Python 3.8.1 code that I tested:

import re

s = "XYZ:MUMBAI UNIVERSITYCREF:PUNE UNIVERSITYBREF:DADAR UNIVERSITYRREF:KOLHAPUR UNIVERCITY LLCREF:SOLAPUR UNIVERSITY"

# Raw string, so \s reaches the regex engine unchanged
p = r"(XYZ|CREF|BREF|RREF|REF):[a-zA-Z\s]+?(LLC)?(?=(XYZ|CREF|BREF|RREF|REF)|$)"

matches = re.finditer(p, s)

# Each full match is one "TAG:value" chunk
tag_list = [m.group() for m in matches]
s2 = "|" + "|".join(tag_list)
print(s2)
Sree Kumar
1

You could match an optional B or R, or match a C only when it is not preceded by an L, using a negative lookbehind.

(?:[BR]?|(?<!L)C)REF|^(?!\|)

Explanation

  • (?: Non capture group
    • [BR]? Match an optional B or R
    • | Or
    • (?<!L)C Match a C and assert what is directly to the left is not L
  • ) Close group
  • REF Match literally
  • | Or
  • ^(?!\|) Assert the start of the string when not directly followed by a | to prevent starting with a double || if there already is one present
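The last alternative can be tried in isolation. The sketch below applies only `^(?!\|)` and shows that a leading pipe is added once and not doubled when one is already present; the function name `prefix_leading` and the short sample strings are made up for illustration.

```python
import re

def prefix_leading(text):
    # Zero-width match at the start of the string, but only if
    # the first character is not already a pipe
    return re.sub(r"^(?!\|)", "|", text)

print(prefix_leading("XYZ:A"))   # |XYZ:A
print(prefix_leading("|XYZ:A"))  # |XYZ:A (unchanged)
```

This guard only covers the start of the string; it says nothing about pipes that already sit in front of tags further along.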

In the replacement, use the match prepended with a pipe:

|\g<0>

For example

import re

regex = r"(?:[BR]?|(?<!L)C)REF|^(?!\|)"
test_str = "XYZ:MUMBAI UNIVERSITYCREF:PUNE UNIVERSITYBREF:DADAR UNIVERSITYRREF:KOLHAPUR UNIVERCITY LLCREF:SOLAPUR UNIVERSITY"
subst = "|\\g<0>"
result = re.sub(regex, subst, test_str)

print(result)

Output

|XYZ:MUMBAI UNIVERSITY|CREF:PUNE UNIVERSITY|BREF:DADAR UNIVERSITY|RREF:KOLHAPUR UNIVERCITY LLC|REF:SOLAPUR UNIVERSITY
The fourth bird
  • @The_fourth_bird It works for the original string. But when I test it on another string such as `"XYZ:MUMBAI UNIVERSITYCREF:PUNE UNIVERSITYBREF:DADAR UNIVERSITYRREF:KOLHAPUR UNIVERCITY LLBREF:SOLAPUR UNIVERSITY"`, it does not work. It should work in every case, that is, whatever letter from A-Z comes before `REF`. – Akshay Godase Feb 26 '20 at 10:21
  • @The_fourth_bird Please check the above comment highlighted string. Let me know if you have any questions. – Akshay Godase Feb 26 '20 at 10:23
  • You could use the character class A-Z in the lookbehind: https://regex101.com/r/KtfLAR/1 But then it will not match for `LLLREF`. What is the expected result in that case? Can you update the regex101 link with what should and what should not be matched? – The fourth bird Feb 26 '20 at 10:29
0

Your problem is the overlap between 'CREF', 'BREF', 'RREF' and 'REF': since 'REF' is contained in the other three, you will end up with duplicate replacements if you fix your code like this:

tags = ['XYZ', 'CREF', 'BREF', 'RREF', 'REF']

string_data = 'XYZ:MUMBAI UNIVERSITYCREF:PUNE UNIVERSITYBREF:DADAR UNIVERSITYRREF:KOLHAPUR UNIVERCITY LLCREF:SOLAPUR UNIVERSITY'

for each_tag in tags:
    string_data = string_data.replace(each_tag, "|" + each_tag)
    print(string_data)

You need to make sure you only replace 'REF' if it's not preceded by a 'C', 'B' or 'R'.

Note that this would still cause issues for some cases like XYZ:CARE BEARREF. I.e. you might expect |XYZ:CARE BEAR|REF, but you'll get |XYZ:CARE BEA|RREF. If you want to avoid that, you need to be more precise about the actual rules.

This works, if you know this type of problem won't occur:

import re

string_data = 'XYZ:MUMBAI UNIVERSITYCREF:PUNE UNIVERSITYBREF:DADAR UNIVERSITYRREF:KOLHAPUR UNIVERCITY LLCREF:SOLAPUR UNIVERSITY'

result = re.sub("(XYZ|CREF|BREF|RREF|REF)", r"|\1", string_data )
print(result)

This avoids explicit checks for the preceding letter: the regex scans left to right and matches do not overlap, so REF is not matched again once it has been consumed as part of CREF, BREF or RREF.
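To see where the ambiguity described above shows up, here is a small sketch that runs the same alternation on the question's string and on the CARE BEARREF case. The function name `prefix_tags` is hypothetical; the pattern is the one above, without the capture group.

```python
import re

TAG_RE = re.compile(r"XYZ|CREF|BREF|RREF|REF")

def prefix_tags(text):
    # \g<0> is the whole match, so no capture group is needed
    return TAG_RE.sub(r"|\g<0>", text)

sample = "XYZ:MUMBAI UNIVERSITYCREF:PUNE UNIVERSITYBREF:DADAR UNIVERSITYRREF:KOLHAPUR UNIVERCITY LLCREF:SOLAPUR UNIVERSITY"

print(prefix_tags(sample))
# |XYZ:MUMBAI UNIVERSITY|CREF:PUNE UNIVERSITY|BREF:DADAR UNIVERSITY|RREF:KOLHAPUR UNIVERCITY LL|CREF:SOLAPUR UNIVERSITY

print(prefix_tags("XYZ:CARE BEARREF"))
# |XYZ:CARE BEA|RREF
```

On the question's string the split lands at LL|CREF rather than the LLC|REF the question expects, which is the same kind of ambiguity as CARE BEARREF: the leftmost possible tag wins unless extra rules say otherwise.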

Grismar
0

import re

string = "XYZ:MUMBAI UNIVERSITYCREF:PUNE UNIVERSITYBREF:DADAR UNIVERSITYRREF:KOLHAPUR UNIVERCITY LLCREF:SOLAPUR UNIVERSITY"

# Raw string so \s reaches the regex engine; [CBR] replaces [C|B|R],
# which would also have matched a literal "|"
regx = r"(XYZ|[CBR]REF|REF):[a-zA-Z\s]+?(LLC)?(?=(XYZ|[CBR]REF)|REF|$)"

matches = re.finditer(regx, string)

tag = []

for match in matches:
    tag.append(match.group())

result = "|" + "|".join(tag)
print(result)

  • Welcome to StackOverflow, and congrats on your first post. You'll get a more positive response (i.e. upvotes and increased reputation) if you format your code as code, and add some text to describe why/how your answer works. – Caleb Feb 27 '20 at 05:48