I have two strings in which part of the content is optional. I tried to build an optional pattern by adding ? after every group that I want to make optional, but those groups come back as None in the output.

Text-1:

text1 = '95031B2\tR\tC01 N1 P93 R-- 12:39:18.540 08/05/20 0000002802 R  -                               No_barcode  FLC   F  LR    7.673353 sccm   Pt   25.288202 psig   FL  536.651917 sccm   EDC   0.000000 sccm   PQ    7.668324 sccm   QF  536.289246 sccm   QP   25.287605 psig   LLQ  -0.109524 sccm   HLQ   4.440174 sccm   CLF   1.429953 sccm   MF    0.000000 sccm   LF  100.000015 sccm   MQF   0.000000 sccm   LQF 100.000015 sccm   FPR  25.290846 psig \r\n'

Text-2:

text2 = '5102060\tR\tC01 N1 P93 R-- 12:38:52.140 08/05/20 0000002801 FO -                               No_barcode \r\n'

Working pattern for text1:

pattern1 = ['(?P<time>\d\d:\d\d:\d\d.\d{3})\s',
           '(?P<date>\d\d/\d\d/\d\d)\s',
           '(?P<sno>\d{10})\s',
           '(?P<status>\w{1,2}).*?-',
           '\s*',
           '(?P<bcode>No_barcode|\W{20})',
           '\s{2}',
           '(?P<type>\w{3})',
           '.*?',
           '(?P<pr>Pt.*?\d*[.]?\d*\s[a-z]+)'
           '\s{1,3}',
           '(?P<fl>FL.*?\d*[.]?\d*\s[a-z]+)'
           ]

I tried to make the trailing part of the above pattern optional so that it works with both strings:

>>> pattern2 = ['(?P<time>\d\d:\d\d:\d\d.\d{3})\s', # time pattern
           '(?P<date>\d\d/\d\d/\d\d)\s',           # date pattern
           '(?P<sno>\d{10})\s',                    # 10 digits
           '(?P<status>\w{1,2}).*?-',              # 1 or 2 alphabets follows with anything and then hyphen('-')
           '\s*',                                  # zero or more spaces
           '(?P<bcode>No_barcode|\W{20})',         # No_barcode or any alphanumeric with 20 length
                                       
                     # OPTIONAL PART STARTS (Not working)
           '(\s{1,2}|',                            # 1 or 2 spaces or
           '(?P<type>\w{3})|',                     # 3 alphabets or
           '.*?|',                                 # anything getting ignored or
           '(?P<pr>Pt.*?\d*[.]?\d*\s[a-z]+)|'      # Pt digits optional decimal followed with digits, 1 space, 1 or more a-z alphabets or
           '\s{1,3}|',                             # 1 to 3 spaces or
           '(?P<fl>FL.*?\d*[.]?\d*\s[a-z]+))?'     # FL digits optional decimal followed with digits, 1 space, 1 or more a-z alphabets 
           ]

Output:

>>> res = re.search(r''.join(pattern1), text1) # pattern1
>>> res.groups()
('12:39:18.540', '08/05/20', '0000002802', 'R', 'No_barcode', 'FLC', 'Pt   25.288202 psig', 'FL  536.651917 sccm')

>>> res = re.search(r''.join(pattern2), text1) # pattern2, trying to get same output as pattern1
>>> res.groups()
('12:39:18.540', '08/05/20', '0000002802', 'R', 'No_barcode', '  ', None, None, None)

Expected output:

For pattern2 (pattern1 with the optional part added), I should get the same output as with pattern1.

>>> res = re.search(r''.join(pattern2), text1) # pattern2
>>> res.groups()
('12:39:18.540', '08/05/20', '0000002802', 'R', 'No_barcode', 'FLC', 'Pt   25.288202 psig', 'FL  536.651917 sccm')
shaik moeed
  • can you specify what your objective is and what you expect as an output? – tomanizer Aug 07 '20 at 10:20
  • @tomanizer I have added expected output details in question. – shaik moeed Aug 07 '20 at 10:25
  • You wrapped everything after the obligatory part in a single optional group and inserted the `|` alternation operator between the lines of the pattern, which is wrong. If the parts are all optional, you need to wrap each part in its own optional group. – Wiktor Stribiżew Aug 07 '20 at 10:26
  • thank you. I do not see any FL, Pt in pattern 2. why do you expect it to show up in the regex? – tomanizer Aug 07 '20 at 10:27
  • @tomanizer Pt, FL is there in both the patterns. Pt and FL are present in text1 and not there in text2. So, I want to consider them as optional part. – shaik moeed Aug 07 '20 at 10:29
  • @WiktorStribiżew I have tried to add optional operator for each group. But it does not work as expected. Can you please post the regex as an answer? – shaik moeed Aug 07 '20 at 10:31
  • You did not create the optional groups correctly. Also, you should specify whether the match of each subsequent subpattern depends on the preceding subpatterns being found. – Wiktor Stribiżew Aug 07 '20 at 10:36

1 Answer

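You wrapped everything after the obligatory part in a single optional group and inserted the | alternation operator between the lines of the pattern, which is the wrong way to make optional groups. With alternation, only one of the alternatives can match, so the remaining named groups stay None.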
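To see the effect in isolation, here is a minimal sketch. The pattern and the sample string are cut-down stand-ins for pattern2 and text1, not the original ones:

```python
import re

# Cut-down stand-in for pattern2: ONE optional group whose parts are joined
# with "|", so at most one alternative can match.
wrong = re.compile(
    r'(?P<bcode>No_barcode)'
    r'(\s{1,2}|(?P<type>\w{3})|(?P<pr>Pt\s+[\d.]+\s[a-z]+))?'
)

sample = 'No_barcode  FLC   Pt   25.288202 psig'
m = wrong.search(sample)

# The first alternative (\s{1,2}) succeeds right after "No_barcode", so the
# engine never tries the alternatives holding the named groups.
print(m.groups())  # ('No_barcode', '  ', None, None)
```

The two spaces captured by the unnamed group are the same '  ' entry that shows up in the pattern2 output in the question.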

If the parts are all optional and the subsequent subpatterns do not depend on whether the preceding subpattern is found, wrap each part in its own optional group (Scheme 1): <obligatory_part>(?:...(optional_1)...)?(?:...(optional_2)...)?(?:...(optional_n)...)?.

Or, if a subsequent subpattern cannot appear when the preceding one is missing, make nested optional groups (Scheme 2): <obligatory_part>(?:...(optional_1)...(?:...(optional_2)...(?:...(optional_n)...)?)?)?.
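The difference between the two layouts is easiest to see on a toy input. The sketch below uses made-up `id=`, `a=` and `b=` fields (purely illustrative, not from the question) and runs both layouts on a string where the first optional field is missing:

```python
import re

# Independent optional groups: each field may match or not on its own.
independent = re.compile(
    r'id=(?P<id>\d+)(?:\s+a=(?P<a>\w+))?(?:\s+b=(?P<b>\w+))?'
)
# Nested optional groups: "b" is only attempted when "a" has matched.
nested = re.compile(
    r'id=(?P<id>\d+)(?:\s+a=(?P<a>\w+)(?:\s+b=(?P<b>\w+))?)?'
)

for text in ('id=7 a=x b=y', 'id=7 b=y', 'id=7'):
    print(repr(text))
    print('  independent:', independent.search(text).groupdict())
    print('  nested:     ', nested.search(text).groupdict())
```

For 'id=7 b=y' the independent layout still captures b, while the nested layout leaves both a and b as None because the outer optional group fails as a whole.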

So, you may use either this regex:

(?P<time>\d\d:\d\d:\d\d.\d{3})\s(?P<date>\d\d/\d\d/\d\d)\s(?P<sno>\d{10})\s(?P<status>\w{1,2}).*?-\s*(?P<bcode>No_barcode|\W{20})(?:\s+(?P<type>\w{3}))?(?:.*?(?P<pr>Pt.*?\d*[.]?\d*\s[a-z]+))?(?:\s{1,3}(?P<fl>FL.*?\d*[.]?\d*\s[a-z]+))?

Or this regex:

(?P<time>\d\d:\d\d:\d\d.\d{3})\s(?P<date>\d\d/\d\d/\d\d)\s(?P<sno>\d{10})\s(?P<status>\w{1,2}).*?-\s*(?P<bcode>No_barcode|\W{20})(?:\s+(?P<type>\w{3})(?:.*?(?P<pr>Pt.*?\d*[.]?\d*\s[a-z]+)(?:\s{1,3}(?P<fl>FL.*?\d*[.]?\d*\s[a-z]+))?)?)?

Since you say you need the first scheme (independent optional groups), you may use

pattern = [r'(?P<time>\d\d:\d\d:\d\d.\d{3})\s',
           r'(?P<date>\d\d/\d\d/\d\d)\s',
           r'(?P<sno>\d{10})\s',
           r'(?P<status>\w{1,2}).*?-',
           r'\s*',
           r'(?P<bcode>No_barcode|\W{20})',
           r'(?:\s+',                     # (?: opens a non-capturing group...
           r'(?P<type>\w{3}))?',          # ...and )? closes it, making the whole group optional
           r'(?:.*?',
           r'(?P<pr>Pt.*?\d*[.]?\d*\s[a-z]+))?',
           r'(?:\s{1,3}',
           r'(?P<fl>FL.*?\d*[.]?\d*\s[a-z]+))?'
          ]
print(r''.join(pattern))

See Python demo.
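As a quick check, the joined pattern can be run against both kinds of line. This is a sketch, assuming shortened copies of text1 and text2: the leading ID columns, part of the padding before No_barcode, and the trailing measurement fields are trimmed for readability.

```python
import re

pattern = [r'(?P<time>\d\d:\d\d:\d\d.\d{3})\s',
           r'(?P<date>\d\d/\d\d/\d\d)\s',
           r'(?P<sno>\d{10})\s',
           r'(?P<status>\w{1,2}).*?-',
           r'\s*',
           r'(?P<bcode>No_barcode|\W{20})',
           r'(?:\s+(?P<type>\w{3}))?',
           r'(?:.*?(?P<pr>Pt.*?\d*[.]?\d*\s[a-z]+))?',
           r'(?:\s{1,3}(?P<fl>FL.*?\d*[.]?\d*\s[a-z]+))?',
          ]
rx = re.compile(''.join(pattern))

# Shortened stand-ins for text1 and text2 from the question.
text1 = ('C01 N1 P93 R-- 12:39:18.540 08/05/20 0000002802 R  -      '
         'No_barcode  FLC   F  LR    7.673353 sccm   Pt   25.288202 psig   '
         'FL  536.651917 sccm   EDC   0.000000 sccm \r\n')
text2 = 'C01 N1 P93 R-- 12:38:52.140 08/05/20 0000002801 FO -      No_barcode \r\n'

for text in (text1, text2):
    m = rx.search(text)
    print(m.groupdict())
```

On the shortened text1 all eight named groups are filled; on the shortened text2 the type, pr and fl groups are None while the obligatory part still matches.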

Wiktor Stribiżew
  • Let me know which optional scheme you need to implement, I will add Python code. – Wiktor Stribiżew Aug 07 '20 at 10:38
  • I want to use 1st regex, nested optional groups are not required but thanks for sharing that. It's helpful. :) – shaik moeed Aug 07 '20 at 10:43
  • @shaikmoeed I added the Scheme 1 pattern definition. I suggest using raw string literals, by the way, to avoid any issues with backslashes. – Wiktor Stribiżew Aug 07 '20 at 10:47
  • I didn't understand the purpose of non-capturing groups used in every optional group. Not only about this example, but I have also read others. Can you explain it with the present example? – shaik moeed Aug 07 '20 at 10:54
  • @shaikmoeed Non-capturing groups are used as a container for a sequence of patterns, without storing the match in a separate memory slot. Since you only rely on named groups, you might use capturing groups, but it is computationally tidier to use non-capturing ones. Also, see [Are non-capturing groups redundant?](https://stackoverflow.com/questions/31500422/are-non-capturing-groups-redundant). – Wiktor Stribiżew Aug 07 '20 at 10:57
  • I tried keeping only the optional named group and removing the non-capturing group, but it didn't work. Do you know the reason for this? – shaik moeed Aug 07 '20 at 11:07
  • @shaikmoeed Because there must be some text matched in between them. – Wiktor Stribiżew Aug 07 '20 at 11:15
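
The last two comments can be illustrated with a small sketch. It assumes a cut-down sample that contains only the barcode and type fields:

```python
import re

sample = 'No_barcode  FLC'

# Optional named group on its own: nothing consumes the two spaces before
# "FLC", so the group cannot match right after "No_barcode".
bare = re.search(r'(?P<bcode>No_barcode)(?P<type>\w{3})?', sample)

# Non-capturing wrapper: the separator and the field are optional together.
wrapped = re.search(r'(?P<bcode>No_barcode)(?:\s+(?P<type>\w{3}))?', sample)

# A capturing wrapper also works, but adds an extra unnamed entry to groups().
capturing = re.search(r'(?P<bcode>No_barcode)(\s+(?P<type>\w{3}))?', sample)

print(bare.groupdict())     # {'bcode': 'No_barcode', 'type': None}
print(wrapped.groups())     # ('No_barcode', 'FLC')
print(capturing.groups())   # ('No_barcode', '  FLC', 'FLC')
```

The wrapper is what lets the separator between fields be skipped together with the field itself; making it non-capturing just keeps groups() free of the extra entry.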