
Using a regex, I need to get the expected output from this string:

string = "Tue Apr 24 22:35:48 2018 53/e33 \nhello:55|Wordcap|abc|abc generate|6|Wordcapdata_proto_req=WINTER Wordcapdata_sample_resp=summer 2.4.5 WordcapTotal_reject=56 WordcapAddition_sum=TEA CUP ONE"

Expected output = ['data_proto_req=WINTER', 'data_sample_resp=summer 2.4.5', 'Total_reject=56', 'Addition_sum=TEA CUP ONE']

The problem is dealing with the spaces inside values such as summer 2.4.5 or TEA CUP ONE.

This is my initial attempt at the regex:

print re.findall(r'[W]*ordcap([^|].*?=.*?)[\s|\t]*(?:W|$)', string)

The output I'm getting is :

['data_proto_req=', 'data_sample_resp=summer 2.4.5', 'Total_reject=56', 'Addition_sum=TEA CUP ONE']
  • Try [`r'\bWordcap([^|=]*?=[^|]*?)(?=Wordcap|$)'`](https://regex101.com/r/zGs5rA/2) – Wiktor Stribiżew Apr 25 '18 at 07:32
  • @cdarke That `W*` is used because the `(?:W|$)` is consuming the `W` in the next match. Actually, that is why I suggest a lookahead here. – Wiktor Stribiżew Apr 25 '18 at 07:34
  • Hi @WiktorStribiżew, it almost worked, thanks for sharing. I modified it to r'\bWordcap([^|]*?=.*?)\s*(?=Wordcap|$)' and now it's working. I wonder what (?=) does, as opposed to (?:). – Tanmay Sawant Apr 25 '18 at 07:39
  • It is a positive lookahead. It does not consume chars, so there is no need making `W` optional. I added the information to [my answer](https://stackoverflow.com/a/50016565/3832970). – Wiktor Stribiżew Apr 25 '18 at 08:44

3 Answers


Note that (?:W|$) consumes the W that starts the next match, which is why you had to add [W]*. This is a case for lookarounds: they do not consume text, they only check whether there is a match, without adding the matched text to the match value.
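A minimal sketch of the difference, using a shortened, made-up input string (`s`, `consuming` and `lookahead` are illustrative names, not part of the question):

```python
import re

# Made-up, shortened input with the same "Wordcap<key>=<value>" shape.
s = "Wordcapa=1 Wordcapb=2 3 Wordcapc=4"

# Non-capturing group: the "W" of the next "Wordcap" becomes part of the
# current match, so the following search starts at "ordcap..." and that
# entry is skipped.
consuming = re.findall(r"Wordcap(.*?=.*?)\s*(?:W|$)", s)

# Lookahead: the next "Wordcap" is only checked, not consumed, so the
# following search can still match it.
lookahead = re.findall(r"Wordcap(.*?=.*?)(?=\s*Wordcap|$)", s)

print(consuming)  # ['a=1', 'c=4']
print(lookahead)  # ['a=1', 'b=2 3', 'c=4']
```

With the consuming version, every second entry is lost unless the leading W is made optional, which is what [W]* was compensating for.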

You may use

\bWordcap([^|=]*=.*?)(?=\s*\bWordcap|$)

See the regex demo

Details

  • \bWordcap - a word boundary followed by Wordcap
  • ([^|=]*=.*?) - Group 1:
    • [^|=]* - any 0+ chars other than | and =, as many as possible
    • = - an = sign
    • .*? - any 0+ chars other than a newline, as few as possible
  • (?=\s*\bWordcap|$) - a positive lookahead that requires 0+ whitespaces, a word boundary and Wordcap string immediately to the right of the current location, or end of string.
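The breakdown above can also be written into the pattern itself with re.VERBOSE. This is only an illustrative sketch: the names `rx_verbose` and `sample` are made up here, and `sample` is a shortened version of the string from the question.

```python
import re

rx_verbose = re.compile(r"""
    \bWordcap        # word boundary, then the literal marker
    (                # group 1: the text that findall returns
      [^|=]*         #   key: any chars other than | and =
      =              #   the = sign
      .*?            #   value: as few chars as possible, no newline
    )
    (?=              # positive lookahead: checked but not consumed
      \s*\bWordcap   #   optional whitespace, then the next marker
      | $            #   or the end of the string
    )
""", re.VERBOSE)

# Shortened sample, assumed to have the same shape as the original input.
sample = ("x|Wordcap|abc|6|Wordcapdata_proto_req=WINTER "
          "Wordcapdata_sample_resp=summer 2.4.5 WordcapTotal_reject=56")

result = rx_verbose.findall(sample)
print(result)
# ['data_proto_req=WINTER', 'data_sample_resp=summer 2.4.5', 'Total_reject=56']
```

The first Wordcap in `sample` is followed by |, so [^|=]* matches nothing there, the = cannot match, and that occurrence is skipped.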

See the Python demo:

import re
rx = r"\bWordcap([^|=]*=.*?)(?=\s*\bWordcap|$)"
s = "Tue Apr 24 22:35:48 2018 53/e33 \nhello:55|Wordcap|abc|abc generate|6|Wordcapdata_proto_req=WINTER Wordcapdata_sample_resp=summer 2.4.5 WordcapTotal_reject=56 WordcapAddition_sum=TEA CUP ONE"
print(re.findall(rx, s))
# => ['data_proto_req=WINTER', 'data_sample_resp=summer 2.4.5', 'Total_reject=56', 'Addition_sum=TEA CUP ONE']
Wiktor Stribiżew
  • Hi @Wiktor, thanks for sharing the solution. Suppose I have "Wordcapdata_drive=\\\\10.10.10.2\\RECEIVE" in the input string. re.findall returns it as "\\\\\\\\10.10.10.2\\\\RECEIVE". I'm not sure why it adds the extra backslash characters! It seems to escape the backslashes, but I don't need the extra ones. What might be the issue? – Tanmay Sawant Apr 25 '18 at 08:18
  • @TanmaySawant There are no "extra" slashes, you just see them in the console. See [Why do backslashes appear twice?](https://stackoverflow.com/questions/24085680/why-do-backslashes-appear-twice) – Wiktor Stribiżew Apr 25 '18 at 08:21
  • Hi @Wiktor. Suppose that after parsing with re.findall I get list1=["\\\\\\\\10.10.10.2\\\\RECEIVE"], and I have an input list list2=["\\\\10.10.10.2\\RECEIVE"]. Then print list2[0] in list1 gives False. list1 has the additional backslashes, so the comparison fails, but those two strings should be the same. – Tanmay Sawant Apr 25 '18 at 08:57
  • @TanmaySawant Please share the whole relevant code, preferably via http://ideone.com. Note I have no idea how your `list2` is created. If manually, you should know how to insert `\` into a string literal. A single backslash is "coded" as either `"a\\b"` (regular string literal) or `r"a\b"` (raw string literal). – Wiktor Stribiżew Apr 25 '18 at 09:05
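The point made in these comments can be checked with a short sketch. It assumes the real value is a UNC-style path with two leading backslashes; `path`, `same`, `found` and `value` are illustrative names.

```python
import re

path = r"\\10.10.10.2\RECEIVE"    # raw literal: 2 backslashes, then 1
same = "\\\\10.10.10.2\\RECEIVE"  # regular literal for the same string

print(path)        # \\10.10.10.2\RECEIVE
print(repr(path))  # '\\\\10.10.10.2\\RECEIVE'
print(len(path))   # 20

# re.findall returns the characters unchanged; printing the list shows
# the repr of each element, which is where the doubled backslashes appear.
found = re.findall(r"Wordcap([^|=]*=.*)", "Wordcapdata_drive=" + path)
print(found)       # ['data_drive=\\\\10.10.10.2\\RECEIVE']

value = found[0].split("=", 1)[1]
print(value == path == same)  # True
```

If a comparison like the one in the comment returns False, the likely cause is that one of the literals was typed with the doubled form inside a raw string, or the other way round, so the two strings really do differ.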

Here is your solution:

import re

myStr = """Tue Apr 24 22:35:48 2018 53/e33 hello:55| 
Wordcap|abc|abc generate|6|Wordcapdata_proto_req=WINTER 
Wordcapdata_sample_resp=summer 2.4.5 WordcapTotal_reject=56 
WordcapAddition_sum=TEA CUP ONE"""

# \s+ rather than a literal space: this string has line breaks before some Wordcap markers
print(re.findall(r'(?<=Wordcap)[^|]*?(?=\s+Wordcap|$)', myStr))

# ['data_proto_req=WINTER', 'data_sample_resp=summer 2.4.5',
# 'Total_reject=56', 'Addition_sum=TEA CUP ONE']
BcK
print(re.findall(r'Wordcap([^|].*?=.*?)(?= Wordcap|$)', string))

gives

['data_proto_req=WINTER', 'data_sample_resp=summer 2.4.5', 'Total_reject=56', 'Addition_sum=TEA CUP ONE']
Richard Inglis