Extract text with spaces followed by string or end of line

Question

Using regex, I need to get the expected output:

string = "Tue Apr 24 22:35:48 2018 53/e33 \nhello:55|Wordcap|abc|abc generate|6|Wordcapdata_proto_req=WINTER Wordcapdata_sample_resp=summer 2.4.5 WordcapTotal_reject=56 WordcapAddition_sum=TEA CUP ONE"

Expected output = ['data_proto_req=WINTER', 'data_sample_resp=summer 2.4.5', 'Total_reject=56', 'Addition_sum=TEA CUP ONE']

The problem is handling the spaces inside values such as summer 2.4.5 or TEA CUP ONE.

This is my initial attempt at the regex:

print re.findall(r'[W]*ordcap([^|].*?=.*?)[\s|\t]*(?:W|$)', string)

The output I'm getting is :

['data_proto_req=', 'data_sample_resp=summer 2.4.5', 'Total_reject=56', 'Addition_sum=TEA CUP ONE']

Try [`r'\bWordcap([^|=]*?=[^|]*?)(?=Wordcap|$)'`](https://regex101.com/r/zGs5rA/2) — Wiktor Stribiżew, Apr 25 '18 at 07:32
@cdarke That `W*` is used because the `(?:W|$)` is consuming the `W` in the next match. Actually, that is why I suggest a lookahead here. — Wiktor Stribiżew, Apr 25 '18 at 07:34
Hi @WiktorStribiżew It almost worked, thanks for sharing. I modified it to `r'\bWordcap([^|]*?=.*?)\s*(?=Wordcap|$)'` and now it's working. I wonder what `(?=)` does, as opposed to `(?:)`. — Tanmay Sawant, Apr 25 '18 at 07:39
It is a positive lookahead. It does not consume chars, so there is no need making `W` optional. I added the information to [my answer](https://stackoverflow.com/a/50016565/3832970). — Wiktor Stribiżew, Apr 25 '18 at 08:44

Wiktor Stribiżew · Accepted Answer · 2018-04-25T07:43:30.630

1

Note that (?:W|$) consumes the W that begins the next match, which is why you had to write [W]*. It also matches the W of WINTER, which is why the first item is cut off right after the = sign. This is exactly the case where lookarounds should be used: they do not consume text, they only check whether there is a match, without adding the matched text to the match value.
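To see the difference in isolation, here is a minimal sketch on a shortened, made-up input (the sample string and the variable names are assumptions for illustration, not the asker's data). The first pattern ends with a non-capturing group, which consumes the next Wordcap, so every other item is skipped. The second ends with a lookahead, which leaves the next Wordcap available for the following match.

```python
import re

# Hypothetical shortened input, for illustration only.
sample = "Wordcapa=1 Wordcapb=2 Wordcapc=3"

# (?:...) consumes the following "Wordcap", so the scan resumes after it
# and the item that started there is lost.
consuming = re.findall(r"Wordcap(.*?)\s*(?:Wordcap|$)", sample)

# (?=...) only checks that "Wordcap" (or the end) follows; it consumes nothing.
lookahead = re.findall(r"Wordcap(.*?)\s*(?=Wordcap|$)", sample)

print(consuming)  # ['a=1', 'c=3']
print(lookahead)  # ['a=1', 'b=2', 'c=3']
```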

You may use

\bWordcap([^|=]*=.*?)(?=\s*\bWordcap|$)

See the regex demo

Details

\bWordcap - a word boundary followed by Wordcap
([^|=]*=.*?) - Group 1:
- [^|=]* - any 0+ chars other than | and =, as many as possible
- = - an = sign
- .*? - any 0+ chars other than a newline, as few as possible
(?=\s*\bWordcap|$) - a positive lookahead that requires 0+ whitespace chars, a word boundary, and the string Wordcap immediately to the right of the current location, or the end of the string.
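The breakdown above can also be written directly into the pattern. The following is a sketch of the same regex using re.VERBOSE, so that each part carries its own comment; the shortened input string is an assumption made for brevity.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part is labeled.
rx = re.compile(r"""
    \bWordcap            # word boundary, then the literal marker
    (                    # Group 1 (the value returned by findall):
        [^|=]*           #   key: chars other than | and =, greedy
        =                #   the = sign
        .*?              #   value: any chars but newline, lazy
    )
    (?=                  # lookahead, consumes nothing:
        \s*\bWordcap     #   optional whitespace, then the next marker
        |$               #   or the end of the string
    )
""", re.VERBOSE)

# Hypothetical shortened input, for illustration only.
s = "6|Wordcapdata_proto_req=WINTER Wordcapdata_sample_resp=summer 2.4.5"
result = rx.findall(s)
print(result)  # ['data_proto_req=WINTER', 'data_sample_resp=summer 2.4.5']
```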

See the Python demo:

import re
rx = r"\bWordcap([^|=]*=.*?)(?=\s*\bWordcap|$)"
s = "Tue Apr 24 22:35:48 2018 53/e33 \nhello:55|Wordcap|abc|abc generate|6|Wordcapdata_proto_req=WINTER Wordcapdata_sample_resp=summer 2.4.5 WordcapTotal_reject=56 WordcapAddition_sum=TEA CUP ONE"
print(re.findall(rx, s))
# => ['data_proto_req=WINTER', 'data_sample_resp=summer 2.4.5', 'Total_reject=56', 'Addition_sum=TEA CUP ONE']

edited Apr 25 '18 at 07:43

answered Apr 25 '18 at 07:37

Wiktor Stribiżew


Hi @Wiktor thanks for sharing the solution. Suppose I have "Wordcapdata_drive=\\\\10.10.10.2\\RECEIVE" in the input string. re.findall returns it as "\\\\\\\\10.10.10.2\\\\RECEIVE". I'm not sure why it adds the extra backslash characters! It seems to escape each backslash, but I don't need the extra ones. What might be the issue? – Tanmay Sawant Apr 25 '18 at 08:18
@TanmaySawant There are no "extra" slashes, you just see them in the console. See [Why do backslashes appear twice?](https://stackoverflow.com/questions/24085680/why-do-backslashes-appear-twice) – Wiktor Stribiżew Apr 25 '18 at 08:21
Hi @Wiktor Suppose that after parsing with re.findall I get the output list1=["\\\\\\\\10.10.10.2\\\\RECEIVE"] and I have an input list list2=['\\\\10.10.10.2\\RECEIVE']. Then print list2[0] in list1 gives False. So list1 has the additional backslashes, and comparing list2 with list1 yields False, although the two strings should be the same. – Tanmay Sawant Apr 25 '18 at 08:57
@TanmaySawant Please share the whole relevant code, preferably via http://ideone.com. Note I have no idea how your `list2` is created. If manually, you should know how to insert `\` into a string literal. A single backslash is "coded" as either `"a\\b"` (regular string literal) or `r"a\b"` (raw string literal). – Wiktor Stribiżew Apr 25 '18 at 09:05
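The point about displayed versus actual backslashes can be checked with a short sketch. The input below is a hypothetical stand-in for the asker's data (a UNC-style path with two leading backslashes), and the variable names are made up for illustration.

```python
import re

# Hypothetical input; the raw literal holds two backslashes before the
# address and one before RECEIVE.
s = r"Wordcapdata_drive=\\10.10.10.2\RECEIVE"

found = re.findall(r"\bWordcap([^|=]*=.*?)(?=\s*\bWordcap|$)", s)

print(found)      # list display uses repr(), so each backslash is shown doubled
print(found[0])   # print() of the string itself shows the real characters

# A regular literal with escaped backslashes is the same string as the raw one.
expected = "data_drive=\\\\10.10.10.2\\RECEIVE"
same = found[0] == expected
print(same, found[0].count("\\"))  # True 3
```

If a comparison like this returns False, the two strings really differ, which usually points at how one of them was written as a literal rather than at re.findall.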

BcK · Answer 2 · 2018-04-25T07:48:38.680

0

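Here is your solution:

import re

myStr = ("Tue Apr 24 22:35:48 2018 53/e33 hello:55|"
         "Wordcap|abc|abc generate|6|Wordcapdata_proto_req=WINTER "
         "Wordcapdata_sample_resp=summer 2.4.5 WordcapTotal_reject=56 "
         "WordcapAddition_sum=TEA CUP ONE")

# The lookbehind anchors each match right after "Wordcap"; the lookahead
# stops it before the next " Wordcap" or at the end of the string.
print(re.findall(r'(?<=Wordcap)[^|]*?(?= Wordcap|$)', myStr))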

# ['data_proto_req=WINTER', 'data_sample_resp=summer 2.4.5',
# 'Total_reject=56', 'Addition_sum=TEA CUP ONE']

edited Apr 25 '18 at 07:48

answered Apr 25 '18 at 07:37

BcK


Richard Inglis · Answer 3 · answered Apr 25 '18 at 07:38

0

print(re.findall(r'Wordcap([^|].*?=.*?)(?= Wordcap|$)', string))

gives

['data_proto_req=WINTER', 'data_sample_resp=summer 2.4.5', 'Total_reject=56', 'Addition_sum=TEA CUP ONE']
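As a quick cross-check, the patterns from the three answers can be run side by side on the question's string. This is only a sketch: the dictionary labels are made up for illustration, and the string is split across literals purely for readability.

```python
import re

s = ("Tue Apr 24 22:35:48 2018 53/e33 \nhello:55|Wordcap|abc|abc generate|6|"
     "Wordcapdata_proto_req=WINTER Wordcapdata_sample_resp=summer 2.4.5 "
     "WordcapTotal_reject=56 WordcapAddition_sum=TEA CUP ONE")

# Labels are illustrative names, not taken from the answers.
patterns = {
    "capture + lookahead, any whitespace": r"\bWordcap([^|=]*=.*?)(?=\s*\bWordcap|$)",
    "lookbehind + lookahead": r"(?<=Wordcap)[^|]*?(?= Wordcap|$)",
    "capture + lookahead, single space": r"Wordcap([^|].*?=.*?)(?= Wordcap|$)",
}

results = {name: re.findall(rx, s) for name, rx in patterns.items()}
for name, found in results.items():
    print(name, found)
```

All three agree on this input. The two patterns whose lookahead starts with a literal space expect exactly one space before the next Wordcap; with more whitespace between items they would keep the extra whitespace in the match, whereas the `\s*` variant would not.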

answered Apr 25 '18 at 07:38

Richard Inglis

