How to match paragraphs containing a specific pattern with regex?

Question

I have the following paragraphs:

This is paragraph #1
New-York, London, Paris, Berlin
Some other text
End of paragraph

This is paragraph #2
London, Paris
End of paragraph

This is paragraph #3
New-York, Paris, Berlin
Some other text
End of paragraph

This is paragraph #4
End of paragraph

This is paragraph #5
Paris, Berlin
Some other text
End of paragraph

How can I, with a regex, match the paragraphs containing e.g. New-York (#1 and #3) or London (#1 and #2)? Or even New-York AND Berlin (#1 and #3)?

I have found an answer on S.O. (How match a paragraph using regex) which allows me to match the paragraphs (all the text between two blank lines).

But I cannot figure out (my regex skills are… limited) how to match the paragraphs containing a specific pattern, and only those paragraphs.

Thanks in advance for your help

NB: the idea is to use the answer in the Editorial iOS app to fold the paragraphs NOT containing the pattern.

Which programming language do you use? It might be easier to split the paragraphs first (on empty lines) and then look for `New-York` in them. — Jan, Nov 21 '17 at 13:01
Which flavor of regex? Python? Do you have to use regex in one line? The answer you link to splits on "\n\n". — kabanus, Nov 21 '17 at 13:06
@Jan: I do not want to split the paragraphs: I want to keep the entire paragraphs containing the specified pattern, and those paragraphs only. — ThG, Nov 21 '17 at 13:11
Just a couple of raw ideas: https://regex101.com/r/pWP0CK/1, https://regex101.com/r/pWP0CK/2 and https://regex101.com/r/pWP0CK/3 — Wiktor Stribiżew, Nov 21 '17 at 13:40
A bit optimized... https://regex101.com/r/BWZkb9/1/ Now, you see it is better to resort to the code to check for things like the city names. Or at least run 2 separate regexes, one to extract and the other to filter. Or use the `regex` module as Jan suggests. — Wiktor Stribiżew, Nov 21 '17 at 13:47
@Wiktor Stribiżew: superb! Could you make it an answer so that I can accept it? (I have thanked Jan by signaling that his answer was useful.) Thanks a lot — ThG, Nov 21 '17 at 13:57

score 4 · Accepted Answer · answered Nov 21 '17 at 14:05

I see you might have no access to the Python code itself if you plan to use the pattern in the Editorial iOS app.

Then, all I can suggest is a pattern like

(?m)^(?=.*(?:\r?\n(?!\r?\n).*)*?\bNew-York\b)(?=.*(?:\r?\n(?!\r?\n).*)*?\bBerlin\b).*(?:\r?\n(?!\r?\n).*)*

See the regex demo. Basically, we only match from the start of a line (`^` with the `(?m)` modifier), check that New-York and Berlin both occur as whole words (due to the `\b` word boundaries) anywhere on the lines before the first double line break and, if they do, match these lines.

Details

`(?m)^` - start of a line
`(?=.*(?:\r?\n(?!\r?\n).*)*?\bNew-York\b)` - a positive lookahead that makes sure there is a whole word New-York anywhere after 0+ chars other than line break chars (`.*`), optionally followed by 0+ consecutive sequences of a CRLF/LF line break that is not followed by another CRLF/LF line break, each followed by the rest of the line
`(?=.*(?:\r?\n(?!\r?\n).*)*?\bBerlin\b)` - a second positive lookahead that makes the same check for the whole word Berlin
`.*` - match the line
`(?:\r?\n(?!\r?\n).*)*` - match 0+ consecutive occurrences of:
- `\r?\n(?!\r?\n)` - a line break (CRLF or LF) not followed by another CRLF or LF
- `.*` - the rest of the line.
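
For readers who can run Python, the pattern above can be tried with the standard `re` module. This is only a minimal sketch and not part of the original answer: the helper names `paragraph_pattern` and `matching_paragraphs` are made up for illustration, and the sample text is the one from the question. Each match is stripped because a match may begin on the blank line that precedes a paragraph.

```python
import re

TEXT = """This is paragraph #1
New-York, London, Paris, Berlin
Some other text
End of paragraph

This is paragraph #2
London, Paris
End of paragraph

This is paragraph #3
New-York, Paris, Berlin
Some other text
End of paragraph

This is paragraph #4
End of paragraph

This is paragraph #5
Paris, Berlin
Some other text
End of paragraph"""

# A line, then further lines as long as the line break is not
# immediately followed by another line break (i.e. a blank line).
BODY = r".*(?:\r?\n(?!\r?\n).*)*"


def paragraph_pattern(*words):
    """Build the pattern with one lookahead per required whole word (hypothetical helper)."""
    lookaheads = "".join(
        r"(?=.*(?:\r?\n(?!\r?\n).*)*?\b" + re.escape(w) + r"\b)" for w in words
    )
    return re.compile(r"(?m)^" + lookaheads + BODY)


def matching_paragraphs(text, *words):
    """Return the paragraphs that contain every word in `words`."""
    # strip() drops the leading line break of matches that start on a blank line.
    return [m.group().strip() for m in paragraph_pattern(*words).finditer(text)]


for p in matching_paragraphs(TEXT, "New-York", "Berlin"):
    print(p.splitlines()[0])
# This is paragraph #1
# This is paragraph #3
```

Passing a single word, e.g. `matching_paragraphs(TEXT, "London")`, builds the same pattern with only one lookahead.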

score 2 · Answer 2 · answered Nov 21 '17 at 13:13

Using the newer regex module which supports empty splits:

import regex as re

string = """
This is paragraph #1
New-York, London, Paris, Berlin
Some other text
End of paragraph

This is paragraph #2
London, Paris
End of paragraph

This is paragraph #3
New-York, Paris, Berlin
Some other text
End of paragraph

This is paragraph #4
End of paragraph

This is paragraph #5
Paris, Berlin
Some other text
End of paragraph
"""

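# ^$ matches an empty line in multiline mode; VERSION1 lets split()
# cut the string at such zero-width matches.
rx = re.compile(r'^$', flags=re.MULTILINE | re.VERSION1)

needle = 'New-York'

# Keep only the chunks that contain the needle.
interesting = [part
    for part in rx.split(string)
    if needle in part]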

print(interesting)
# ['\nThis is paragraph #1\nNew-York, London, Paris, Berlin\nSome other text\nEnd of paragraph\n', '\nThis is paragraph #3\nNew-York, Paris, Berlin\nSome other text\nEnd of paragraph\n']

1) Thank you for your answer. 2) I tried it in Pythonista (same developer as Editorial; btw, Editorial can use Python scripts) and ran into problems because, I think, it does not support the newer regex module. 3) Your answer seems to mean that there is no pure regex (PCRE) solution. — ThG, Nov 21 '17 at 13:30
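
Where the third-party `regex` module is not available, as the comment above reports for Pythonista, the same split-then-filter idea can be sketched with the standard `re` module. This is an illustration and not part of the original answer: the function name `paragraphs_containing` and its `require_all` parameter are hypothetical, the sample is shortened from the question's text, and paragraphs are assumed to be separated by two or more consecutive line breaks (a "blank" line containing spaces would not count as a separator).

```python
import re


def paragraphs_containing(text, needles, require_all=True):
    """Return the paragraphs of `text` containing all (or any) of `needles`."""
    # Split on runs of two or more line breaks (LF or CRLF).
    paragraphs = re.split(r"(?:\r?\n){2,}", text.strip())
    check = all if require_all else any
    return [p for p in paragraphs if check(n in p for n in needles)]


sample = (
    "This is paragraph #1\nNew-York, London, Paris, Berlin\nEnd of paragraph\n\n"
    "This is paragraph #2\nLondon, Paris\nEnd of paragraph\n\n"
    "This is paragraph #3\nNew-York, Paris, Berlin\nEnd of paragraph"
)

both = paragraphs_containing(sample, ["New-York", "Berlin"])
either = paragraphs_containing(sample, ["New-York", "London"], require_all=False)
print(len(both), len(either))  # 2 3
```

Plain substring tests (`n in p`) are used here for brevity; whole-word matching as in the accepted answer would need `re.search` with `\b` boundaries instead.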
