I have the following paragraphs:

This is paragraph #1
New-York, London, Paris, Berlin
Some other text
End of paragraph

This is paragraph #2
London, Paris
End of paragraph

This is paragraph #3
New-York, Paris, Berlin
Some other text
End of paragraph

This is paragraph #4
End of paragraph

This is paragraph #5
Paris, Berlin
Some other text
End of paragraph

How can I use a regex to match the paragraphs containing, e.g., New-York (#1 and #3) or London (#1 and #2), or even both New-York AND Berlin (#1 and #3)?

I found an answer on Stack Overflow:

How match a paragraph using regex

which allows me to match the paragraphs (all the text between two blank lines).

But I cannot figure out (my regex skills are… limited) how to match the paragraphs containing a specific pattern, and only those paragraphs.

Thanks in advance for your help

NB: the idea is to use the answer in the Editorial iOS app to fold the paragraphs NOT containing the pattern.

Wiktor Stribiżew
ThG
  • Which programming language do you use? It might be easier to split the paragraphs first (on empty lines) and then look for `New-York` in them. – Jan Nov 21 '17 at 13:01
  • Which flavor of regex? Python? Do you have to use regex in one line? The answer you link to splits on "\n\n". – kabanus Nov 21 '17 at 13:06
  • @kabanus : Python – ThG Nov 21 '17 at 13:07
  • @Jan : I do not want to split the paragraphs: I want to keep the entire paragraphs containing the specified pattern, and those paragraphs only. – ThG Nov 21 '17 at 13:11
  • Just a couple of raw ideas: https://regex101.com/r/pWP0CK/1, https://regex101.com/r/pWP0CK/2 and https://regex101.com/r/pWP0CK/3 – Wiktor Stribiżew Nov 21 '17 at 13:40
  • A bit optimized... https://regex101.com/r/BWZkb9/1/ Now, you see it is better to resort to the code to check for things like the city names. Or at least run 2 separate regexes, one to extract and the other to filter. Or use the `regex` module as Jan suggests. – Wiktor Stribiżew Nov 21 '17 at 13:47
  • @Wiktor Stribiżew : superb! Could you make it an answer so that I can accept it? (I have thanked Jan by marking his answer as useful.) Thanks a lot – ThG Nov 21 '17 at 13:57

2 Answers

I see you might have no access to the Python code itself if you plan to use the pattern in the Editorial iOS app.

In that case, all I can suggest is a pattern like

(?m)^(?=.*(?:\r?\n(?!\r?\n).*)*?\bNew-York\b)(?=.*(?:\r?\n(?!\r?\n).*)*?\bBerlin\b).*(?:\r?\n(?!\r?\n).*)*

See the regex demo. In short, a match can only start at the beginning of a line (^ with the (?m) modifier). Two lookaheads then check that New-York and Berlin both appear as whole words (thanks to the \b word boundaries) somewhere on the lines before the first double line break; if both are present, those lines are matched.

Details

  • (?m)^ - start of the line
  • (?=.*(?:\r?\n(?!\r?\n).*)*?\bNew-York\b) - a positive lookahead that makes sure the whole word New-York appears somewhere after 0+ chars other than line break chars (.*), optionally followed by 0+ consecutive sequences of a CRLF/LF line break (one that is not followed by another CRLF/LF line break) plus the rest of that line
  • (?=.*(?:\r?\n(?!\r?\n).*)*?\bBerlin\b) - a second positive lookahead that requires the whole word Berlin in the same way, i.e. anywhere before the first double line break
  • .* - match the line
  • (?:\r?\n(?!\r?\n).*)* - match 0+ consecutive occurrences of:
    • \r?\n(?!\r?\n) - a line break (CRLF or LF) not followed with another CRLF or LF
    • .* - the rest of the line.
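Since the question mentions the Python flavor, here is a minimal sketch of how the pattern above could be tried out with the standard `re` module. The helper names `paragraph_pattern` and `first_lines` are hypothetical and only serve the illustration; the sample text is the one from the question. Note that a match may begin on the blank separator line, so the sketch strips surrounding whitespace before inspecting each match. Putting several words in one lookahead, separated by `|`, gives the OR variant.

```python
import re

# Sample text from the question: paragraphs separated by one blank line.
TEXT = "\n\n".join([
    "This is paragraph #1\nNew-York, London, Paris, Berlin\nSome other text\nEnd of paragraph",
    "This is paragraph #2\nLondon, Paris\nEnd of paragraph",
    "This is paragraph #3\nNew-York, Paris, Berlin\nSome other text\nEnd of paragraph",
    "This is paragraph #4\nEnd of paragraph",
    "This is paragraph #5\nParis, Berlin\nSome other text\nEnd of paragraph",
])

# Remaining lines of a paragraph: each one is preceded by a single line
# break that is not itself followed by another line break.
REST = r"(?:\r?\n(?!\r?\n).*)*"


def paragraph_pattern(*alternatives):
    """Hypothetical helper: build one lookahead per argument (AND logic).

    Each argument is a regex alternation such as "New-York|London" (OR logic).
    """
    checks = "".join(rf"(?=.*{REST}?\b(?:{alt})\b)" for alt in alternatives)
    return re.compile(rf"(?m)^{checks}.*{REST}")


def first_lines(pattern, text):
    """Return the first line of every matched paragraph, for easy inspection."""
    return [m.group(0).strip().splitlines()[0] for m in pattern.finditer(text)]


both = paragraph_pattern("New-York", "Berlin")   # New-York AND Berlin
either = paragraph_pattern("New-York|London")    # New-York OR London

print(first_lines(both, TEXT))
print(first_lines(either, TEXT))
```

With the sample text, the first call should report paragraphs #1 and #3, and the second one paragraphs #1, #2 and #3.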
Wiktor Stribiżew

Using the newer regex module, which supports splitting on empty matches:

import regex as re

string = """
This is paragraph #1
New-York, London, Paris, Berlin
Some other text
End of paragraph

This is paragraph #2
London, Paris
End of paragraph

This is paragraph #3
New-York, Paris, Berlin
Some other text
End of paragraph

This is paragraph #4
End of paragraph

This is paragraph #5
Paris, Berlin
Some other text
End of paragraph
"""

rx = re.compile(r'^$', flags = re.MULTILINE | re.VERSION1)

needle = 'New-York'

interesting = [part 
    for part in rx.split(string)
    if needle in part]

print(interesting)
# ['\nThis is paragraph #1\nNew-York, London, Paris, Berlin\nSome other text\nEnd of paragraph\n', '\nThis is paragraph #3\nNew-York, Paris, Berlin\nSome other text\nEnd of paragraph\n']
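If the `regex` module is not available (the comment below suggests this may be the case in Pythonista), the same split-and-filter idea can be sketched with the standard `re` module only. This is an assumption-laden variant, not the code above: the function name `paragraphs_containing` and its `require_all` parameter are hypothetical, and paragraphs are assumed to be separated by one or more blank lines.

```python
import re

text = "\n\n".join([
    "This is paragraph #1\nNew-York, London, Paris, Berlin\nSome other text\nEnd of paragraph",
    "This is paragraph #2\nLondon, Paris\nEnd of paragraph",
    "This is paragraph #3\nNew-York, Paris, Berlin\nSome other text\nEnd of paragraph",
    "This is paragraph #4\nEnd of paragraph",
    "This is paragraph #5\nParis, Berlin\nSome other text\nEnd of paragraph",
])


def paragraphs_containing(text, needles, require_all=True):
    """Return whole paragraphs that contain all (or any) of the needles."""
    # Split on runs of two or more line breaks; CRLF endings are tolerated.
    parts = [p for p in re.split(r"(?:\r?\n){2,}", text.strip()) if p]
    # all() implements AND, any() implements OR.
    check = all if require_all else any
    return [p for p in parts if check(needle in p for needle in needles)]


# New-York AND Berlin
for paragraph in paragraphs_containing(text, ["New-York", "Berlin"]):
    print(paragraph.splitlines()[0])

# New-York OR London
for paragraph in paragraphs_containing(text, ["New-York", "London"], require_all=False):
    print(paragraph.splitlines()[0])
```

Each returned item is the entire paragraph, so nothing is lost by splitting first; the plain substring test could be swapped for a `\b`-delimited regex if whole-word matching matters.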
Jan
  • 1) Thank you for your answer. 2) I tried it in Pythonista (same developer as Editorial; btw, Editorial can use Python scripts) and ran into problems because, I think, it does not seem to support the newer regex module. 3) Your answer seems to mean that there is no pure regex (PCRE) solution. – ThG Nov 21 '17 at 13:30