I have text extracted from a PDF that looks like this:

========================================

TITLE

subtitle

Lorem Ipsum is simply dummy text of the printing

and typesetting industry. Lorem Ipsum has been

the industry's standard dummy text ever since the 1500s.

subtitle

Lorem Ipsum is simply dummy text of the printing and

typesetting industry. Lorem Ipsum has been the industry's

standard dummy text ever since the 1500s.

========================================

There is a newline ('\n') at the end of each line.

I am trying to find a given sentence using regex and extract the paragraph in which it was found. A paragraph is anything between two consecutive new lines (\n\n). Note that it has to be done using the lazy method.

FYI:

  1. The sentence can start in a line and end in another

  2. I cannot change the given text format

  3. There is a limit on the number of lines to return, so if I can't find \n\n within 10 lines up or down, I have to return the 10 lines before and the 10 lines after the regex keyword
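To make constraint 3 concrete, here is a minimal sketch of the line-window step, assuming the match has already been found and its character offsets are known. The function name `window_around` and the parameter `max_lines` are hypothetical, and the cap is applied independently in each direction, which is one possible reading of the requirement.

```python
import re

def window_around(text, start, end, max_lines=10):
    """Return the lines surrounding text[start:end], stopping at a blank
    line or after max_lines lines in either direction (hypothetical helper)."""
    lines = text.split("\n")
    # Index of the line where the match starts and of the line where it ends.
    first = text.count("\n", 0, start)
    last = text.count("\n", 0, max(start, end - 1))

    # Walk upwards until a blank line, the top of the text, or the limit.
    top = first
    while top > 0 and first - top < max_lines and lines[top - 1].strip():
        top -= 1

    # Walk downwards in the same way.
    bottom = last
    while (bottom < len(lines) - 1 and bottom - last < max_lines
           and lines[bottom + 1].strip()):
        bottom += 1

    return "\n".join(lines[top:bottom + 1])

if __name__ == "__main__":
    sample = "a\nb\n\nfoo bar\nbaz qux\n\nc"
    m = re.search(r"bar\nbaz", sample)
    print(window_around(sample, m.start(), m.end()))
```

Working on line indices rather than on one large regex keeps the 10-line limit explicit and easy to change.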

Jason Aller
  • Why do you need to use a regex? This doesn't sound like a regex problem. – lxop May 25 '20 at 20:30
  • Please elaborate on what you mean by the lazy method. – Askold Ilvento May 25 '20 at 20:38
  • This sounds like homework/assignment - is it? What code have you written? – DisappointedByUnaccountableMod May 25 '20 at 20:46
  • It is actually an automation I'm working on. I have a robot that extracts pages of PDF files and parses them to a string. After that, I have to find a sentence inside my parsed text and return the paragraph in which this sentence is contained. What I had in mind was: I first find the line containing my sentence and then append the lines above and below until I find an empty line (only a \n) – Bruno Neves May 26 '20 at 21:01
  • So, I have to use a regex to find the sentence inside the text. One of the problems is that my sentence can start on one line and end on another, so I am not able to search for it line by line; it must be more complex than that – Bruno Neves May 26 '20 at 21:04

1 Answer

Something like this might get you started:

import re

data = """
ggg

aaa aaa aaa
more bla...

========================================

TITLE

subtitle

Lorem Ipsum is simply dummy text of the printing

and typesetting industry. Lorem Ipsum has been

the industry's standard dummy text ever since the 1500s.

subtitle

Lorem Ipsum is simply more bla of the printing and

typesetting industry. Lorem Ipsum has been the industry's

standard dummy text ever since the 1500s.

========================================

bla bla bla bla bla
more bla...

yet more bla
"""

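if __name__ == "__main__":
    to_search = "more bla"
    # A paragraph body is a run of non-newline characters and single newlines;
    # the lookarounds are there to stop the run at a blank line (two successive \n).
    body = r"(?:(?<!^\n)\n(?!^\n)|[^\n])*"
    pattern = body + re.escape(to_search) + body
    print(re.findall(pattern, data, re.DOTALL | re.MULTILINE | re.IGNORECASE))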

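The important parts are the MULTILINE flag, which lets ^ inside the lookarounds match at the start of every line, and the lookarounds themselves, which detect two successive \n characters. DOTALL only changes how . behaves, and this pattern contains no ., so it is harmless but not required. Note that re.escape(to_search) matches the search text literally, so a sentence that wraps onto the next line will not be found as written.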
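For the case where the sentence starts on one line and ends on the next, one possible variation is to build the pattern word by word and allow a single line break between words. This is only a sketch under that assumption: the names `sentence_pattern` and `paragraphs_containing` are hypothetical, and it splits the text on blank lines instead of using lookarounds, which is a different technique from the one above.

```python
import re

def sentence_pattern(sentence):
    """Compile a pattern that matches `sentence` even when it wraps
    onto the next line (hypothetical helper)."""
    # Between two words accept spaces/tabs, or one line break with optional padding.
    gap = r"(?:[ \t]+|[ \t]*\n[ \t]*)"
    words = [re.escape(word) for word in sentence.split()]
    return re.compile(gap.join(words), re.IGNORECASE)

def paragraphs_containing(text, sentence):
    """Return every blank-line-delimited paragraph in which `sentence` occurs."""
    pattern = sentence_pattern(sentence)
    # One or more blank lines separate paragraphs.
    paragraphs = re.split(r"\n(?:[ \t]*\n)+", text.strip("\n"))
    return [p for p in paragraphs if pattern.search(p)]

if __name__ == "__main__":
    sample = "ggg\n\naaa aaa aaa\nmore bla...\n\nyet more bla\n"
    # The searched words sit on two different lines of the same paragraph.
    print(paragraphs_containing(sample, "aaa more bla"))
```

Because the gap allows at most one newline, a match cannot run across a blank line into the next paragraph.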

mrxra