
I have some content in the format:

text = """Pos no
...
... 25/gm
The Text to be 
...
excluded
Pos no
...
... 46 kg
The Text to be 
...
excluded
Pos no
...
... 46 xunit
End of My Text"""

Here, Pos no ... 25/gm is a kind of tabular structure from which I have to extract the values.

The Text to be ... excluded has a constant start (let's say The Text to be) but no definite end, i.e. excluded might not be present.

End of My Text - This text will always be present.

I want a list containing only the tabular content, i.e.

["Pos no
...
... 25/gm",
"Pos no
...
... 46 kg",
"Pos no
...
... 46 xunit"]

Here is my attempt, but it's not returning the right list:

re.findall(r'(Pos no .+?)(?: |The Text to be|End of My Text)', text, re.DOTALL | re.M)
Wiktor Stribiżew
Laxmikant

1 Answer


You may use

re.findall(r'(?sm)(Pos no\r?\n.+?)[\r\n]+(?:The Text to be|End of My Text)', text)

See the Python demo

Note that Pos no is followed by a line break, not a space, but your pattern required a space after it. Also, matching the right-hand context only when it appears at the start of a line makes the match safer.

Pattern details

  • (?sm) - re.DOTALL and re.MULTILINE inline modifiers (for shorter code)
  • (Pos no\r?\n.+?) - Group 1 (what is returned by re.findall):
    • Pos no - a literal substring
    • \r?\n - a CRLF or LF line break
    • .+? - any 1+ chars (including line breaks, because of re.DOTALL), as few as possible, up to the leftmost occurrence of the subsequent subpatterns
  • [\r\n]+ - 1+ line break chars
  • (?:The Text to be|End of My Text) - either of the two substrings, The Text to be or End of My Text.
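
As a quick check, the pattern can be run against the sample from the question. This is a minimal sketch: the sample string is copied from the question with a closing triple quote added, and the names `pattern` and `blocks` are only illustrative.

```python
import re

# Sample input from the question (closing triple quote added).
text = """Pos no
...
... 25/gm
The Text to be 
...
excluded
Pos no
...
... 46 kg
The Text to be 
...
excluded
Pos no
...
... 46 xunit
End of My Text"""

# Group 1 starts at "Pos no" plus a line break and lazily extends until
# one or more line breaks are followed by either terminating substring.
pattern = re.compile(
    r'(?sm)(Pos no\r?\n.+?)[\r\n]+(?:The Text to be|End of My Text)'
)

# With a single capturing group, findall returns the group 1 values.
blocks = pattern.findall(text)

for block in blocks:
    print(repr(block))
```

Each item in `blocks` holds one tabular block, starting at Pos no and ending just before the line break that precedes The Text to be or End of My Text.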
Wiktor Stribiżew
  • [Another demo with the same approach](https://ideone.com/d8X9DS), just different printing of the results. – Wiktor Stribiżew Jun 06 '18 at 14:18
  • Thank you for your efforts. But it looks like it somehow isn't working with the actual client data. My one guess is that the actual data contains `utf-8` characters, so I am wondering whether it makes any difference when the text has `utf-8` characters in it. – Laxmikant Jun 07 '18 at 05:17
  • @Laxmikant Do you mean there are Unicode line breaks? Replace `[\r\n]` with `[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]` and `\r?\n` with `(?:\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])`. Also, are `Pos no`, `The Text to be` and `End of My Text` on separate full lines? Add `\s*` to allow leading or trailing whitespace. See [this regex demo](https://regex101.com/r/IEP3kC/1). – Wiktor Stribiżew Jun 07 '18 at 07:00
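
The suggestion in the last comment could look roughly like the sketch below. It is an assumption about how the pieces fit together, not the content of the linked demo: the helper name `NL`, the placement of `\s*`, and the sample string (which mixes `\u2028` and CRLF line breaks) are all made up for illustration.

```python
import re

# One line break of any kind; CRLF is tried first so it counts as a single break.
# (NL is a hypothetical helper name, not part of the original answer.)
NL = r'(?:\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])'

pattern = re.compile(
    r'(?s)(Pos no\s*' + NL + r'.+?)'              # group 1: one tabular block
    r'\s*' + NL + r'\s*'                          # break, with optional whitespace around it
    r'(?:The Text to be|End of My Text)'          # right-hand context
)

# Made-up sample: U+2028 separators in the first block, CRLF in the second,
# and leading spaces before the final marker.
sample = (
    "Pos no\u2028a\u2028b 25/gm\u2028The Text to be\u2028x\u2028"
    "Pos no\r\nc 46 kg\r\n  End of My Text"
)

blocks = pattern.findall(sample)
print(blocks)
```

In Python 3, `str` values are already Unicode, so the encoding of the source file only matters at the point where the bytes are decoded; the pattern itself works on code points.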