
After extracting text from PDF files with pdftotext, I am trying to recover their section titles and the corresponding contents.

The files in this batch follow a pattern: a new line, then a Roman numeral, optionally followed by a dot or hyphen, then the title, then a line break.

So I tried this pattern:

^[^\S\n]*([CLXVI]{1,7})\.\s?(.*?)\n([\S\s]*)(?=[CLXVI]{1,7})

But it did not work as expected:

https://regex101.com/r/vX4aB4/1

The expected result was something like:

group title -> Breve Síntese da Demanda
group content -> Lorem ipsum dolor ... faucibus.
group title -> Bla Bla bla
group content -> Lorem ipsum dolor ... faucibus.
group title -> Do Mérito
group content -> Lorem ipsum dolor ... commodo.
group title -> Conclusão
group content -> Lorem ipsum dolor ... .

So how can I improve it to properly recover each title and its respective content?

celsowm
    Can you clarify what is the expected result of a regex? Should the regex match a title + its content? – Albina Dec 26 '22 at 13:24

1 Answer


You can use a negative lookahead to keep the content from running past the next title, e.g.

^(\h*+[CLXVI]{1,7}\.)\h*(.+)\s*((?:(?!(?1)).*\R?)*)

See your updated demo at regex101. Use it in (?m) multiline mode.


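The relevant part (?!(?1)) is checked at the start of every content line and stops the content group as soon as a line matches the first group's pattern, i.e. the next title.
This is a PCRE regex: it uses a subroutine call to group 1, (?1), and a possessive quantifier, *+, as well as the PCRE escapes \h (horizontal whitespace) and \R (any line break).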
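
If you are not on a PCRE engine, the same idea can be spelled out by hand. Below is a minimal sketch for Python's built-in re module, which has no (?1), \h or \R. It assumes plain "\n" line endings and, like the pattern above, requires a dot after the Roman numeral. The names HEADING, SECTION_RE and split_sections and the sample text are made up for illustration.

```python
import re

# Heading marker: optional horizontal whitespace, a Roman numeral, then a dot.
# [^\S\n] ("whitespace except newline") stands in for PCRE's \h.
HEADING = r"[^\S\n]*[CLXVI]{1,7}\."

# The heading pattern is repeated by concatenation instead of a (?1) call.
SECTION_RE = re.compile(
    "^(" + HEADING + ")"                      # group 1: numeral with its dot
    + r"[^\S\n]*(.+)\s*"                      # group 2: title (rest of the line)
    + "((?:(?!" + HEADING + r").*\n?)*)",     # group 3: lines until next heading
    re.MULTILINE,
)


def split_sections(text):
    """Return a list of (title, content) pairs, one per numbered section."""
    return [
        (m.group(2).strip(), m.group(3).strip())
        for m in SECTION_RE.finditer(text)
    ]


# Hypothetical sample in the shape described in the question.
SAMPLE = """I. Breve Síntese da Demanda
Lorem ipsum dolor sit amet.
Integer faucibus.
II. Do Mérito
Lorem ipsum commodo.
III. Conclusão
Lorem ipsum.
"""

for title, content in split_sections(SAMPLE):
    print("group title ->", title)
    print("group content ->", content)
```

Note that a content line such as "Lorem ..." starts with a letter from [CLXVI] but is not taken for a heading, because the lookahead also requires the dot right after the numeral.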

bobble bubble