Regex: How to recover roman numbered titles and their respective contents

Question

After extracting text from PDF files with pdftotext, I am trying to recover some of their titles and the respective contents.

This batch of files follows a pattern: a new line, then a Roman numeral, optionally followed by a dot or hyphen, then the title, then a line break.

So I tried this pattern:

^[^\S\n]*([CLXVI]{1,7})\.\s?(.*?)\n([\S\s]*)(?=[CLXVI]{1,7})

But it did not work as expected:

https://regex101.com/r/vX4aB4/1

The expected result was something like:

group title -> Breve Síntese da Demanda
group content -> Lorem ipsum dolor ... faucibus.
group title -> Bla Bla bla
group content -> Lorem ipsum dolor ... faucibus.
group title -> Do Mérito
group content -> Lorem ipsum dolor ... commodo.
group title -> Conclusão
group content -> Lorem ipsum dolor ... .

So how can I improve it to properly recover each title and its respective content?

Can you clarify what is the expected result of a regex? Should the regex match a title + its content? — Albina, Dec 26 '22 at 13:24

bobble bubble · Accepted Answer · 2022-12-26T14:29:44.000

You can use a negative lookahead to keep the content from running past the next title, e.g.

^(\h*+[CLXVI]{1,7}\.)\h*(.+)\s*((?:(?!(?1)).*\R?)*)

See your updated demo at regex101. Use it in (?m) multiline mode.

The relevant part, (?!(?1)), stops the content group from skipping over a line that matches the first group's pattern.
This is a PCRE regex: it uses a group reference (a subroutine call to group 1) and a possessive quantifier.
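
For anyone not on PCRE, here is a rough Python sketch of the same idea. Python's built-in re module has no (?1) subroutine calls, \h or \R, so the numeral sub-pattern is kept in a string and interpolated twice, [ \t] stands in for \h, and \n for \R. The names NUMERAL, SECTION_RE, split_sections and the sample text are made up for illustration; this is an approximation under those assumptions, not a tested port of the demo.

```python
import re

# Heading prefix: optional indentation, a Roman numeral, then a dot.
# As in the PCRE pattern, only the character set is checked,
# not whether the numeral is well-formed.
NUMERAL = r"[ \t]*[CLXVI]{1,7}\."

SECTION_RE = re.compile(
    rf"^({NUMERAL})[ \t]*(.+)\s*"      # numeral and title line
    rf"((?:(?!{NUMERAL}).*\n?)*)",     # following lines, up to the next heading
    re.MULTILINE,
)


def split_sections(text):
    """Return (numeral, title, content) tuples, one per numbered section."""
    sections = []
    for m in SECTION_RE.finditer(text):
        numeral = m.group(1).strip().rstrip(".")
        sections.append((numeral, m.group(2).strip(), m.group(3).strip()))
    return sections


# Hypothetical sample text shaped like the pdftotext output in the question.
sample = (
    "I. Breve Síntese da Demanda\n"
    "Lorem ipsum dolor sit amet.\n"
    "Vivamus faucibus.\n"
    "\n"
    "II. Do Mérito\n"
    "Lorem ipsum commodo.\n"
    "\n"
    "III. Conclusão\n"
    "Lorem ipsum.\n"
)

for numeral, title, content in split_sections(sample):
    print(f"{numeral} | {title} | {content!r}")
```

Like the original, this requires the dot after the numeral, and a content line that itself starts with something like "I." or "C." would be taken as a new heading.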

edited Dec 26 '22 at 14:29

answered Dec 26 '22 at 14:03

bobble bubble


Np sir, take some rest we all need it cheers – RavinderSingh13 Jan 12 '23 at 00:38
