regular expression: match anything between specific pattern

Question

I am trying to come up with a regular expression that matches the pattern by which the articles in a text file I have are arranged. (Note: "|" indicates a paragraph mark/line break, whereas "." indicates non-word characters.) Here is the pattern:

| 
...........................Dokument.1.von.55|
| 
|
|
..........................Some newspaper| 
| 
..........................Freitag 08. Mai 2015 
|
sometextsometextsometextsometextsometextsometextsometextsometextsometextsometext
sometextsometextsometextsometextsometextsometextsometextsometextsometextsometext
(etc..)
|
METAINFO1: IWOIOWIEOWEIWOEIWEO
| 
(etc... possibly more metainfo all capitalized) 
|
| 
.........................Copyright 2015 some publisher notes 
.........................at most one more single line containing copyright information
.........................Alle Rechte vorbehalten| 
# note: last line alternatively: All Rights Reserved 


|
(next pattern i.e. article)

(I had to anonymize it for copyright purposes)

I have created the following regular expression for extracting single articles:

match beginning of the line followed by a line break: ^[\r\n]
match the line containing "Dokument...." preceded by non-word characters: [\W]+Dokument \d{1,} von \d{1,}
match any number of line breaks: [\r\n]+
match any word and non-word characters (i.e. the article's text): [\w\W]+
match a final newline character (last line before the next pattern starts): [r\n]
match any non-word characters and the string "Alle Rechte vorbehalten" or "All Rights Reserved": [\W]+(Alle Rechte vorbehalten|All Rights Reserved)
match end of the line (final line): $

I have tested it with TextPad. When I do a backward search with the RE, it matches any single article (as needed). But when I do a forward search, it matches the whole document.

At first I thought it matched every article, which then looked as if it matched everything. But then I tried the replace option, with the result that my test term was replaced only once.

So the RE does not do its job. I have been working on this for some time now but cannot find my mistake.

What am I doing wrong? Is there an error in my RE?

I intend to match the articles, turn the working RE into a capturing group and then replace it with some xml. But I am stuck here.

Cheers, Andrew

What about "splitting" the text with the "Dokument.1.von.55" pattern: Dokument \d{1,} von \d{1,}[\d\D]*?(?=Dokument \d{1,} von \d{1,}) That way you don't have to match the copyright properly. – mameluc, Jun 05 '15 at 10:24
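A minimal sketch of the splitting idea from this comment, written in Python. The sample text and the names SAMPLE, HEADER and chunks are hypothetical. The `|\Z` alternative in the lookahead is an addition that is not in the comment: without it, the last article has no following "Dokument" header and would not be matched.

```python
import re

# Hypothetical sample imitating two articles; real files will differ.
SAMPLE = (
    "   Dokument 1 von 2\n"
    "\n"
    "body one\n"
    "\n"
    "   Dokument 2 von 2\n"
    "\n"
    "body two\n"
)

HEADER = r"Dokument \d+ von \d+"

# Lazily consume everything after a header until the next header is ahead.
# Assumption: "|\Z" is added so the final article (no header after it) is kept.
chunk_re = re.compile(HEADER + r"[\d\D]*?(?=" + HEADER + r"|\Z)")

chunks = chunk_re.findall(SAMPLE)
for chunk in chunks:
    # Show only the header line of each extracted article.
    print(chunk.splitlines()[0])
```

Each chunk starts at its own "Dokument" header and stops just before the next one, so the copyright block never has to be described in the pattern.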

score 1 · Answer 1 · answered Jun 05 '15 at 10:16


The trick is making the part that matches the body of the article non-greedy and having very clearly defined start and end matches for articles.

import re
article_re = re.compile(r'^\n\W+Dokument.+?\n\W+Copyright[^\n]+\n(?:[^\n]+\n)?', flags=re.S | re.M)  # re.M: ^ matches at every line start

Just to re-iterate the assumptions:

Starts with a newline, followed by a line with non-word characters followed by "Dokument"
Contains a body full of any characters.
Ends with a newline, followed by a line with non-word characters followed by "Copyright" followed by more characters and a newline.
Can optionally contain one more line of characters followed by a newline.
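The effect of the non-greedy body can be checked with a small sketch. The sample text and the names SAMPLE, lazy_matches and greedy_matches are made up for illustration. One assumption goes beyond the assumptions listed above: re.M is passed so that ^ can match at the start of every article rather than only at the start of the string.

```python
import re

# Hypothetical sample with two articles in the layout from the question.
SAMPLE = (
    "\n"
    "      Dokument 1 von 2\n"
    "\n"
    "      Some newspaper\n"
    "\n"
    "first article text\n"
    "\n"
    "METAINFO1: ABC\n"
    "\n"
    "      Copyright 2015 some publisher\n"
    "      Alle Rechte vorbehalten\n"
    "\n"
    "      Dokument 2 von 2\n"
    "\n"
    "      Another newspaper\n"
    "\n"
    "second article text\n"
    "\n"
    "      Copyright 2015 another publisher\n"
    "      All Rights Reserved\n"
)

FLAGS = re.S | re.M  # re.S: "." spans newlines; re.M (assumed): "^" per line

# Non-greedy body (.+?): stops at the first copyright block after the header.
lazy_re = re.compile(
    r'^\n\W+Dokument.+?\n\W+Copyright[^\n]+\n(?:[^\n]+\n)?', FLAGS)

# Greedy body (.+): runs on to the last copyright block in the text.
greedy_re = re.compile(
    r'^\n\W+Dokument.+\n\W+Copyright[^\n]+\n(?:[^\n]+\n)?', FLAGS)

lazy_matches = lazy_re.findall(SAMPLE)
greedy_matches = greedy_re.findall(SAMPLE)

print(len(lazy_matches))    # 2: one match per article
print(len(greedy_matches))  # 1: both articles swallowed by a single match
```

The greedy variant reproduces the symptom from the question: a single match that spans the whole document.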

Brendan Abel

It actually ends with 2 newlines, followed by a line with non-word characters followed by "Copyright", more characters, and a newline. In addition to the copyright line, it can contain 1, at most 2, lines that start with non-word characters and contain some characters. Whitespace characters are to be expected in both the copyright line and the lines accompanying it. Does your last bullet point actually cover the 1, at most 2, additional lines I just mentioned? – Andrew Tobey Jun 05 '15 at 10:24
