I am trying to come up with a regular expression that matches the specific pattern by which the articles in a text file of mine are arranged. (Note: "|" indicates a paragraph mark/line break, whereas "." indicates a non-word character.) Here is the pattern:

| 
...........................Dokument.1.von.55|
| 
|
|
..........................Some newspaper| 
| 
..........................Freitag 08. Mai 2015 
|
sometextsometextsometextsometextsometextsometextsometextsometextsometextsometext
sometextsometextsometextsometextsometextsometextsometextsometextsometextsometext
(etc..)
|
METAINFO1: IWOIOWIEOWEIWOEIWEO
| 
(etc... possibly more metainfo all capitalized) 
|
| 
.........................Copyright 2015 some publisher notes 
.........................at most one more single line containing copyright information
.........................Alle Rechte vorbehalten| 
# note: last line alternatively: All Rights Reserved 


|
(next pattern i.e. article) 

(I had to anonymize it for copyright purposes)

I have created the following regular expression for extracting single articles:

  1. match the beginning of a line followed by a line break: ^[\r\n]
  2. match the line containing "Dokument ..." preceded by non-word characters: [\W]+Dokument \d{1,} von \d{1,}
  3. match any number of line breaks: [\r\n]+
  4. match any word and non-word characters (i.e. the article's text): [\w\W]+
  5. match a final newline character (last line before the next pattern starts): [\r\n]
  6. match any non-word characters and the string "Alle Rechte vorbehalten" or "All Rights Reserved": [\W]+(Alle Rechte vorbehalten|All Rights Reserved)
  7. match the end of the line (final line): $

Hence, the whole RE is ^[\r\n][\W]+Dokument \d{1,} von \d{1,}[\r\n]+[\w\W]+[\r\n][\W]+(Alle Rechte vorbehalten|All Rights Reserved)$

I have tested it with Textpad. When I do a backwards search with the RE it matches any single article (as needed). But when I do a forward search it matches the whole document.

At first I thought it matched every article one after another, which merely looked as if it matched everything. But then I tried the replace option, and my test term was inserted only once, so the whole document really is a single match.

So the RE does not do its job. I have been working on this for some time now but cannot find my mistake.

What am I doing wrong? Is there an error in my RE?

I intend to match the articles, turn the working RE into a capturing group and then replace it with some xml. But I am stuck here.

Cheers, Andrew

Andrew Tobey
  • What about "splitting" the text with the "Dokument.1.von.55" pattern? Dokument \d{1,} von \d{1,}[\d\D]*?(?=Dokument \d{1,} von \d{1,}) That way you don't have to match the copyright properly – mameluc Jun 05 '15 at 10:24
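
The splitting idea from this comment can be sketched in Python as follows. This is only an illustration under assumptions: the sample text is made up, the names `HEADER`, `SPLIT_RE` and `SAMPLE` are hypothetical, and the `|\Z` alternative in the lookahead is an addition that is not in the comment. Without it, the last article would never match because no further "Dokument" header follows it.

```python
import re

# Header line of each article, as in the comment (\d+ is equivalent to \d{1,}).
HEADER = r'Dokument \d+ von \d+'

# Lazy body up to (but not including) the next header. The "|\Z" branch is an
# assumption added here so that the final article is matched as well.
SPLIT_RE = re.compile(HEADER + r'[\d\D]*?(?=' + HEADER + r'|\Z)')

# Made-up sample with two short articles.
SAMPLE = (
    "   Dokument 1 von 2\n"
    "first article body\n"
    "   Copyright 2015 some publisher\n"
    "\n"
    "   Dokument 2 von 2\n"
    "second article body\n"
    "   Copyright 2015 some publisher\n"
)

# No capturing groups are used, so findall returns the full matches.
chunks = SPLIT_RE.findall(SAMPLE)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk starts at its own "Dokument" header and runs up to the next one, so the copyright block does not have to be described at all.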

1 Answer

The trick is making the part that matches the body of the article non-greedy and having very clearly defined start and end matches for articles.

import re
article_re = re.compile(r'^\n\W+Dokument.+?\n\W+Copyright[^\n]+\n(?:[^\n]+\n)?', flags=re.M | re.S)  # re.M: ^ matches at every line start

Just to re-iterate the assumptions:

  • Starts with a newline, followed by a line with non-word characters followed by "Dokument"
  • Contains a body full of any characters.
  • Ends with a newline, followed by a line with non-word characters followed by "Copyright" followed by more characters and a newline.
  • Can optionally contain one more line of characters followed by a newline.
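
To see the effect of the non-greedy body, here is a minimal sketch that runs the pattern on a made-up two-article sample, next to a greedy variant for comparison. The sample text and the names `LAZY_RE`, `GREEDY_RE` and `SAMPLE` are hypothetical. It also assumes `re.M` is set in addition to `re.S`, so that `^` can match at the start of every line and not only at the start of the text.

```python
import re

# Lazy body (.+?). re.M is an assumption so that ^ matches at each line start;
# re.S lets . match newlines so the body can span several lines.
LAZY_RE = re.compile(
    r'^\n\W+Dokument.+?\n\W+Copyright[^\n]+\n(?:[^\n]+\n)?',
    flags=re.M | re.S,
)

# Same pattern with a greedy body (.+), for comparison only.
GREEDY_RE = re.compile(
    r'^\n\W+Dokument.+\n\W+Copyright[^\n]+\n(?:[^\n]+\n)?',
    flags=re.M | re.S,
)

# Made-up sample: two articles, each introduced by an empty line.
SAMPLE = (
    "\n"
    "   Dokument 1 von 2\n"
    "\n"
    "   Some newspaper\n"
    "\n"
    "first article body\n"
    "\n"
    "   Copyright 2015 some publisher\n"
    "   Alle Rechte vorbehalten\n"
    "\n"
    "   Dokument 2 von 2\n"
    "\n"
    "   Some newspaper\n"
    "\n"
    "second article body\n"
    "\n"
    "   Copyright 2015 some publisher\n"
    "   All Rights Reserved\n"
)

lazy_articles = [m.group(0) for m in LAZY_RE.finditer(SAMPLE)]
greedy_articles = [m.group(0) for m in GREEDY_RE.finditer(SAMPLE)]

print(len(lazy_articles))    # one match per article
print(len(greedy_articles))  # a single match spanning both articles
```

The greedy variant runs to the last "Copyright" line in the text, which is the same behaviour the question describes for `[\w\W]+` in a forward search.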
Brendan Abel
  • It actually ends with two newlines, followed by a line with non-word characters, followed by "Copyright", more characters and a newline. In addition to the copyright line, it can contain one, at most two, further lines that start with non-word characters and contain some characters. Whitespace characters are to be expected in both the copyright line and the lines that accompany it. Does your last bullet point actually cover the one or at most two additional lines I just mentioned? – Andrew Tobey Jun 05 '15 at 10:24
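
As a possible follow-up to this comment: the optional group `(?:[^\n]+\n)?` in the answer covers at most one line after the copyright line. One way to allow up to two is the bounded quantifier `{0,2}`. The sketch below is an assumption-laden illustration, not the answerer's code. The sample text and the names `ONE_EXTRA_RE`, `TWO_EXTRA_RE` and `SAMPLE` are hypothetical, `re.M` is added, and the line after the copyright block is assumed to be truly empty, since a line holding only spaces would also satisfy `[^\n]+`. The two newlines before "Copyright" need no change, because `\W+` also matches newlines.

```python
import re

# The answer's pattern: at most ONE extra line after the copyright line.
ONE_EXTRA_RE = re.compile(
    r'^\n\W+Dokument.+?\n\W+Copyright[^\n]+\n(?:[^\n]+\n)?',
    flags=re.M | re.S,
)

# Variant: up to TWO extra non-empty lines after the copyright line.
TWO_EXTRA_RE = re.compile(
    r'^\n\W+Dokument.+?\n\W+Copyright[^\n]+\n(?:[^\n]+\n){0,2}',
    flags=re.M | re.S,
)

# Made-up sample: two newlines before the copyright block, which has the
# copyright line plus two further lines, followed by an empty line.
SAMPLE = (
    "\n"
    "   Dokument 1 von 1\n"
    "\n"
    "article body\n"
    "\n"
    "\n"
    "   Copyright 2015 some publisher notes\n"
    "   one more copyright line\n"
    "   Alle Rechte vorbehalten\n"
    "\n"
    "next article starts here\n"
)

one_extra = ONE_EXTRA_RE.search(SAMPLE).group(0)
two_extra = TWO_EXTRA_RE.search(SAMPLE).group(0)

print(repr(one_extra[-30:]))  # tail of the match with the original pattern
print(repr(two_extra[-30:]))  # tail of the match with the {0,2} variant
```

On this sample the original pattern stops after the first extra line, while the `{0,2}` variant runs through the "Alle Rechte vorbehalten" line and stops at the empty line that follows.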