I am trying to come up with a regular expression that matches the pattern in which the articles in a text file of mine are arranged. (Note: "|" indicates a paragraph mark/line break, whereas "." indicates a non-word character.) Here is the pattern:
|
...........................Dokument.1.von.55|
|
|
|
..........................Some newspaper|
|
..........................Freitag 08. Mai 2015
|
sometextsometextsometextsometextsometextsometextsometextsometextsometextsometext
sometextsometextsometextsometextsometextsometextsometextsometextsometextsometext
(etc..)
|
METAINFO1: IWOIOWIEOWEIWOEIWEO
|
(etc... possibly more metainfo all capitalized)
|
|
.........................Copyright 2015 some publisher notes
.........................at most one more single line containing copyright information
.........................Alle Rechte vorbehalten|
# note: last line alternatively: All Rights Reserved
|
(next pattern i.e. article)
(I had to anonymize it for copyright purposes)
I have created the following regular expression for extracting single articles:
- match beginning of the line followed by a line break
^[\r\n]
- match the line containing "Dokument...." preceded by non-word characters
[\W]+Dokument \d{1,} von \d{1,}
- match any number of line breaks
[\r\n]+
- match any word and non-word characters (i.e. the article's text)
[\w\W]+
- match a final newline character (last line before the next pattern starts)
[\r\n]
- match any non-word characters and the string "Alle Rechte vorbehalten" or "All Rights Reserved"
[\W]+(Alle Rechte vorbehalten|All Rights Reserved)
- match end of the line (final line)
$
Hence, the whole RE is ^[\r\n][\W]+Dokument \d{1,} von \d{1,}[\r\n]+[\w\W]+[\r\n][\W]+(Alle Rechte vorbehalten|All Rights Reserved)$
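To make the behaviour reproducible outside Textpad, here is a minimal sketch in Python. The sample text is made up, spaces stand in for the "." filler, and `re.MULTILINE` is an assumption to make `^` and `$` match at line boundaries; Textpad's engine may differ in details. It compares the expression as written with a variant that uses the non-greedy `[\w\W]+?`, which is one possible explanation for a match that spans the whole document.

```python
import re

# Hypothetical two-article sample; spaces stand in for the "." filler.
SAMPLE = (
    "\n"
    "          Dokument 1 von 2\n"
    "\n"
    "          Some newspaper\n"
    "\n"
    "first article text\n"
    "\n"
    "          Copyright 2015 some publisher\n"
    "          Alle Rechte vorbehalten\n"
    "\n"
    "          Dokument 2 von 2\n"
    "\n"
    "          Another newspaper\n"
    "\n"
    "second article text\n"
    "\n"
    "          Copyright 2015 some publisher\n"
    "          All Rights Reserved\n"
)

HEAD = r"^[\r\n][\W]+Dokument \d{1,} von \d{1,}[\r\n]+"
TAIL = r"[\r\n][\W]+(Alle Rechte vorbehalten|All Rights Reserved)$"

# The expression from the question: [\w\W]+ is greedy, so it runs on
# to the LAST closing line in the text before backtracking.
GREEDY = re.compile(HEAD + r"[\w\W]+" + TAIL, re.MULTILINE)

# Same expression, but [\w\W]+? is lazy and stops at the FIRST closing line.
LAZY = re.compile(HEAD + r"[\w\W]+?" + TAIL, re.MULTILINE)

greedy_matches = [m.group(0) for m in GREEDY.finditer(SAMPLE)]
lazy_matches = [m.group(0) for m in LAZY.finditer(SAMPLE)]

print(len(greedy_matches))  # 1: a single match covering both articles
print(len(lazy_matches))    # 2: one match per article
```

On this sample the greedy version yields a single match that contains both "Dokument 1 von 2" and "Dokument 2 von 2", which is consistent with a forward search that appears to select the entire file.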
I have tested it with Textpad. When I search backwards, the RE matches each single article (as needed). But when I search forwards, it matches the whole document.
At first I thought it had matched every article separately, which merely looked as if it had matched everything. But then I tried the replace option, and the replacement was made only once.
So the RE does not do its job. I have been working on this for some time now but cannot find my mistake.
What am I doing wrong? Is there an error in my RE?
I intend to match the articles, turn the working RE into a capturing group, and then replace each match with some XML. But I am stuck here.
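For reference, the replacement step I have in mind could look roughly like the following Python sketch. Everything beyond the RE itself is an assumption for illustration: the `<article>` tag name, the helper `articles_to_xml`, the made-up sample text, the use of `re.MULTILINE`, and the non-greedy `[\w\W]+?` in place of `[\w\W]+` so that each match ends at the first closing line.

```python
import re
from xml.sax.saxutils import escape

# Whole expression wrapped in one capturing group; the alternation is
# made non-capturing so group 1 is the complete article.
ARTICLE = re.compile(
    r"(^[\r\n][\W]+Dokument \d{1,} von \d{1,}[\r\n]+"
    r"[\w\W]+?"  # lazy: stop at the first closing line
    r"[\r\n][\W]+(?:Alle Rechte vorbehalten|All Rights Reserved)$)",
    re.MULTILINE,
)


def articles_to_xml(text: str) -> str:
    """Wrap every matched article in a hypothetical <article> element."""
    def wrap(match: re.Match) -> str:
        # Escape &, < and > so the article text is safe inside XML.
        return "<article>" + escape(match.group(1).strip()) + "</article>"
    return ARTICLE.sub(wrap, text)


# Made-up sample with two articles; spaces stand in for the "." filler.
SAMPLE = (
    "\n   Dokument 1 von 2\n\nfirst text & more\n\n"
    "   Copyright 2015\n   Alle Rechte vorbehalten\n"
    "\n   Dokument 2 von 2\n\nsecond text\n\n"
    "   Copyright 2015\n   All Rights Reserved\n"
)

result = articles_to_xml(SAMPLE)
print(result)
```

A replacement function is used instead of a plain `\1` back-reference so that characters such as `&` in the article text can be escaped before they end up inside the XML.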
Cheers, Andrew