Substitution using regex with line breaks on a folder of text files

Question

I have hundreds of text files of OCRed law journals that I'm ultimately encoding in TEI-XML. I'm doing a lot of cleaning using regex. I've been doing this cleaning in Oxygen XML editor, which does a nice job for single find-and-replace substitutions, but I would like to use a script of some sort so that I can reuse a series of dozens of substitutions to deal with page headers, footnotes, common errors, and so on.

The substitutions I need to perform include line breaks. For example, I might have text like this:

<pb/>






                             - 6-
 II faut preparer l'opinion publique et 'habituer A considérer la felon
 dont les lois doivent étre faites.

that I wish to transform into this:

<pb n="6"/>
II faut preparer l'opinion publique et 'habituer A considérer la felon dont les lois doivent étre faites.
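
A minimal sketch in Python of the transformation above, assuming the whole page is held in a single string so that the regex can see the `\n` characters. The two patterns and the names `sample`, `number_page_break`, and `join_wrapped_lines` are illustrative assumptions, not part of any existing tool. The line-joining rule in particular is a guess that would need tuning against real OCR output.

```python
import re

# The OCR sample from above, held as ONE string so that "\n" is visible
# to the regex engine.
sample = (
    "<pb/>\n"
    "\n\n\n\n\n\n"
    "                             - 6-\n"
    " II faut preparer l'opinion publique et 'habituer A considérer la felon\n"
    " dont les lois doivent étre faites.\n"
)


def number_page_break(text):
    """Fold a '- 6-' style page header into the preceding <pb/> tag."""
    return re.sub(r'<pb/>\n+ *- ?(\d+) ?-[ ]*\n[ ]*', r'<pb n="\1"/>\n', text)


def join_wrapped_lines(text):
    """Join a line break plus indentation into one space, except after a tag."""
    return re.sub(r'(?<!>)\n[ ]*(?=\S)', ' ', text)


result = join_wrapped_lines(number_page_break(sample))
print(result)
```

The lookbehind `(?<!>)` keeps the line break that follows `<pb n="6"/>`, so only the wrapped prose lines are merged.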

As far as I can tell, line breaks pose a challenge for substitutions of this sort. See, for instance, the problem I had using gsub_dir, described in "R - find/replace line breaks using regex". The solution Wiktor Stribiżew proposed there worked for my narrow problem, but I don't see how it can be generalized. (The same seems to be true of the solution offered in "R Find and replace multiple scripts at once".)

For instance, alongside a viable list of substitutions like

gsub_dir(dir = "bslc", pattern = "(\\w)6 ", replacement = "\\1é ")
gsub_dir(dir = "bslc", pattern = "(\\w)6(\\w)", replacement = "\\1é\\2")
gsub_dir(dir = "bslc", pattern = "(\\w)6 ", replacement = "\\1é ")

unfortunately one cannot use

gsub_dir(dir = "bslc", pattern = "<pb/>\\n+ +- ?(\\d+) ?- ", replacement = "<pb n=\\1/>")

I've looked around for Python solutions too, without much luck. Some people use apps like FAR - Find and Replace, but, like Oxygen, they do not allow easy reuse of a list of substitutions over a folder of files.
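
One way the reusable list could look in Python is sketched below. It assumes each file is small enough to read whole, is UTF-8 encoded, and has a `.txt` extension. The names `SUBSTITUTIONS`, `clean_text`, and `clean_folder` are hypothetical. The page-break pattern follows the one suggested in the comments, with quotes added around the page number to match the desired output. Because each file is read into a single string, `\n` in a pattern can match the line breaks.

```python
import re
from pathlib import Path

# Ordered (pattern, replacement) pairs, applied top to bottom.
# Extend this list and reuse it across folders.
SUBSTITUTIONS = [
    (r"(\w)6 ", r"\1é "),
    (r"(\w)6(\w)", r"\1é\2"),
    (r"<pb/>\n+ +- ?(\d+) ?-", r'<pb n="\1"/>'),
]


def clean_text(text, substitutions=SUBSTITUTIONS):
    """Apply every substitution, in order, to one whole-file string."""
    for pattern, replacement in substitutions:
        text = re.sub(pattern, replacement, text)
    return text


def clean_folder(folder, glob="*.txt", encoding="utf-8"):
    """Rewrite matching files in place; return the names of changed files."""
    changed = []
    for path in sorted(Path(folder).glob(glob)):
        original = path.read_text(encoding=encoding)
        cleaned = clean_text(original)
        if cleaned != original:
            path.write_text(cleaned, encoding=encoding)
            changed.append(path.name)
    return changed
```

Calling `clean_folder("bslc")` would then run the whole list over every `.txt` file in that folder. Since the files are overwritten in place, it is worth running this on a copy of the folder first.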

Did you mean `\\d`? Your pattern is written wrongly. And you have a space at the end of the pattern, but you have no space after `- 6-`. Try `"\n+ +- ?(\\d+) ?-"`. See https://regex101.com/r/5eobdH/1 — Wiktor Stribiżew, Mar 25 '19 at 20:17
Yes, I meant `\\d`. But the digit is just an example. The problem I'm asking about is the line break. `\n` and `\\n` don't work. — Will Hanley, Mar 25 '19 at 20:26
With my second code snippet, it works, check [the answer](https://stackoverflow.com/a/55287161/3832970) here. — Wiktor Stribiżew, Mar 25 '19 at 20:27
Thank you Wiktor--you are right, the second solution you offered to my initial question works well. Very grateful. — Will Hanley, Mar 27 '19 at 14:19
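
A small Python illustration of why a pattern containing `\n` can fail even when the regex itself is sound. It assumes the tool hands the regex one line at a time, which is an assumption about how such folder-wide helpers may work and is not confirmed in this thread. The names `PATTERN`, `per_line_hits`, and `whole_text_hits` are illustrative.

```python
import re

PATTERN = re.compile(r"<pb/>\n+ +- ?(\d+) ?-")
text = "<pb/>\n\n\n      - 6-\n II faut preparer\n"

# Line by line: each piece ends at its own line break, so a pattern that
# spans several lines has nothing to match against.
per_line_hits = [
    bool(PATTERN.search(line)) for line in text.splitlines(keepends=True)
]

# Whole text: the same pattern sees the line breaks and can match.
whole_text_hits = PATTERN.findall(text)

print(per_line_hits)
print(whole_text_hits)
```

If that assumption holds, the fix is to read or join each file into a single string before substituting, rather than to change the pattern.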
