I have hundreds of text files of OCRed law journals that I'm ultimately encoding in TEI-XML. I'm doing a lot of cleaning using regex. So far I've done this cleaning in the Oxygen XML editor, which does a nice job for single find-and-replace substitutions, but I would like to use a script of some sort so that I can reuse a series of dozens of substitutions to deal with page headers, footnotes, common errors, and so on.
The substitutions I need to perform include line breaks. For example, I might have text like this:
<pb/>
- 6-
II faut preparer l'opinion publique et 'habituer A considérer la felon
dont les lois doivent étre faites.
that I wish to transform into this:
<pb n="6"/>
II faut preparer l'opinion publique et 'habituer A considérer la felon dont les lois doivent étre faites.
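To make the target concrete, here is a minimal Python sketch of that transformation, applied to a string already held in memory. The function names and both patterns are my own assumptions for illustration, not a tested general solution. In particular, the line-joining rule assumes that any line break not adjacent to a tag or a blank line is a soft wrap.

```python
import re

# The sample text from above, held as one string so patterns can span line breaks.
sample = (
    "<pb/>\n"
    "- 6-\n"
    "II faut preparer l'opinion publique et 'habituer A considérer la felon\n"
    "dont les lois doivent étre faites.\n"
)

def fix_page_break(text):
    """Fold a '- 6-' style page-number line into the preceding <pb/> tag."""
    return re.sub(r"<pb/>\n+ *- ?(\d+) ?- *\n", r'<pb n="\1"/>\n', text)

def join_wrapped_lines(text):
    """Replace a single line break with a space, unless the break follows '>'
    or precedes '<' (a tag boundary) or sits next to a blank line."""
    return re.sub(r"(?<=[^\n>])\n(?=[^\n<])", " ", text)

result = join_wrapped_lines(fix_page_break(sample))
print(result)
```

The point is only that once the whole file is one string, `\n` can be matched like any other character.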
As far as I can tell, line breaks pose a challenge for substitutions of this sort. See, for instance, this problem I had using gsub_dir: "R - find/replace line breaks using regex". The solution proposed by Wiktor Stribiżew worked for my narrow problem, but I don't see how it can be generalized. (The same seems to be true of the solution offered here: "R Find and replace multiple scripts at once".)
For instance, alongside a viable list of substitutions like
gsub_dir(dir = "bslc", pattern = "(\\w)6 ", replacement = "\\1é ")
gsub_dir(dir = "bslc", pattern = "(\\w)6(\\w)", replacement = "\\1é\\2")
unfortunately one cannot use
gsub_dir(dir = "bslc", pattern = "<pb/>\\n+ +- ?(\\d+) ?- ", replacement = "<pb n=\"\\1\"/>")
I've looked around for Python solutions too, without much luck. Some people use apps like FAR - Find and Replace, but, like Oxygen, they do not allow easy reuse of a list of substitutions over a folder of files.
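For reference, this is roughly the shape of script I have in mind, as a minimal Python sketch. The names `RULES`, `apply_rules`, and `clean_dir` are hypothetical, as is the `*.txt` glob. It assumes UTF-8 files small enough to read whole, and it overwrites files in place, so it should only be run on a copy of the folder.

```python
import re
from pathlib import Path

# Ordered (pattern, replacement) pairs. These are illustrative rules only.
RULES = [
    # Page-break rule first, so page numbers are consumed before the 6 -> é rules run.
    (re.compile(r"<pb/>\n+ *- ?(\d+) ?- *\n"), r'<pb n="\1"/>\n'),
    (re.compile(r"(\w)6 "), r"\1é "),
    (re.compile(r"(\w)6(\w)"), r"\1é\2"),
]

def apply_rules(text, rules=RULES):
    """Apply every substitution in order to the whole text at once."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

def clean_dir(directory, glob="*.txt", encoding="utf-8"):
    """Rewrite each matching file in place; return the names of changed files."""
    changed = []
    for path in sorted(Path(directory).glob(glob)):
        original = path.read_text(encoding=encoding)  # whole file as one string
        cleaned = apply_rules(original)
        if cleaned != original:
            path.write_text(cleaned, encoding=encoding)
            changed.append(path.name)
    return changed
```

Calling `clean_dir("bslc")` would then run the whole rule list over every `.txt` file in that folder. Because each file is read as a single string rather than line by line, patterns containing `\n` can match across line breaks.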