
I have hundreds of text files of OCRed law journals that I'm ultimately encoding in TEI-XML, and I'm doing a lot of the cleaning with regex. So far I've used the Oxygen XML editor, which does a nice job for single find-and-replace substitutions, but I would like a script of some sort so that I can reuse a series of dozens of substitutions to deal with page headers, footnotes, common errors, and so on.

The substitutions I need to perform include line breaks. For example, I might have text like this:

<pb/>






                             - 6-
 II faut preparer l'opinion publique et 'habituer A considérer la felon
 dont les lois doivent étre faites.

that I wish to transform into this:

<pb n="6"/>
II faut preparer l'opinion publique et 'habituer A considérer la felon dont les lois doivent étre faites.

As far as I can tell, line breaks pose a challenge in substitution of this sort. See, for instance, this problem I had using gsub_dir: "R - find/replace line breaks using regex". The solution proposed by Wiktor Stribiżew worked for my narrow problem, but I don't see how it can be generalized. (The same seems to be true of the solution offered in "R Find and replace multiple scripts at once".)

For instance, alongside a viable list of substitutions like

gsub_dir(dir = "bslc", pattern = "(\\w)6 ", replacement = "\\1é ")
gsub_dir(dir = "bslc", pattern = "(\\w)6(\\w)", replacement = "\\1é\\2")
gsub_dir(dir = "bslc", pattern = "(\\w)6 ", replacement = "\\1é ")

unfortunately one cannot use

gsub_dir(dir = "bslc", pattern = "<pb/>\\n+ +- ?(\\d+) ?- ", replacement = "<pb n=\"\\1\"/>")

I've looked around for Python solutions too, without much luck. Some people use apps like FAR - Find and Replace, but like Oxygen they do not allow easy reuse of a list of substitutions over a folder of files.
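
For illustration, here is a minimal Python sketch of the kind of reusable script described above. It is an assumption-laden example, not a tested solution for the whole corpus: it assumes the files are UTF-8 `.txt` files small enough to read whole, which is what lets a pattern span line breaks. The names `RULES`, `apply_rules` and `clean_dir` are hypothetical, and the line-joining rule in particular is a guess at what the example output requires.

```python
import re
from pathlib import Path

# Ordered (compiled pattern, replacement) pairs. Order matters: the page-header
# rule runs before the line-joining rule. All rules here are illustrative.
RULES = [
    # "<pb/>", blank lines, then an indented "- 6-" style page number.
    # Also swallows the indentation of the first text line that follows.
    (re.compile(r"<pb/>\s*-\s?(\d+)\s?-[ \t]*\n[ \t]*"), r'<pb n="\1"/>\n'),
    # The two OCR fixes from the question: "6" misread for "é".
    (re.compile(r"(\w)6(\w)"), r"\1é\2"),
    (re.compile(r"(\w)6 "), r"\1é "),
    # Assumption: a line break followed by an indented line is a soft wrap,
    # so the two lines are joined with a single space.
    (re.compile(r"(?<=\S)[ \t]*\n[ \t]+(?=\S)"), " "),
]


def apply_rules(text, rules=RULES):
    """Apply each (pattern, replacement) pair, in order, to one string."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


def clean_dir(directory, rules=RULES, glob="*.txt", encoding="utf-8"):
    """Rewrite every matching file in place; return the paths that changed."""
    changed = []
    for path in sorted(Path(directory).glob(glob)):
        original = path.read_text(encoding=encoding)  # whole file, newlines included
        cleaned = apply_rules(original, rules)
        if cleaned != original:
            path.write_text(cleaned, encoding=encoding)
            changed.append(path)
    return changed
```

With this in place, `clean_dir("bslc")` would run the whole rule list over the folder, and new substitutions only need to be appended to `RULES`. Because the files are rewritten in place, it would be prudent to try it on a copy of the folder first.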

Will Hanley