I'm trying to clean a bunch of .txt files in a folder using regex. I can't seem to get R to find line breaks.

This is the code I'm using. It works for character substitution, but not for line breaks.

gsub_dir(dir = "folder_name", pattern = "\\n", replacement = "#")

I've also tried \r and various other permutations. Using a plain text editor I find all the line breaks with \n.

Will Hanley
  • Actually I think you would need `"\\\n"` but it's hard to test. – NelsonGon Mar 21 '19 at 15:59
  • Like this maybe(I haven't used `cat`). `test<-paste("This is a \n","test") test gsub("\\\n","",test)`. Although in this case using `"\\n"` might not make a difference. – NelsonGon Mar 21 '19 at 16:01
  • `fortunes::fortune(365)` *When in doubt, keep adding slashes until it works.* – Gregor Thomas Mar 21 '19 at 16:02
  • You also might see a significant speed up if you use the `fixed = TRUE` argument. You don't actually need *regex*, you're only looking for exact matches. – Gregor Thomas Mar 21 '19 at 16:04
  • `"\\\n"` did not work; you are right that I don't need _regex_ for this example but I do need _regex_ + line break for the project. – Will Hanley Mar 25 '19 at 18:44

1 Answer

You can't do that with xfun::gsub_dir.

Have a look at the source code:

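  • The files are read in with `read_utf8`, which basically executes `x = readLines(con, encoding = 'UTF-8', warn = FALSE)`. `readLines` returns one string per line, with the line terminators stripped,
  • Then `gsub` is applied to each of these lines separately, so the pattern never sees a line break, and when all replacements are done,
  • The `write_utf8` function writes the lines back, joined with the LF (newline) character.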
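
The effect of reading a file line by line can be checked on a temporary file. This is a minimal sketch using base R only; the file contents and the variable names (`tmp`, `has_newline`, `unchanged`, `joined`) are made up for illustration and are not part of `xfun`.

```r
# Write a small two-line file to a temporary location (illustrative content).
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)

# readLines() returns one element per line, with the line terminators removed.
x <- readLines(tmp, encoding = "UTF-8", warn = FALSE)
has_newline <- grepl("\n", x, fixed = TRUE)

# A per-line gsub() therefore finds nothing to replace.
unchanged <- gsub("\\n", "#", x)

# Collapsing the lines into one string first makes the break visible to gsub().
joined <- gsub("\\n", "#", paste(x, collapse = "\n"))

# Clean up the temporary file.
unlink(tmp)
```

Here `has_newline` is `FALSE` for both elements, which is why a newline pattern passed to a line-by-line replacement never matches.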

You need a custom function for that. Here is a "quick and dirty" one that replaces all LF symbols with #:

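# Rewrite every file under `dir`, joining its lines with `newline` instead of LF.
lbr_change_gsub_dir = function(newline = '\n', encoding = 'UTF-8', dir = '.', recursive = TRUE) {
  files = list.files(dir, full.names = TRUE, recursive = recursive)
  for (f in files) {
    # readLines() drops the line terminators, so x holds one string per line
    x = readLines(f, encoding = encoding, warn = FALSE)
    # Write the lines back in place, separated by `newline` (this overwrites f)
    cat(x, sep = newline, file = f)
  }
}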

folder <- "C:\\MyFolder\\Here"
lbr_change_gsub_dir(newline="#", dir=folder)

If you want to be able to match multiline patterns, `paste` the lines, collapsing them with the newline, and then use any pattern you like:

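# Apply a (possibly multiline) regex replacement to every file under `dir`.
lbr_gsub_dir = function(pattern, replacement, perl = TRUE, newline = '\n', encoding = 'UTF-8', dir = '.', recursive = TRUE) {
  files = list.files(dir, full.names = TRUE, recursive = recursive)
  for (f in files) {
    x <- readLines(f, encoding = encoding, warn = FALSE)
    # Collapse the lines into a single string so the pattern can span line breaks
    x <- paste(x, collapse = newline)
    x <- gsub(pattern, replacement, x, perl = perl)
    # Overwrite the file in place; no trailing newline is added
    cat(x, file = f)
  }
}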

folder <- "C:\\1"
lbr_gsub_dir("(?m)\\d+\\R(.+)", "\\1", dir = folder)

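This removes each run of digits together with the line break after it whenever a non-empty line follows, so digit-only lines are dropped and the lines that follow them are kept.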
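
To see what the pattern does before running it over a folder, it can be tried on a single in-memory string. This is a small sketch with made-up sample text; only base R `gsub` with `perl = TRUE` is used, and the names `x` and `result` are illustrative.

```r
# Made-up sample text: digit-only lines alternating with text lines.
x <- paste(c("12", "first paragraph", "345", "second paragraph"), collapse = "\n")

# Same pattern and replacement as in the call above.
# \R matches any line-break sequence when perl = TRUE.
result <- gsub("(?m)\\d+\\R(.+)", "\\1", x, perl = TRUE)

# Print the result with its line breaks rendered.
cat(result, "\n")
```

Testing on a string first is a cheap safeguard, since `lbr_gsub_dir` overwrites the files in place.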

Wiktor Stribiżew
  • Thank you -- this works in answer to my narrow question. I still can't figure out my broader problem, which is how to use regex including line breaks over a folder of text files. I will post a new question about that. – Will Hanley Mar 25 '19 at 18:49
  • @WillHanley Please note that all you need is to `paste` the lines. See the updated answer. – Wiktor Stribiżew Mar 25 '19 at 19:37
  • I am still unsure how to do what I want to do--posted a question that I hope is clearer: https://stackoverflow.com/questions/55345453/substitution-using-regex-with-line-breaks-on-a-folder-of-text-files – Will Hanley Mar 25 '19 at 19:55