I have about 700,000 files in a folder in which I need to find and replace multiple strings with other strings (all are 4-character codes). A given string may or may not be present in a file. I'm trying to use gsub but I can't work out how to do it with regular expressions. Can someone tell me a good and efficient way to handle this task?

This is the code I've used so far. It worked well with only one y <- gsub(...) call, but it doesn't work for my purpose, obviously because every gsub call starts again from x, so only the last one is taken into account when defining the y variable...

chm_files <- list.files(getwd(), pattern=("^[[:digit:]]*.chm$"), full.names=F)

for(chm_file in chm_files) {
  x <- readLines(chm_file)
  y <- gsub("AG02|AG07|AG05|AG18|AG19|AG08|AG09|AG17", "AGRL", x)
  y <- gsub("SB28|SB42|SB43|SB33|SB41|SB34|SB39|SB35", "SWHT", x)
  y <- gsub("WB28|WB42|WB43|WB32|WB09|WB33|WB41|WB26", "BARL", x)
  y <- gsub("WW02|WW25|WW08|WW31|WW05|WW28|WW19|WW42", "WWHT", x)
  cat(y, file=chm_file, sep="\n")
}
DirtStats
Marc
  • What platform are you on? Why not use a [shell script](http://www.cyberciti.biz/faq/unix-linux-replace-string-words-in-many-files/)? You don't have to use R for everything – rawr Jan 31 '15 at 15:20
  • I use Win 8.1... I know nearly nothing about shell scripts. This task is only a tiny part of the code I have to use to analyze my data and I do everything with R. Maybe a shell script can be integrated in my code, I don't know.. will check, thanks for the idea.. – Marc Jan 31 '15 at 15:28
  • If you always assigned back to `x` rather than `y` you would not lose the earlier corrections. – IRTFM Jan 31 '15 at 15:51
  • There are several free text editors with the ability to do "batch" editing of files (under Windows). That's probably cleaner, faster, and easier than coding it up in `R` – Carl Witthoft Jan 31 '15 at 15:55
  • Actually I will do it many times and it's part of other tasks in R, so I prefer to code it once and for all and let it run to get the results without any intervention.. – Marc Jan 31 '15 at 16:08
  • I found a *.CHM file extension in your R code sample. CHM files are compiled and binary. Are these files really Compiled Help Modules (CHM)? – help-info.de Feb 11 '15 at 06:33
  • No, they are ASCII files containing parameters for a model. – Marc Feb 13 '15 at 14:35
  • If you ever have to look at it again, I would also make it easier to work with: `gsub("AG(02|05|07|08|09|17|18|19)", "AGRL", x)` or `gsub("AG(0[257-9]|1[7-9])", "AGRL", x)` or `gsub("AG(0[25]|[0-1][7-9])", "AGRL", x)`, whatever structure makes sense for the context. – ARobertson Feb 25 '15 at 07:59
  • Good point, thank you! It will simplify my code a bit! – Marc Feb 26 '15 at 20:26
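
The fix IRTFM describes in the comments, feeding each `gsub()` result into the next call instead of always starting again from `x`, can be sketched as below. The names `code_map`, `replace_codes()` and `update_files()` are made up for this illustration; the patterns are copied from the question, and the only change to the file filter is an escaped dot.

```r
# Hypothetical lookup: names are the regex alternations from the question,
# values are the 4-character replacement codes.
code_map <- c(
  "AG02|AG07|AG05|AG18|AG19|AG08|AG09|AG17" = "AGRL",
  "SB28|SB42|SB43|SB33|SB41|SB34|SB39|SB35" = "SWHT",
  "WB28|WB42|WB43|WB32|WB09|WB33|WB41|WB26" = "BARL",
  "WW02|WW25|WW08|WW31|WW05|WW28|WW19|WW42" = "WWHT"
)

# Apply every pattern in turn, each time to the result of the previous step.
replace_codes <- function(lines, map = code_map) {
  for (pat in names(map)) {
    lines <- gsub(pat, map[[pat]], lines)
  }
  lines
}

# Process all matching files in a directory (assumed helper, not from the post).
update_files <- function(dir = getwd()) {
  chm_files <- list.files(dir, pattern = "^[[:digit:]]*\\.chm$",
                          full.names = TRUE)
  for (chm_file in chm_files) {
    x <- readLines(chm_file)
    y <- replace_codes(x)
    # Rewrite the file only when something changed, to avoid needless writes.
    if (!identical(x, y)) {
      writeLines(y, chm_file)
    }
  }
  invisible(length(chm_files))
}
```

Calling `update_files()` with no argument would work on the current directory, as the loop in the question does. Skipping unchanged files is a guess at what helps with 700,000 files; whether it matters in practice would need timing.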

2 Answers

I am sure there are already numerous pre-built functions for this task in various R packages, but I cooked this one up anyway for myself and others to use or modify. Apart from the task requested above, the function, multi_replace, also prints a tracking log with the count of all changes made across files.
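
The definition of `multi_replace` does not appear in this text. A minimal sketch of a function with the same calling convention, reconstructed only from the description and so not the answer author's actual code, might look like this. Treating the search strings literally (`fixed = TRUE`) and returning the counts as a matrix are choices of this sketch.

```r
# Sketch only: NOT the answer author's implementation.
# files: character vector of file paths
# f:     strings to search for (treated literally, not as regex)
# r:     replacements, same length and order as f
multi_replace <- function(files, f, r) {
  stopifnot(length(f) == length(r))
  # One row per file, one column per search string, holding match counts.
  tracking <- matrix(0L, nrow = length(files), ncol = length(f),
                     dimnames = list(files, f))
  for (i in seq_along(files)) {
    original <- readLines(files[i], warn = FALSE)
    lines <- original
    for (j in seq_along(f)) {
      hits <- gregexpr(f[j], lines, fixed = TRUE)
      # gregexpr() returns -1 for lines without a match, so count positives.
      tracking[i, j] <- sum(vapply(hits, function(h) sum(h > 0), integer(1)))
      # Each replacement works on the output of the previous one.
      lines <- gsub(f[j], r[j], lines, fixed = TRUE)
    }
    # Leave untouched files alone.
    if (!identical(lines, original)) writeLines(lines, files[i])
  }
  tracking
}
```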

Here is some example code showing how to run it:

# local directory with files you want to work with
setwd("C:/Users/DW/Desktop/New folder")
# get a list of files based on a pattern of interest, e.g. .html, .txt, .php
# (the pattern is a regular expression, so the dot is escaped)
filer = list.files(pattern="\\.php$")
# f - vector of original string values you want to change
f <- c("localhost","dbtest","root","oldpassword")
# r - vector of values to replace the above values with
# make sure the order of f and r matches (f[i] is replaced by r[i])
r <- c("newhost", "newdb", "newroot", "newpassword")

# Run the function and watch all your changes take place ;)
tracking_sheet <- multi_replace(filer, f, r)
tracking_sheet
Matthew Bayly

setwd("D:/R Training Material Kathmandu/File renaming procedures")
filer = list.files(pattern="2016")
f <- c("DATA,","$")
r <- c("","")
tracking_sheet <- multi_replace(filer, f, r)
tracking_sheet

I used the script above, but it failed to replace the $ sign in any of the files.
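
A likely cause, assuming `multi_replace` hands each search string to `gsub()` as a regular expression (its source is not shown here, so this is a guess): in a regex, `$` is the end-of-line anchor rather than a literal dollar sign. A small illustration with plain `gsub()` on a made-up string:

```r
x <- "DATA,price$100"

# As a regex, "$" matches the empty position at the end of the string,
# so nothing visible is removed.
as_regex <- gsub("$", "", x)
as_regex   # "DATA,price$100"

# Escaping the metacharacter makes it a literal dollar sign.
escaped <- gsub("\\$", "", x)
escaped    # "DATA,price100"

# fixed = TRUE turns off regex interpretation entirely.
literal <- gsub("$", "", x, fixed = TRUE)
literal    # "DATA,price100"
```

If that assumption holds, passing `"\\$"` instead of `"$"` in `f` would be one workaround.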