Update to the problem (to admins: if this doesn't deserve a separate answer, please merge it with the original one). The reason mgsub ran so fast compared to a simple for loop was that in mgsub the parameter fixed is TRUE by default, while in gsub it is FALSE by default! I just discovered it.
I'd like to clarify again that fixed=TRUE is not appropriate for me, as I do not want to replace caps inside capsule, but only the whole word caps. That is, I am forced to paste \\b onto both ends of each pattern. Here are three snippets from my code (I tested fixed=TRUE in gsub just to see the time difference; I am not going to use it).
# This is with mgsub. Now with fixed = FALSE!
i = mgsub(paste("\\b",orig,"\\b",sep=""),change,i,fixed=FALSE)
# This is with a for loop. The commented-out fixed=TRUE line is for timing tests only; do not use it.
for(k in seq_along(orig)) {
  i = gsub(paste("\\b",orig[k],"\\b",sep=""),change[k],i)
  #i = gsub(orig[k],change[k],i,fixed=TRUE)
}
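To illustrate why fixed=TRUE is not an option here, a minimal sketch using base R only. The sample string and the replacement word are made up for illustration and are not taken from my real data:

```r
# Hypothetical sample data, not taken from the real input
x <- "caps and capsule"

# Literal matching: every occurrence of "caps" is replaced, even inside "capsule"
literal <- gsub("caps", "hats", x, fixed = TRUE)

# Regex matching with word boundaries: only the standalone word is replaced
whole_word <- gsub("\\bcaps\\b", "hats", x)

print(literal)     # [1] "hats and hatsule"
print(whole_word)  # [1] "hats and capsule"
```

The second call leaves capsule intact because there is no word boundary between caps and the following u.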
Here are the times and memory usage (M) for all three cases for different input sizes N:
N    | mgsub, fixed=FALSE | gsub, fixed=FALSE | gsub, fixed=TRUE
------------------------------------------------------------------
100k | 41sec, M > 2.3GB   | 37sec, M > 0.9GB  | 9sec, M > 0.8GB
200k | 99sec, M > 4GB     | 74sec, M > 1.1GB  | 18sec, M > 1.3GB
300k | 132sec, M > 5.6GB  | 112sec, M > 2.6GB | 28sec, M > 1.6GB
     | + disk involved    |                   |
Thus, I conclude that for my application, where fixed must be FALSE, there's no advantage in using mgsub. In fact, the for loop is faster and does not cause memory overflow!
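For anyone who wants to rerun this kind of comparison, here is a rough sketch of a timing harness for the two for-loop variants. It uses base R only; the helper name replace_loop and the test vectors are hypothetical, and the timings will of course depend on the machine and on the data:

```r
# Hypothetical helper: apply each replacement in turn, as in the for loop above
replace_loop <- function(x, orig, change, fixed = FALSE) {
  for (k in seq_along(orig)) {
    # With fixed = FALSE, wrap the pattern in word boundaries
    pattern <- if (fixed) orig[k] else paste0("\\b", orig[k], "\\b")
    x <- gsub(pattern, change[k], x, fixed = fixed)
  }
  x
}

# Made-up test data; substitute the real vectors here
orig   <- c("caps", "tab")
change <- c("capsules", "tablet")
x      <- rep("caps and tab in a capsule on a table", 1000)

# system.time() reports elapsed seconds for each variant
t_regex <- system.time(res_regex <- replace_loop(x, orig, change))
t_fixed <- system.time(res_fixed <- replace_loop(x, orig, change, fixed = TRUE))

print(t_regex["elapsed"])
print(t_fixed["elapsed"])
```

Note that the two variants do not produce the same text: the fixed=TRUE run also rewrites caps inside capsule and tab inside table, which is exactly why it is only useful as a speed reference.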
Thanks to all involved. I wish I could give the commenters credit, but I don't know how to do that in "Comments".