
I have a long character vector (e.g. "Hello World", etc.), 1.7M rows, and I need to substitute words in it using a map between two vectors, saving the result back into the same vector. Here's a simple example:

library(qdap)
line = c("one", "two one", "four phones")
e = c("one", "two")
r = c("ONE", "TWO")
line = mgsub(e,r,line)

Result:

[1] "ONE"  "TWO ONE" "four phONEs"

As you can see, each instance of e[j] in line gets substituted with r[j] and only r[j]. It works fine on a relatively small "line" and e->r vocabulary, but when I run it on length(line) = 1700000 and length(e) = 750, I hit the total allocated memory:

Reached total allocation of 7851Mb: see help(memory.size)

Any ideas how to avoid it?

Alexey Ferapontov
  • Perhaps one at a time: `for(i in seq_along(e)) line <- gsub(e[i], r[i], line, fixed = TRUE)`. This should work if elements of `e` are not substrings of elements of `r`. – G. Grothendieck Dec 08 '14 at 22:23
  • Thank you for the suggestion! I tried it before; in fact, that was my main method before I installed qdap and started using mgsub. The downside: it is very, very slow, as it implies an explicit loop over 750 vocabulary elements x 1.7M rows in line – Alexey Ferapontov Dec 08 '14 at 22:32
  • Did you use `fixed=TRUE`? That might speed it up a bit. – G. Grothendieck Dec 08 '14 at 22:43
  • Yup; for loops are generally going to be slow. You can get a slight speed increase out of the apply family of functions, but it's very slight and not directly applicable in this case. One nice middle ground would be doing it in chunks: instead of one vector of size N, read in a vector of size N/A, apply mgsub, write out, clear the memory, and repeat, shrinking the chunk size until you come under the memory limit (see the sketch after this comment thread). It's not elegant, but it's probably faster in runtime terms. – Oliver Keyes Dec 08 '14 at 23:06
  • Another thing to try would be to use `sub` instead of `gsub`, provided there can be at most one occurrence of any component of `e` in any component of `line`. Yet another possibility is to process `k` components of `line` at a time, where the integer `k` is appropriately chosen: `n <- length(line); g <- gl(ceiling(n/k), k, n); for(lv in levels(g)) { ok <- g == lv; line[ok] <- mgsub(e, r, line[ok]) }`. – G. Grothendieck Dec 09 '14 at 00:24
  • Yes, in the "for loop method" I did use fixed = TRUE. Still very slow (750 x 1.7e6). `sub` is not an option, as strings can contain more than one instance of e in them. What do you think about using ddply? Convert the "list" to a data frame, split, run a few times, clean up after each run, merge. I'm very new to ddply, so I'm not sure if it will work at all. Other options? The vector will be getting pretty big in memory, but the resulting output vector will still be those 1.7M rows – Alexey Ferapontov Dec 09 '14 at 02:30
  • Quick update: I just realized I cannot use `fixed = TRUE` in `gsub`, as the pattern can be part of a bigger word, which is not acceptable (e.g. `e = "caps"`, `line = "capsule"`: the "caps" in line would be replaced with the corresponding element of r). I must use `\\be\\b`. Is there a way around it, i.e. using fixed=T and matching whole words only? – Alexey Ferapontov Dec 09 '14 at 14:03
  • @AlexeyFerapontov your post inspired me to update the code for `mgsub` to use a `for` loop internally. This will make it quicker and less prone to memory problems. https://github.com/trinker/qdap/issues/201 – Tyler Rinker Dec 18 '14 at 03:38
  • Hi Tyler, thank you! Are these changes going to be "global", i.e. available to the whole R community, or is it your private code? It still seems that for my application `gsub` with `fixed=T` will run a bit faster, but now we have an alternative for when speed is less important and compact vectorized code is more important – Alexey Ferapontov Dec 18 '14 at 13:35
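
A minimal sketch of the chunked approach suggested in the comments above, assuming line, e, and r as defined in the question; chunk_size is an assumed value, to be tuned downwards if memory is still tight:

## Chunked mgsub: process the big vector in slices so only one slice
## is expanded in memory at a time. chunk_size is an assumption to tune.
library(qdap)

chunk_size <- 100000
idx <- split(seq_along(line), ceiling(seq_along(line) / chunk_size))

for (ix in idx) {
  line[ix] <- mgsub(e, r, line[ix])  # substitute within one slice, overwrite in place
  gc()                               # release memory before the next slice
}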

3 Answers


The stringi package provides fast, consistent tools for string manipulation:

library(stringi)
stri_replace_all_regex(line, paste0("\\b", e, "\\b"), r, vectorize_all = FALSE)

Darn near as fast (fractions of a second different) as the other method and more straightforward.
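
For instance, on the toy data from the question (a quick sketch; note how the `\\b` anchors leave "phones" untouched):

library(stringi)

line <- c("one", "two one", "four phones")
e <- c("one", "two")
r <- c("ONE", "TWO")

stri_replace_all_regex(line, paste0("\\b", e, "\\b"), r, vectorize_all = FALSE)
## [1] "ONE"  "TWO ONE"  "four phones"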

Tyler Rinker
  • Thanks, Tyler. I'll check it out tomorrow and compare it to my current method: `fixed=FALSE` with `perl=TRUE`. As it turns out, on my data `fixed=TRUE` is the same speed as `fixed=FALSE` with `perl=TRUE`, but `fixed=FALSE` allows for more flexibility. If stri_replace does it even faster I'd be very, very happy. In my application, I do 3.1 million string x pattern operations per second with no memory overflow (as there was with mgsub) – Alexey Ferapontov Mar 01 '15 at 22:30
  • The `stri_replace_all_regex` turned out to be 3.5 times slower than `gsub` with `fixed=FALSE, perl=TRUE` – Alexey Ferapontov Mar 02 '15 at 14:45
  • Yeah, I tried it on the example above and it was comparable. Keep **stringi** in mind though; it's some excellent work. – Tyler Rinker Mar 02 '15 at 16:13

Update to the problem (to admins: if this doesn't deserve a separate answer, please merge it with the original one). The reason mgsub ran so fast compared to a simple for loop is that in mgsub the parameter fixed is TRUE by default, while in gsub it is FALSE by default! I just discovered this. I'd like to clarify again that fixed=TRUE is not appropriate for me, as I do not want to replace caps in capsule, only the whole word caps, i.e. I am forced to paste \\bs onto the pattern. Here are snippets from my code covering all three cases (I tested fixed=TRUE in gsub just to see the time difference; I'm not going to use it).

# This is with mgsub. Now with fixed = FALSE!!
i = mgsub(paste("\\b", orig, "\\b", sep = ""), change, i, fixed = FALSE)

# This is with a for loop. fixed=TRUE in one of the lines is for test purposes only. Do not use.
for (k in seq_along(orig)) {
  i = gsub(paste("\\b", orig[k], "\\b", sep = ""), change[k], i)
  # i = gsub(orig[k], change[k], i, fixed = TRUE)
}

Here are the times and memory usage for all three cases at different input sizes:

N     | mgsub, fixed=F     | gsub, fixed=F      | gsub, fixed=T
----------------------------------------------------------------
100k  | 41sec,  M > 2.3GB  | 37sec,  M > 0.9GB  | 9sec,  M > 0.8GB
200k  | 99sec,  M > 4GB    | 74sec,  M > 1.1GB  | 18sec, M > 1.3GB
300k  | 132sec, M > 5.6GB  | 112sec, M > 2.6GB  | 28sec, M > 1.6GB
        (at 300k, disk swapping was also involved)

Thus, I conclude that for my application, where fixed must be FALSE, there is no advantage to using mgsub. In fact, the for loop is faster and does not cause a memory overflow!

Thanks to all involved. I wish I could give the commenters credit, but I don't know how to do that in "Comments".

Alexey Ferapontov
  • The paste operation inside of the `for` loop burns time. You can add the `\\b` to `orig` once before you begin (see the sketch after this comment), but I have also provided a way for you to use `fixed = TRUE` below. – Tyler Rinker Dec 09 '14 at 23:40
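
A minimal sketch of that suggestion, assuming `orig`, `change`, and `i` as in the answer above; the boundary-wrapped patterns are built once instead of on every iteration:

## Build the \b-wrapped patterns once, outside the loop,
## so paste() is not re-run for every vocabulary entry.
patterns <- paste0("\\b", orig, "\\b")

for (k in seq_along(patterns)) {
  i <- gsub(patterns[k], change[k], i)
}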

I believe you can use fixed = TRUE.

It sounds like you're concerned with word boundaries (spaces), so just add spaces to the ends of all three vectors you're working with. Running the whole sequence from ## Start to ## Finish (on roughly the size of your data) takes Time difference of 2.906395 secs on 1.7 million strings. The majority of the time is spent at the end, stripping off the extra spaces.

## Recreate data
line <- c("one", "two one", "four phones", "and a capsule", "But here's a caps key")
e <- c("one", "two", "caps")
r <- c("ONE", "TWO", "CAPS")

line <- rep(line, 1700000/length(line))

## Start    
line2 <- paste0(" ", line, " ")
e2 <-  paste0(" ", e, " ")
r2 <- paste0(" ", r, " ")


for (i in seq_along(e2)) {
    line2 <- gsub(e2[i], r2[i], line2, fixed=TRUE)
}

gsub("^\\s|\\s$", "", line2, perl=TRUE)
## Finish
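
As a quick sanity check (a sketch, assuming the ## Start block above has been run), the space-padded, fixed = TRUE replacement leaves embedded substrings such as "capsule" untouched while replacing the standalone word:

## Spot-check the distinct results: "capsule" stays lowercase,
## while the standalone "caps" becomes "CAPS".
unique(gsub("^\\s|\\s$", "", line2, perl = TRUE))
## [1] "ONE"  "TWO ONE"  "four phones"  "and a capsule"  "But here's a CAPS key"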

Here qdap's mgsub is not useful; the package was designed for much smaller data. Additionally, fixed = TRUE is a sensible default because it is so much faster. The point of an add-on package is to improve a workflow (sometimes field- or task-specific) through a reconfiguration of available tools. The mgsub function also has error handling and other niceties that are useful in the analysis of transcripts, and these make the function hog memory. There's often a trade-off between safety plus syntactic sugar on one hand and speed on the other.

Note that just because two functions are named in similar ways does not imply they behave the same, particularly if they are found in add-on packages. Even functions within base R have differently named and differently behaving defaults (look at the apply family of functions; this is less than ideal, but it is part of the historical evolution of R). It is incumbent upon you as a user to read the documentation rather than make assumptions.

Tyler Rinker
  • Thank you! I like your suggestion about adding spaces. In this case I can indeed use fixed=TRUE (I need to cross-check carefully). Point well taken about mgsub and its applicability, as well as the different defaults across functions. – Alexey Ferapontov Dec 10 '14 at 03:22
  • Tyler, what if I have "But here's the caps, tilde, and num lock key"? The `caps,` will not be changed to `CAPS,`, since the word is followed by a comma rather than a space, and a priori I don't know what will follow a given word (comma, full stop, etc.). Is there a way to remove the punctuation and then stitch it back? – Alexey Ferapontov Feb 05 '15 at 21:19
  • Can I suggest a new question with sample code? It's easier to test and wrap one's head around your data and desired outcome. Folks may have additional ideas since you first asked this. – Tyler Rinker Feb 05 '15 at 21:24
  • Won't it be considered a duplicate? – Alexey Ferapontov Feb 05 '15 at 21:27
  • No, you have a new parameter. Explain that in there. Before there was no punctuation, but there is now. – Tyler Rinker Feb 05 '15 at 21:35
  • Thanks! http://stackoverflow.com/questions/28354573/r-gsub-with-fixed-t-or-f-and-special-cases – Alexey Ferapontov Feb 05 '15 at 21:40