Microbenchmarking base R and three packages on string pattern substitution

Question

My question is whether my method and conclusion are correct.

As part of learning regular expressions, I wanted to figure out in which order to learn the various alternatives (base R and packages). I thought it might help to know the relative speeds of the alternative functions, so I created a string vector and called what I hope are equivalent expressions.

library(stringr)
library(stringi)
library(qdap)

sites <- c("http://grand.test.com/", "https://example.com/",
           "http://.big.time.bhfs.com/", "http://test.blogs.mvalaw.com/")
vec <- rep(x = sites, times = 1000) # creating a longish vector

base <- gsub("http:", "", vec, perl = TRUE)
stringr <- str_replace_all(vec, "http:", replacement = "")
stringi <- stri_replace_all_regex(str = vec, pattern = "http:", replacement = "")
qdap <- genX(text.var = vec, "http:", "")
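
Since the question hinges on the expressions being equivalent, a quick check along these lines might be worth running before any timing. This is only a sketch: the names `reference`, `results` and `same_as_base` are made up for illustration, the package calls are skipped when a package is not installed, and `genX` is left out because it is designed for a different task.

```r
# Same input as in the question
sites <- c("http://grand.test.com/", "https://example.com/",
           "http://.big.time.bhfs.com/", "http://test.blogs.mvalaw.com/")
vec <- rep(x = sites, times = 1000)

# The base R result serves as the reference (hypothetical variable names)
reference <- gsub("http:", "", vec, perl = TRUE)
results <- list(base = reference)

# Add package results only when the package is installed
if (requireNamespace("stringr", quietly = TRUE)) {
  results$stringr <- stringr::str_replace_all(vec, "http:", "")
}
if (requireNamespace("stringi", quietly = TRUE)) {
  results$stringi <- stringi::stri_replace_all_regex(vec, "http:", "")
}

# One logical per method: is its output identical to the base R output?
same_as_base <- vapply(results, identical, logical(1), y = reference)
print(same_as_base)
```

If any entry comes back FALSE, the timings would be comparing different work, so the benchmark would need rethinking first.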

Then I benchmarked the four methods using the microbenchmark package.

library(microbenchmark)

test <- microbenchmark(base <- gsub("http:", "", vec, perl = TRUE),
                      stringr <- str_replace_all(vec, "http:", replacement = ""),
                      stringi <- stri_replace_all_regex(str = vec, pattern = "http:", replacement = ""),
                      qdap <- genX(text.var = vec, "http:", ""),
                      times = 100)
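
As a side sketch, the expressions can also be passed to `microbenchmark` by name, which gives short labels in the `expr` column without editing the output afterwards. The version below is an assumption-laden variant, not the call used above: it builds the expressions with `alist()`, passes them through the `list` argument, leaves out the assignments, and only includes stringi when it is installed. The names `exprs` and `bench` are hypothetical.

```r
sites <- c("http://grand.test.com/", "https://example.com/",
           "http://.big.time.bhfs.com/", "http://test.blogs.mvalaw.com/")
vec <- rep(x = sites, times = 1000)

# Unevaluated, named expressions; the names become the labels in `expr`
exprs <- alist(base = gsub("http:", "", vec, perl = TRUE))
if (requireNamespace("stringi", quietly = TRUE)) {
  exprs$stringi <- quote(stringi::stri_replace_all_regex(vec, "http:", ""))
}

bench <- NULL  # stays NULL when microbenchmark is not installed
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  bench <- microbenchmark::microbenchmark(list = exprs, times = 100)
  # Keep only the label and the median for a compact comparison
  print(summary(bench)[, c("expr", "median")])
}
```

Dropping the assignments also keeps the cost of storing the result out of the measurement, although that cost is likely small here.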

Am I correct that base R's gsub is by far the fastest (I shortened the expr names)?

 expr           min         lq     median         uq        max neval
 base      1.697001   1.739393   1.765051   1.833770   2.976780   100
 stringr   3.814348   3.928360   3.979453   4.123138   7.032091   100
 stringi   5.888857   6.172212   6.276407   6.500412   7.634943   100
 qdap    120.670037 124.624946 127.493293 130.923663 173.155253   100

The median times differ widely: relative to base R's gsub, stringr is about 2.3 times slower, stringi about 3.6 times, and qdap about 72 times.
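
To put numbers on the gap, the medians reported above can be divided by the base R median. The sketch below just redoes that arithmetic on the posted figures; `medians` and `ratios` are illustrative names.

```r
# Median timings copied from the benchmark output in the question
medians <- c(base = 1.765051, stringr = 3.979453,
             stringi = 6.276407, qdap = 127.493293)

# How many times slower each method is than base R's gsub
ratios <- round(medians / medians[["base"]], 1)
print(ratios)
#    base stringr stringi    qdap
#     1.0     2.3     3.6    72.2
```

These ratios describe one machine and one input, so they are best read as a rough ordering rather than fixed constants.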

These functions each have their own purpose. But if you look at the help file for `str_replace_all` for example, you'll find that it links to, and wraps, `gsub`, meaning that it calls `gsub` at some point. In this case `gsub` is most likely the fastest because it calls a `.Internal` function, which consists of C code and R code. Functions that call `.Internal` or are `.Primitive` are generally the fastest because of the internal C code — Rich Scriven, Jul 20 '14 at 02:54
A couple side notes... `sub` is slightly faster than `gsub`, and it looks like it would be sufficient here. Note that "http:" does not match "https:". You could change the pattern to "https?:" to optionally include the s — GSee, Jul 20 '14 at 03:15
Also note that, in this case, using `fixed = TRUE` instead of `perl = TRUE` makes `gsub` even faster (by only 0.7 milliseconds on my machine, but still faster). — Rich Scriven, Jul 20 '14 at 03:27
@RichardScriven: As to your first comment, does that mean str_replace_all is a wrapper, that is, a function that relies fundamentally on a more basic function but adds some conveniences or consistencies for the user? Second comment: good point. That my vector consisted of single instances of URLs to search was an accident. You are right that sub would be faster and that the optional quantifier ? would help. Finally, regarding the third point, as I understand it, regex engines are slower than fixed searches. Question: how do I get the result system time and function time = elapsed time? — lawyeR, Jul 20 '14 at 10:57
I'm surprised that your use of `qdap` here actually worked. `genX` is meant to remove items between 2 markers (hence the `left` and `right` arguments). The function is meant to reduce programming time with regex, not computational time, and as a wrapper for base functions I'd expect it to be slower anyway. In any event, I would not use `genX` for a task that `gsub` (or the `stringi`/`stringr` packages) is much better suited for. — Tyler Rinker, Jul 20 '14 at 13:33
@TylerRinker: thank you for the explanation. I was trying to learn two things: alternatives to pattern substitution and microbenchmarking. Your comment about qdap's genX contributes to the first effort and makes its inclusion in this benchmarking sort of silly. By the way, how should one remember the function of genX from that cryptic name? — lawyeR, Jul 20 '14 at 13:51
@lawyeR, I cannot speak for the package's author but from the `stringr` DESCRIPTION file: *"Description: stringr is a set of simple wrappers that make R's string functions more consistent, simpler and easier to use. It does this by ensuring that: function and argument names (and positions) are consistent, all functions deal with NA's and zero length character appropriately, and the output data structures from each function matches the input data structures of other functions."* — Rich Scriven, Jul 20 '14 at 15:40
@lawyeR The name `genX` probably isn't the best, but it grew out of the related family and `bracketXtract` (found in the same documentation). If you are familiar with these functions, then `gen` (general) + `X` (remove) becomes less cryptic and fairly easy to remember, as it is a generalized version of `bracketX`. Here's an example where `genX`/`genXtract` is pretty fast (better used): http://stackoverflow.com/a/24779936/1000343 — Tyler Rinker, Jul 20 '14 at 18:38
@RichardScriven: I have been remiss. You helped and commented well here, so if you create an answer based on that, I will accept it. Thanks — lawyeR, Sep 13 '14 at 02:06
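
The side notes from the comments (`sub` versus `gsub`, the `https?:` pattern, and `fixed = TRUE`) can be tried out in base R with a sketch like the one below. It works on the four original URLs rather than the long vector, and the variable names are illustrative only.

```r
sites <- c("http://grand.test.com/", "https://example.com/",
           "http://.big.time.bhfs.com/", "http://test.blogs.mvalaw.com/")

# Each URL contains the pattern at most once, so sub() and gsub() agree here
with_gsub <- gsub("http:", "", sites, perl = TRUE)
with_sub  <- sub("http:", "", sites, perl = TRUE)

# A literal pattern can be matched with fixed = TRUE, bypassing the regex engine
with_fixed <- gsub("http:", "", sites, fixed = TRUE)

print(identical(with_gsub, with_sub))    # TRUE
print(identical(with_gsub, with_fixed))  # TRUE

# "http:" leaves the https URL untouched; "https?:" makes the s optional.
# This pattern needs the regex engine, so it cannot be combined with fixed = TRUE.
both_schemes <- sub("https?:", "", sites)
print(both_schemes)
# "//grand.test.com/" "//example.com/" "//.big.time.bhfs.com/" "//test.blogs.mvalaw.com/"
```

Whether `sub` or `fixed = TRUE` is noticeably faster would still need to be measured on the real data.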
