I'm starting to do a lot of string matching in my work and I'm curious as to what the differences between the three functions are, and in what situations someone would use one over the other.
-
Have you checked official documentation? https://www.rdocumentation.org/packages/stringr/versions/1.4.0/topics/str_detect – pogibas Aug 08 '19 at 12:34
-
My understanding is that in terms of outcomes they are pretty similar. However, the `stringr` package really just provides consistent/user friendly functions which are wrappers of the `stringi` package. My understanding is that these tend to be faster. – Ben G Aug 08 '19 at 12:50
-
I would start by digging into `?str_detect`, `?grepl`, `?grep`, `?str_which`, `?match` / ``%in%``. And definitely check out the stringr package documentation. – Andrew Aug 08 '19 at 12:55
1 Answer
`stringr` is "A consistent, simple and easy to use set of wrappers around the fantastic 'stringi' package" (from the package description). The main advantage of `stringi` is its speed compared to base R, which `stringr` inherits for the most part. For non-missing input, the functions compared here return the same output in base as in `stringr`.
I use `stringi` to generate some random text for demonstration:
library(stringr)
sample_small <- stringi::stri_rand_lipsum(100)
`grep` returns the positions of the elements in the character vector that match a pattern, just as its equivalent `str_which` does:
grep("Lorem", sample_small)
#> [1] 1 9 14 32 45 50 65 93 94
str_which(sample_small, "Lorem")
#> [1] 1 9 14 32 45 50 65 93 94
`grepl`/`str_detect`, on the other hand, tell you for each element of the vector whether it contains the pattern or not:
grepl("Lorem", sample_small)
#> [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
#> [12] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
#> [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [45] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
#> [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
#> [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [89] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
#> [100] FALSE
str_detect(sample_small, "Lorem")
#> [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
#> [12] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
#> [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [45] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
#> [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
#> [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [89] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
#> [100] FALSE
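The two result shapes are interchangeable: wrapping the logical vector in `which()` gives the positions that `grep`/`str_which` return. One place where base and `stringr` do differ is missing values. The sketch below uses a small made-up vector `x` (an assumption for illustration, not part of the example data above) to show both points:

```r
library(stringr)

# Hypothetical toy input, chosen so the results are easy to check by eye
x <- c("Lorem ipsum", "dolor sit", NA, "lorem")

# Logical vector -> positions: which() drops both FALSE and NA
pos_base    <- which(grepl("Lorem", x))
pos_stringr <- which(str_detect(x, "Lorem"))
identical(pos_base, grep("Lorem", x))          # same positions as grep()
identical(pos_stringr, str_which(x, "Lorem"))  # same positions as str_which()

# Missing values: grepl() reports FALSE, str_detect() propagates NA
na_base    <- grepl("Lorem", x)
na_stringr <- str_detect(x, "Lorem")
```

So if the input can contain `NA`, a new column built with `str_detect` keeps the `NA`, while one built with `grepl` silently turns it into `FALSE`.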
There are many scenarios where the different outcome could make a difference for you. I usually use `grepl` if I'm interested in adding a new column to a data.frame that records whether a different column contains a pattern. `grepl` makes this easier as its output has the same length as the input variable:
df <- data.frame(sample = sample_small,
                 stringsAsFactors = FALSE)
df$lorem <- grepl("Lorem", sample_small)
df$ipsum <- grepl("ipsum", sample_small)
This way, some more elaborate tests are possible:
which(df$lorem & df$ipsum)
#> [1] 1 5 15 53 71 75
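Or directly as a `filter` rule on the `sample` column (this needs `dplyr`; note that `str_detect` takes the string first and the pattern second):
library(dplyr)
df %>%
  filter(str_detect(sample, "Lorem") & str_detect(sample, "ipsum"))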
Now in terms of why to use `stringr` over base, I think there are two arguments. First, the different syntax makes it a little bit easier to use `stringr` with pipes:
library(dplyr)
sample_small %>%
str_detect("Lorem")
compared to:
sample_small %>%
grepl("Lorem", .)
Second, `stringr` is roughly 5x faster than base (for the two functions we are looking at):
sample_big <- stringi::stri_rand_lipsum(100000)
bench::mark(
base = grep("Lorem", sample_big),
stringr = str_which(sample_big, "Lorem")
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 base 674ms 674ms 1.48 415KB 0
#> 2 stringr 141ms 142ms 6.99 806KB 0
bench::mark(
base = grepl("Lorem", sample_big),
stringr = str_detect(sample_big, "Lorem")
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 base 679ms 679ms 1.47 391KB 0
#> 2 stringr 146ms 148ms 6.76 391KB 0
The difference is even more striking when we look for fixed strings instead (the default is to treat the pattern as a regular expression):
bench::mark(
base = grepl("Lorem", sample_big, fixed = TRUE),
stringr = str_detect(sample_big, fixed("Lorem"))
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 base 336ms 338.1ms 2.96 391KB 0
#> 2 stringr 12.4ms 12.6ms 79.1 417KB 0
However, I think the base functions have a certain charm to them, which is why I often still use them when writing code quickly. The option `fixed = TRUE` is one example: wrapping `fixed()` around the pattern feels just a little awkward to me. Other examples would be the option `value = TRUE` in `grep` (I'll let you figure that one out yourself) and finally `ignore.case = TRUE`, which again looks a little awkward in `stringr`:
str_which(sample_small, regex("Lorem", ignore_case = TRUE))
#> [1] 1 5 6 8 9 11 12 14 15 17 22 27 30 32 34 35 42 48 51 53 58 64 69
#> [24] 74 76 80 83 86 89 91 92 94 97
However, the reason this is awkward for me is probably just that I used base R for a while before learning `stringr`.
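For reference, here is a short sketch of the base options mentioned above next to what I'd consider their closest `stringr` counterparts. The vector `x` is a made-up toy input, and pairing `value = TRUE` with `str_subset` is my reading of the two APIs rather than an official mapping:

```r
library(stringr)

x <- c("Lorem ipsum", "dolor sit", "lorem")  # hypothetical toy input

# value = TRUE returns the matching elements instead of their positions
vals_base    <- grep("Lorem", x, value = TRUE)
vals_stringr <- str_subset(x, "Lorem")

# Case-insensitive matching: an argument in base, a pattern modifier in stringr
ci_base    <- grep("lorem", x, ignore.case = TRUE)
ci_stringr <- str_which(x, regex("lorem", ignore_case = TRUE))
```

In base the behaviour is switched through extra arguments of the same function, while `stringr` tends to use a separate function (`str_subset`) or a wrapper around the pattern (`regex()`, `fixed()`).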
Another point to consider is that `stringi` has even more features overall. So if you are determined to get into string manipulation, you might start learning that package right away, although there are admittedly fewer tutorials and it might be a bit tougher to figure some things out.
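To give a rough idea of what calling `stringi` directly looks like, here is a minimal sketch on a made-up toy vector `x` (an assumption for illustration). The function names encode both the operation and the matching engine, e.g. `stri_detect_fixed` versus `stri_detect_regex`:

```r
library(stringi)

x <- c("Lorem ipsum", "dolor sit", "lorem")  # hypothetical toy input

# Literal (fixed-string) matching, one logical value per element
fixed_hits <- stri_detect_fixed(x, "Lorem")

# Regex matching; case_insensitive is passed on to the regex engine options
regex_hits <- stri_detect_regex(x, "^lorem", case_insensitive = TRUE)

# Count the occurrences of a literal substring in each element
n_hits <- stri_count_fixed(x, "o")
```

The string comes first and the pattern second, the same argument order that `stringr` uses.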

-
Thank you this was super helpful and gave me a lot to think and read about moving forward! – Jeffrey Brabec Aug 08 '19 at 16:21
-
I cannot reproduce your benchmarks. For both str_detect and str_which, I see no significant difference between them and their base counterparts. Using `fixed()` is faster than `fixed = TRUE`; however setting `perl = TRUE` in base is much faster than any of the stringr versions. On Windows, R 4.0.3. – Hugh Oct 25 '20 at 02:31
-
That's quite strange. I repeated the benchmarks just now and get pretty much the same results under R 4.0.3, `stringr` 1.4.0. Also tested it on rstudio.cloud in case my computer was doing something weird. I can confirm though that `perl = TRUE` changes the picture. I never really use it though. Maybe there is a hidden trade-off? – JBGruber Oct 25 '20 at 08:26
-
In the first part of your post including the first code block you write about the packages `stringi` and `stringr`, but the way it is written now sounds like you might have mixed the packages up. – saQuist Oct 29 '21 at 11:33
-
I added a small clarification, but I don't know what you mean. stringr is built on top of stringi. Not the other way around... – JBGruber Oct 29 '21 at 15:56