What's the difference between the str_detect function in stringr and grepl and grep?

Question

I'm starting to do a lot of string matching in my work and I'm curious as to what the differences between the three functions are, and in what situations someone would use one over the other.

Have you checked official documentation? https://www.rdocumentation.org/packages/stringr/versions/1.4.0/topics/str_detect — pogibas, Aug 08 '19 at 12:34
My understanding is that in terms of outcomes they are pretty similar. However, the `stringr` package really just provides consistent/user friendly functions which are wrappers of the `stringi` package. My understanding is that these tend to be faster. — Ben G, Aug 08 '19 at 12:50
I would start by digging into `?str_detect`, `?grepl`, `?grep`, `?str_which`, `?match` / ``%in%``. And definitely check out the stringr package documentation. — Andrew, Aug 08 '19 at 12:55

JBGruber · Accepted Answer · 2021-10-29T15:54:46.410

stringr is "A consistent, simple and easy to use set of wrappers around the fantastic 'stringi' package" (from the package description). The main advantage of stringi is its speed compared to base R, which stringr inherits for the most part. For typical character input without missing values, the output of the functions is the same in base as in stringr.
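One edge case where the outputs are not the same is missing values. The following is a minimal sketch on a made-up two-element vector (the vector `x_na` is hypothetical and not part of the demonstration data used in the rest of the answer):

```r
library(stringr)

# Hypothetical toy input with one missing value
x_na <- c("a", NA)

# Base R reports FALSE for the missing element
base_res <- grepl("a", x_na)
base_res
#> [1]  TRUE FALSE

# stringr propagates the missing value instead
stringr_res <- str_detect(x_na, "a")
stringr_res
#> [1] TRUE   NA
```

If the result feeds into a filter or an `if` condition, this difference may matter, so it is probably worth checking how your data handles `NA` before swapping one function for the other.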

I use stringi to generate some random text for demonstration:

library(stringr)
sample_small <- stringi::stri_rand_lipsum(100)

grep returns the positions of the elements in the character vector that match a pattern, just as its equivalent str_which does:

grep("Lorem", sample_small)
#> [1]  1  9 14 32 45 50 65 93 94
str_which(sample_small, "Lorem")
#> [1]  1  9 14 32 45 50 65 93 94

grepl/str_detect, on the other hand, tell you for each element of the vector whether it contains the pattern or not:

grepl("Lorem", sample_small)
#>   [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
#>  [12] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#>  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [45]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
#>  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#>  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [89] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
#> [100] FALSE
str_detect(sample_small, "Lorem")
#>   [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
#>  [12] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#>  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [45]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
#>  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#>  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [89] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
#> [100] FALSE
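To make the relationship between the two kinds of output explicit, here is a small sketch on a made-up three-element vector (the vector `x` is hypothetical, chosen so the result is easy to check by eye). Passing the logical result to `which()` gives the positions:

```r
library(stringr)

# Hypothetical toy vector for illustration
x <- c("Lorem ipsum", "dolor sit", "amet Lorem")

hits_lgl <- grepl("Lorem", x)  # one TRUE/FALSE per element
hits_idx <- grep("Lorem", x)   # positions of the matching elements

hits_lgl
#> [1]  TRUE FALSE  TRUE
hits_idx
#> [1] 1 3

# which() converts the logical vector into the positions
same_base <- identical(which(hits_lgl), hits_idx)
same_stringr <- identical(which(str_detect(x, "Lorem")), str_which(x, "Lorem"))
```

Both `same_base` and `same_stringr` should come out as `TRUE` here.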

There are many scenarios where the different output format makes a difference. I usually use grepl when I want to add a new column to a data.frame that records whether a different column contains a pattern. grepl makes this easier because its result has the same length as the input vector:

df <- data.frame(sample = sample_small,
                 stringsAsFactors = FALSE)
df$lorem <- grepl("Lorem", sample_small)
df$ipsum <- grepl("ipsum", sample_small)

This way, some more elaborate tests are possible:

which(df$lorem & df$ipsum)
#> [1]  1  5 15 53 71 75

Or directly as a filter rule:

library(dplyr)  # provides %>% and filter()
df %>% 
  filter(str_detect(sample, "Lorem") & str_detect(sample, "ipsum"))

Now, in terms of why to use stringr over base, I think there are two arguments. First, the different argument order (the string comes first, then the pattern) makes it a little easier to use stringr with pipes:

library(dplyr)
sample_small %>% 
  str_detect("Lorem")

compared to:

sample_small %>% 
  grepl("Lorem", .)

Second, stringr is roughly 5x faster than base in these benchmarks (for the two functions we are looking at):

sample_big <- stringi::stri_rand_lipsum(100000)
bench::mark(
  base = grep("Lorem", sample_big),
  stringr = str_which(sample_big, "Lorem")
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base          674ms    674ms      1.48     415KB        0
#> 2 stringr       141ms    142ms      6.99     806KB        0


bench::mark(
  base = grepl("Lorem", sample_big),
  stringr = str_detect(sample_big, "Lorem")
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base          679ms    679ms      1.47     391KB        0
#> 2 stringr       146ms    148ms      6.76     391KB        0

The difference is even more striking when we look for exact (fixed) matches instead of regular expressions, which are the default:

bench::mark(
  base = grepl("Lorem", sample_big, fixed = TRUE),
  stringr = str_detect(sample_big, fixed("Lorem"))
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base          336ms  338.1ms      2.96     391KB        0
#> 2 stringr      12.4ms   12.6ms     79.1      417KB        0
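Note that fixed matching is not only about speed; it can also change what counts as a match. A minimal sketch, using a made-up vector `x_dot` (hypothetical, not part of the benchmark data) and a pattern that contains a regex metacharacter:

```r
library(stringr)

# Hypothetical toy vector: one literal "a.c" and one "abc"
x_dot <- c("a.c", "abc")

# As a regular expression, "." matches any character
regex_res <- grepl("a.c", x_dot)
regex_res
#> [1] TRUE TRUE

# As a fixed string, "." only matches a literal dot
fixed_base <- grepl("a.c", x_dot, fixed = TRUE)
fixed_stringr <- str_detect(x_dot, fixed("a.c"))
fixed_base
#> [1]  TRUE FALSE
fixed_stringr
#> [1]  TRUE FALSE
```

For a plain word such as "Lorem" both modes find the same elements, which is why the benchmark above compares like with like.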

However, I think the base functions have a certain charm to them, which is why I often still use them when writing code quickly. The option fixed = TRUE is one example: wrapping fixed() around the pattern feels just a little awkward to me. Other examples are the option value = TRUE in grep (I'll let you figure that one out yourself) and finally ignore.case = TRUE, which again looks a little awkward in stringr:

str_which(sample_small, regex("Lorem", ignore_case = TRUE))
#>  [1]  1  5  6  8  9 11 12 14 15 17 22 27 30 32 34 35 42 48 51 53 58 64 69
#> [24] 74 76 80 83 86 89 91 92 94 97

However, the reason this is awkward for me is probably just because I used base R for a while before learning stringr.
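For readers who would rather see the base options next to their stringr counterparts, here is a short sketch on a made-up vector (`x_case` is hypothetical). It assumes `str_subset()` as the stringr function closest to `grep(..., value = TRUE)`:

```r
library(stringr)

# Hypothetical toy vector for illustration
x_case <- c("Lorem ipsum", "dolor sit", "lorem amet")

# value = TRUE returns the matching elements instead of their positions
base_values <- grep("lorem", x_case, value = TRUE)
stringr_values <- str_subset(x_case, "lorem")
base_values
#> [1] "lorem amet"

# Case-insensitive matching in both styles
base_nocase <- grep("lorem", x_case, ignore.case = TRUE, value = TRUE)
stringr_nocase <- str_subset(x_case, regex("lorem", ignore_case = TRUE))
base_nocase
#> [1] "Lorem ipsum" "lorem amet"
```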

Another point to consider is that stringi offers even more features overall. So if you are determined to get into string manipulation, you might start learning that package right away, although there are admittedly fewer tutorials and it might be a bit tougher to figure some things out.
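As a rough idea of what calling stringi directly looks like, here is a minimal sketch on a made-up vector (`x_si` is hypothetical). It only uses the detection functions, which are a small slice of the package:

```r
library(stringi)

# Hypothetical toy vector for illustration
x_si <- c("Lorem ipsum", "dolor sit", "lorem amet")

# Regex matching, comparable to str_detect(x_si, "Lorem")
regex_hits <- stri_detect_regex(x_si, "Lorem")

# Literal matching, comparable to str_detect(x_si, fixed("Lorem"))
fixed_hits <- stri_detect_fixed(x_si, "Lorem")

# Case-insensitive regex matching via explicit regex options
nocase_hits <- stri_detect_regex(
  x_si, "lorem",
  opts_regex = stri_opts_regex(case_insensitive = TRUE)
)

regex_hits
#> [1]  TRUE FALSE FALSE
nocase_hits
#> [1]  TRUE FALSE  TRUE
```

The naming scheme (`stri_<action>_<engine>`) puts the matching engine in the function name rather than in a wrapper around the pattern, which some people may find easier to read.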

Thank you this was super helpful and gave me a lot to think and read about moving forward! — Jeffrey Brabec, Aug 08 '19 at 16:21
I cannot reproduce your benchmarks. For both str_detect and str_which, I see no significant difference between them and their base counterparts. Using `fixed()` is faster than `fixed = TRUE`; however setting `perl = TRUE` in base is much faster than any of the stringr versions. On Windows, R 4.0.3. — Hugh, Oct 25 '20 at 02:31
That's quite strange. I repeated the benchmarks just now and get pretty much the same results under R 4.0.3, `stringr` 1.4.0. Also tested it on rstudio.cloud in case my computer was doing something weird. I can confirm though that `perl = TRUE` changes the picture. I never really use it though. Maybe there is a hidden trade-off? — JBGruber, Oct 25 '20 at 08:26
In the first part of your post including the first code block you write about the packages `stringi` and `stringr`, but the way it is written now sounds like you might have mixed the packages up. — saQuist, Oct 29 '21 at 11:33
I added a small clarification, but I don't know what you mean. stringr is built on top of stringi. Not the other way around... — JBGruber, Oct 29 '21 at 15:56
