
Given a vector of strings (a column of a data frame), I'm trying to identify which string an excerpt comes from.

In the following example, excerpt_of_string is an excerpt (specifically the first 119 characters) from the second element in vector_of_strings:

excerpt_of_string <- "Considering utilizing eLearning days for snow make-up? Join us on 12/8 for Snow day, sNOw problem! Details https://t.co"

vector_of_strings <- c("Meow", 
                       "Considering utilizing eLearning days for snow make-up? Join us on 12/8 for Snow day, sNOw problem! Details https://t.co/LfbPne3uuo #INeLearn", 
                       "Bark")

I first tried grepl, expecting the second element of vector_of_strings to be TRUE, but every element came back FALSE:

grepl(excerpt_of_string, vector_of_strings)
[1] FALSE FALSE FALSE

I also tried str_detect from the stringr package:

stringr::str_detect(vector_of_strings, excerpt_of_string)
[1] FALSE FALSE FALSE

Why are these methods not detecting the excerpt excerpt_of_string in the second element of vector_of_strings?

Joshua Rosenberg
  • Use `stringr::fixed(excerpt_of_string)` instead of `excerpt_of_string`. – nrussell Dec 29 '15 at 19:05
  • That works, thanks (I used `stringr::str_detect(vector_of_strings, stringr::fixed(excerpt_of_string))`). The help says it is used to "Compare literal bytes in the string." Can you help me understand what that means and why this works? – Joshua Rosenberg Dec 29 '15 at 19:07
  • I don't have time to track down the specific point in your `pattern` string where things break down, but it's almost certainly due to the fact that it contains characters like `?`, `!`, `:`, and `.`, which generally aren't interpreted literally by regex engines. You need `fixed = TRUE` (`grepl`) or `fixed(...)` (`stringr`) to search for literal character strings. – nrussell Dec 29 '15 at 19:15
  • Read the **Extended Regular Expressions** section in `?regex`, particularly *"The metacharacters in extended regular expressions are . \ | ( ) [ { ^ $ * + ?, but note that whether these have a special meaning depends on the context."* – nrussell Dec 29 '15 at 19:17

1 Answer


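It's not detecting the excerpt because the pattern is interpreted as a regular expression, and your string contains metacharacters. In particular, the ? after make-up is read as a quantifier that makes the preceding p optional, so the pattern never matches the literal question mark in the text.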
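To see the effect in isolation, here is a minimal sketch using a shortened piece of the string (the variable name tweet and the shortened text are made up for illustration, they are not from the question):

```r
# Shortened, illustrative version of the text from the question
tweet <- "snow make-up? Join us"

# Unescaped: "p?" means "an optional p", so the regex expects
# "make-u" or "make-up" followed directly by " Join".
# The literal "?" in the text gets in the way.
grepl("make-up? Join", tweet)
# [1] FALSE

# Escaping the metacharacter makes it match a literal question mark
grepl("make-up\\? Join", tweet)
# [1] TRUE

# The unescaped pattern does match text that has no question mark at all
grepl("make-up? Join", "snow make-up Join us")
# [1] TRUE
```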

You can treat the entire pattern as a literal string by passing the fixed=TRUE argument.

grepl(excerpt_of_string, vector_of_strings, fixed=TRUE)
# [1] FALSE  TRUE FALSE

Alternatively, wrap the pattern in \Q ... \E, which makes the regex engine treat everything in between as literal characters.

grepl(paste0('\\Q', excerpt_of_string, '\\E'), vector_of_strings)
# [1] FALSE  TRUE FALSE
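
Since the goal is to identify which string the excerpt comes from, a sketch along these lines may help. It reuses the objects from the question; the name hits is introduced here for illustration, and startsWith() assumes R 3.3.0 or later and only applies because the excerpt is taken from the start of the string:

```r
excerpt_of_string <- "Considering utilizing eLearning days for snow make-up? Join us on 12/8 for Snow day, sNOw problem! Details https://t.co"

vector_of_strings <- c("Meow",
                       "Considering utilizing eLearning days for snow make-up? Join us on 12/8 for Snow day, sNOw problem! Details https://t.co/LfbPne3uuo #INeLearn",
                       "Bark")

# Literal (non-regex) substring search
hits <- grepl(excerpt_of_string, vector_of_strings, fixed = TRUE)

# Position of the matching element(s)
which(hits)
# [1] 2

# The matching string(s) themselves
vector_of_strings[hits]

# Prefix-only comparison, no regex involved (assumes R >= 3.3.0)
startsWith(vector_of_strings, excerpt_of_string)
# [1] FALSE  TRUE FALSE
```

With a data frame column, the same logical vector can be used to subset rows.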
hwnd