I'm trying to combine dplyr and stringr to detect multiple patterns in a dataframe. I want to use dplyr as I want to test a number of different columns.

Here's some sample data:

test.data <- data.frame(item = c("Apple", "Bear", "Orange", "Pear", "Two Apples"))
fruit <- c("Apple", "Orange", "Pear")
test.data
        item
1      Apple
2       Bear
3     Orange
4       Pear
5 Two Apples

What I would like to use is something like:

test.data <- test.data %>% mutate(is.fruit = str_detect(item, fruit))

and receive

        item is.fruit
1      Apple        1
2       Bear        0
3     Orange        1
4       Pear        1
5 Two Apples        1

A very simple test with a single string works:

> str_detect("Apple", fruit)
[1]  TRUE FALSE FALSE
> str_detect("Bear", fruit)
[1] FALSE FALSE FALSE

But I can't get this to work over the column of the dataframe, even without dplyr:

> test.data$is.fruit <- str_detect(test.data$item, fruit)
Error in check_pattern(pattern, string) : 
  Lengths of string and pattern not compatible

Does anyone know how to do this?

r.bot

4 Answers

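str_detect is vectorised over both string and pattern, so the pattern must have length 1 or the same length as the string vector. Here five items are compared against three patterns, hence the error. Either turn the patterns into one regex using paste(..., collapse = '|') or check each item against all patterns and combine the results with any: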

sapply(test.data$item, function(x) any(sapply(fruit, str_detect, string = x)))
# Apple       Bear     Orange       Pear Two Apples
#  TRUE      FALSE       TRUE       TRUE       TRUE

str_detect(test.data$item, paste(fruit, collapse = '|'))
# [1]  TRUE FALSE  TRUE  TRUE  TRUE
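
To get the 0/1 column the question asks for inside a dplyr pipe, the collapsed regex can be passed to mutate. This is a minimal sketch reusing the sample data from the question; the names fruit_regex and result are only illustrative, and stringsAsFactors = FALSE is set on the assumption that item should be a character column.

```r
library(dplyr)
library(stringr)

test.data <- data.frame(
  item = c("Apple", "Bear", "Orange", "Pear", "Two Apples"),
  stringsAsFactors = FALSE
)
fruit <- c("Apple", "Orange", "Pear")

# Collapse the patterns into a single regex: "Apple|Orange|Pear"
fruit_regex <- paste(fruit, collapse = "|")

# str_detect returns a logical vector; as.integer turns it into 0/1
result <- test.data %>%
  mutate(is.fruit = as.integer(str_detect(item, fruit_regex)))

result$is.fruit
```

Note that the collapsed string is treated as a regular expression, so entries containing metacharacters such as . or ( would need escaping before being pasted together.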
Robert Krzyzanowski

This simple approach works fine for EXACT matches:

test.data %>% mutate(is.fruit = item %in% fruit)
# A tibble: 5 x 2
        item is.fruit
       <chr>    <lgl>
1      Apple     TRUE
2       Bear    FALSE
3     Orange     TRUE
4       Pear     TRUE
5 Two Apples    FALSE

This approach works for partial matching (which is the question asked):

test.data %>% 
  rowwise() %>% 
  mutate(is.fruit = sum(str_detect(item, fruit)))

Source: local data frame [5 x 2]
Groups: <by row>

# A tibble: 5 x 2
        item is.fruit
       <chr>    <int>
1      Apple        1
2       Bear        0
3     Orange        1
4       Pear        1
5 Two Apples        1
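
One thing to be aware of with sum is that it counts how many patterns matched, so an item containing two fruits would get 2 rather than 1. The sketch below shows the difference on a small made-up data frame (the "Apple and Pear" row and the column name n.matches are assumptions for illustration, not part of the question's data); wrapping any in as.integer keeps the result at 0/1.

```r
library(dplyr)
library(stringr)

fruit <- c("Apple", "Orange", "Pear")

# Hypothetical data with one item that matches two patterns
test.data2 <- data.frame(
  item = c("Apple", "Bear", "Apple and Pear"),
  stringsAsFactors = FALSE
)

result <- test.data2 %>%
  rowwise() %>%
  mutate(
    n.matches = sum(str_detect(item, fruit)),             # number of patterns matched
    is.fruit  = as.integer(any(str_detect(item, fruit)))  # 1 if at least one matched
  ) %>%
  ungroup()  # drop the row-wise grouping again

result
```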
Henrik
  • This only works if there are exact matches, in which case using `str_detect` rather than `==` or `%in%` is superfluous. – Alex Gold Aug 31 '17 at 15:57
  • Ah, you are right, Alex. I read the question a bit fast, I guess. I have updated the answer. – Henrik Sep 11 '17 at 13:50

The map functions from purrr make this easy to use inside a pipe and let you control the type of the result: map_int returns an integer vector, map_lgl returns a logical vector.

library(dplyr)    # mutate, %>%
library(stringr)  # str_detect
library(purrr)    # map_int

test.data %>%
    mutate(is.fruit = map_int(item, ~any(str_detect(., fruit))))

        item is.fruit
1      Apple        1
2       Bear        0
3     Orange        1
4       Pear        1
5 Two Apples        1
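
If a logical column is preferred, the same call with map_lgl might look like the following. This is a minimal sketch reusing the sample data from the question; stringsAsFactors = FALSE is an assumption so that item is mapped over as character strings.

```r
library(dplyr)
library(stringr)
library(purrr)

test.data <- data.frame(
  item = c("Apple", "Bear", "Orange", "Pear", "Two Apples"),
  stringsAsFactors = FALSE
)
fruit <- c("Apple", "Orange", "Pear")

# For each item, test it against every pattern and keep TRUE if any matched
result <- test.data %>%
  mutate(is.fruit = map_lgl(item, ~ any(str_detect(.x, fruit))))

result$is.fruit
```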
GGAnderson

Alternatively, if you only need to keep the rows that contain one of those strings (the fruits in your case), you can filter on the pattern directly:

test.data %>%
  filter(str_detect(item, "Apple|Orange|Pear"))

The output will be:

        item
1      Apple
2     Orange
3       Pear
4 Two Apples
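
Since the question mentions testing several columns, the same filter could be extended with if_any, assuming a dplyr version that provides it (1.0.4 or later). The sketch below is only an illustration: the note column and its values are made up, and the pattern is built from the fruit vector instead of being typed by hand.

```r
library(dplyr)
library(stringr)

fruit <- c("Apple", "Orange", "Pear")
fruit_regex <- paste(fruit, collapse = "|")  # "Apple|Orange|Pear"

# Hypothetical data with a second text column to search
df <- data.frame(
  item = c("Apple", "Bear", "Orange", "Bear"),
  note = c("x", "has Pear", "y", "grizzly"),
  stringsAsFactors = FALSE
)

# Keep a row if the pattern is found in at least one of the listed columns
kept <- df %>%
  filter(if_any(c(item, note), ~ str_detect(.x, fruit_regex)))

kept
```

Here the last row is dropped because neither item nor note contains a fruit name.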
Sandy