Why can't I supply str_detect with a column name argument?

Question

I have this toy data as df:

structure(list(Product_Name = c("Delicious Chips", "Creamy Tomato Soup", 
"Cheesy Macaroni", "Savory Meatballs", "Crispy Chicken Tenders"
), Ingredients = c("Potato Slices | Vegetable Oil | Salt | Seasoning Blend", 
"Tomatoes | Water | Cream | Onions | Salt | Spices", "Macaroni | Cheese Sauce | Milk | Butter | Salt | Pepper", 
"Ground Meat | Breadcrumbs | Onions | Garlic | Spices", "Chicken Tenders | Breading Mix | Vegetable Oil | Salt | Pepper"
)), row.names = c(NA, 5L), class = "data.frame")

Here I want to find which rows contain "Salt" in the Ingredients variable.

Using library(tidyverse), initially I try df %>% str_detect(Ingredients, "Salt") but I get Error: object 'Ingredients' not found.

But when I change it to df %>% filter(str_detect(Ingredients, "Salt")) it returns a dataframe with the products matching the string.

I thought str_detect needs a character vector or something coercible to one and I thought that Ingredients fit that because when I do class(df$Ingredients) it returns character. Why won't it take Ingredients as an argument and what changes when it is wrapped into filter()?

I'm not sure what your intention is with the `str_detect`, but the function `str_detect` goes inside `mutate`/`filter` or a similar function - not on its own. E.g. `df %>% mutate(salt_flag = str_detect(Ingredients, "Salt"))` — thelatemail, Aug 17 '23 at 23:18
Thank you, that's good to know, but (again for my own learning) why does this work: `fruit <- c("apple", "banana", "pear", "pineapple"); str_detect(fruit, "a")` which is from the str_detect documentation -- and if it should always go into a mutate() or similar, how do I learn that or find it out?! — Jay Bee, Aug 17 '23 at 23:20
`mutate` has a `...` argument which allows all the columns of `df` to be passed through as separate objects, which then can be picked up by `str_detect` when it is nested inside `mutate`. `str_detect` only has arguments for `string=` and `pattern=` and no `...`, so needs a string directly passed in. So you could do something like `df %>% pull(Ingredients) %>% str_detect(pattern="Salt")` to sort it out as well. — thelatemail, Aug 17 '23 at 23:24
Also worth mentioning - without the complication of `%>%` you could also do `str_detect(df$Ingredients, "Salt")` by selecting the column explicitly from the `df` object in the global environment/workspace. And you could assign that back using base R logic then too - `df$salt_flag <- str_detect(df$Ingredients, "Salt")` — thelatemail, Aug 17 '23 at 23:46
I could be persuaded that this should be opened again, but I see any answer people give to be along the same lines as the ones above - you can't do a mutate or filter or something outside of that context, because most functions don't work like that. str_detect takes a string and a pattern as an argument, not a dataframe, a string, and a pattern. While this could be the context for explaining that, it seems like a fairly universal point that I'd be surprised if it hasn't been answered in some way before — Mark, Aug 18 '23 at 14:07

score 2 · Accepted Answer · edited Aug 20 '23 at 19:14

Many tidyverse functions (e.g., those in dplyr) use data masking, which lets you refer to the columns of a data frame by their bare, unquoted names, as if they were variables in the environment. We can see this when we use dplyr::filter:

library(dplyr)

df |> 
  filter(Product_Name == "Savory Meatballs")
#>       Product_Name                                          Ingredients
#> 1 Savory Meatballs Ground Meat | Breadcrumbs | Onions | Garlic | Spices

Here filter is looking for and using the variable "Product_Name" within df, not within your global environment.
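As a minimal sketch of that lookup (using a cut-down, two-row version of df so the block is self-contained; the object name kept is only illustrative), the following checks that no Product_Name object needs to exist in the calling environment for filter to resolve it:

```r
library(dplyr)

# Cut-down stand-in for the df in the question (two rows only).
df <- data.frame(
  Product_Name = c("Delicious Chips", "Savory Meatballs"),
  Ingredients  = c("Potato Slices | Vegetable Oil | Salt",
                   "Ground Meat | Breadcrumbs | Onions")
)

# Product_Name is a column of df, not a standalone object.
# In a fresh session this lookup finds nothing.
has_global <- exists("Product_Name")

# filter() still resolves the bare name, because it evaluates the
# expression with the columns of df in scope (data masking).
kept <- filter(df, Product_Name == "Savory Meatballs")
```

The same bare name used outside a data-masking function would fail with an "object not found" error, which is what happens with str_detect in the question.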

However, str_detect, and most of the other functions from the stringr package, do not have this capability. As others have noted, you can nest your str_detect call within mutate or filter to see these results. But if you wanted to just pass along Ingredients to str_detect you can use the with function (more info about with on r-bloggers). This is what that looks like:

library(stringr)

df |>
  with(str_detect(Ingredients, "Salt"))
#> [1]  TRUE  TRUE  TRUE FALSE  TRUE

It does something very similar to what those dplyr functions are doing behind the scenes: rather than looking for a variable named "Ingredients" in your global environment (which is not defined because that is not what you want, you want it to be looking for "Ingredients" within df), it treats the first argument (df) as its own environment and looks for a variable called "Ingredients" in that environment instead.
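To make the comparison concrete, here is a rough sketch (again on a cut-down, two-row df; the via_* and pipe_fails names are just for illustration). It assumes that no object called Ingredients exists in your workspace. The pipe places df in the first argument slot, so df %>% str_detect(Ingredients, "Salt") is the call str_detect(df, Ingredients, "Salt"), and Ingredients is then looked up where it does not exist.

```r
library(stringr)

# Cut-down stand-in for the df in the question (two rows only).
df <- data.frame(
  Product_Name = c("Delicious Chips", "Savory Meatballs"),
  Ingredients  = c("Potato Slices | Vegetable Oil | Salt",
                   "Ground Meat | Breadcrumbs | Onions")
)

# What the piped call in the question expands to. Nothing here tells R
# to look inside df for Ingredients, so the call errors.
pipe_fails <- inherits(
  try(str_detect(df, Ingredients, "Salt"), silent = TRUE),
  "try-error"
)

# with() evaluates the expression with the columns of df in scope.
via_with <- with(df, str_detect(Ingredients, "Salt"))

# Roughly the same idea spelled out in base R: evaluate the quoted call
# using df as the place to look up names first.
via_eval <- eval(quote(str_detect(Ingredients, "Salt")), df)

# Selecting the column explicitly avoids the lookup question entirely.
via_dollar <- str_detect(df$Ingredients, "Salt")
```

All three working variants hand str_detect an ordinary character vector; they differ only in how the name Ingredients gets resolved to the column.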
