7

I have a sequence of variables in a dataframe (over 100) and I would like to create an indicator variable for if particular text patterns are present in any of the variables. Below is an example with three variables. One solution I've found is using tidyr::unite() followed by dplyr::mutate(), but I'm interested in a solution where I do not have to unite the variables.

c1<-c("T1", "X1", "T6", "R5")
c2<-c("R4", "C6", "C7", "X3")
c3<-c("C5", "C2", "X4", "T2")

df<-data.frame(c1, c2, c3)

  c1 c2 c3
1 T1 R4 C5
2 X1 C6 C2
3 T6 C7 X4
4 R5 X3 T2

code.vec<-c("T1", "T2", "T3", "T4") #Text patterns of interest
code_regex<-paste(code.vec, collapse="|")

new<-df %>% 
  unite(all_c, c1:c3, remove=FALSE) %>% 
  mutate(indicator=if_else(grepl(code_regex, all_c), 1, 0)) %>% 
  select(-(all_c))

  c1 c2 c3 indicator
1 T1 R4 C5 1
2 X1 C6 C2 0
3 T6 C7 X4 0
4 R5 X3 T2 1

Above is an example that produces the desired result, however I feel as if there should be a way of doing this in tidyverse without having to unite the variables. This is something that SAS handles very easily using an ARRAY statement and a DO loop, and I'm hoping R has a good way of handling this.

The real dataframe has many additional variables besides from the "c" fields to search, so a solution that involves searching every column would require subsetting the dataframe to first only contain the variables I want to search, and then joining the data back with the other variables.

  • You said you don't want to use `unite`, but it's worth noting that passing the argument `remove = FALSE` has `unite` create a column of the united variables leaving the others intact. Might be convenient in this case. – camille Apr 22 '19 at 14:55
  • Yes, it is convenient. And it does work. I just feel like there may be a simpler approach I'm missing that doesn't need to create a united variable. – patward5656 Apr 22 '19 at 15:06

3 Answers3

6

Using base R, we can use sapply and use grepl to find pattern in every column and assign 1 to rows where there is more than 0 matches.

df$indicator <- as.integer(rowSums(sapply(df, grepl, pattern = code_regex)) > 0)

df
#  c1 c2 c3 indicator
#1 T1 R4 C5         1
#2 X1 C6 C2         0
#3 T6 C7 X4         0
#4 R5 X3 T2         1

If there are few other columns and we are interested to apply it only for columns which start with "c" we can use grep to filter them.

cols <- grep("^c", names(df))
as.integer(rowSums(sapply(df[cols], grepl, pattern = code_regex)) > 0)

Using dplyr we can do

library(dplyr)

df$indicator <- as.integer(df %>%
              mutate_at(vars(c1:c3), ~grepl(code_regex, .)) %>%
              rowSums() > 0)
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • This is a good solution, but in the real data there are additional variables that I do not want to pattern search, so this would require me to index the dataframe to include only the columns I want to search first. Will edit my original post to include this information. – patward5656 Apr 22 '19 at 14:37
  • The purr solution looks like what I was looking for--one line of code that doesn't involve uniting the variables. – patward5656 Apr 22 '19 at 14:42
  • @patward5656 I think the `purrr` solution would not give you the expected output. I changed it to use `mutate_at` which should work on range of columns. Moreover, you can use column numbers directly in `cols` for `sapply` ., say columns `3:5` or `1:3` to find pattern in those column. – Ronak Shah Apr 22 '19 at 14:52
3

We can use tidyverse

library(tidyverse)
df %>%
    mutate_all(str_detect, pattern = code_regex) %>%
    reduce(`+`) %>% 
    mutate(df, indicator = .)
#  c1 c2 c3 indicator
#1 T1 R4 C5         1
#2 X1 C6 C2         0
#3 T6 C7 X4         0
#4 R5 X3 T2         1

Or using base R

Reduce(`+`, lapply(df, grepl, pattern = code_regex))
#[1] 1 0 0 1
akrun
  • 874,273
  • 37
  • 540
  • 662
  • This `tidyverse` solution seems to only work in the scenario where all of the columns are being searched. I have other variables in my real dataset, and when using it for that the output is all NA. Does this have something to do with the `reduce` function? – patward5656 Apr 22 '19 at 15:40
  • @patward5656 That is an easy fix. `df %>% mutate_at(vars(starts_with("c")), str_detect, pattern = code_regex) %>% reduce("+") %>% mutate(df, indicator = .)` – akrun Apr 22 '19 at 15:41
  • `c1<-c("T1", "X1", "T6", "R5") c2<-c("R4", "C6", "C7", "X3") c3<-c("C5", "C2", "X4", "T2") z1<-c("C5", "C2", "X4", "T2") df<-data.frame(c1, c2, c3, z1) df %>% mutate_at(vars(starts_with("c")), str_detect, pattern = code_regex) %>% reduce(`+`) %>% mutate(df, indicator = .) c1 c2 c3 z1 indicator 1 T1 R4 C5 C5 NA 2 X1 C6 C2 C2 NA 3 T6 C7 X4 X4 NA 4 R5 X3 T2 T2 NA Warning message: In Ops.factor(.x, .y) : ‘+’ not meaningful for factors ` This produced NAs, it seems. – patward5656 Apr 22 '19 at 15:59
  • 1
    @patward5656 I would use `transmute_at` instead of `mutate_at` `df %>% transmute_at(vars(starts_with("c")), str_detect, pattern = code_regex) %>% reduce(`+`)` – akrun Apr 22 '19 at 16:08
  • You're right. I changed them to `chr`, and now the `reduce` function produces this error: `Error in .x + .y : non-numeric argument to binary operator` – patward5656 Apr 22 '19 at 16:09
  • @patward5656 Use `transmute_at` as I commented above. It works for me `df %>% transmute_at(vars(starts_with("c")), str_detect, pattern = code_regex) %>% reduce(`+`)# [1] 1 0 0 1` – akrun Apr 22 '19 at 16:10
  • 1
    Thanks. I believe `transmute_at()` solves it perfectly. – patward5656 Apr 22 '19 at 16:10
1

Base R with apply

apply(df[cols], 1, function(x) sum(grepl(code_regex, x)))
# [1] 1 0 0 1
cropgen
  • 1,920
  • 15
  • 24