6

I have a list that contains multiple strings for each observation (see below).

  [1] A, C, D 
  [2] P, O, E
  [3] W, E, W
  [4] S, B, W

I want to test whether each string contains certain substrings and, if so, return the matching substring. In this example that would be either "A" or "B" (see the desired outcome below). Each observation will contain at most one of the two substrings (A|B).

  [1] A 
  [2] NA
  [3] NA
  [4] B

Now I have made this attempt at solving it, but it seems very inefficient and I also cannot get it to work. How could I solve it?

  if (i == "A") {
    type <- "A"
  } else if { (i == "B") 
    type <- "B" 
  } else { type <- "NA"
  } 

Note: I will need to loop it through > 1000 observations

Carolin
  • You can use `grepl` to find the strings that contain the letters you're looking for. For each string where that's true, you can use `regexec` and `regmatches` to return the matched pattern (if any). – Gautam Jun 12 '18 at 14:17

5 Answers

7

Assuming you have a character vector, you can use stringr::str_extract for this purpose:

s <- c('A, C, D', 'P, O, E', 'W, E, W', 'S, B, W')
s
# [1] "A, C, D" "P, O, E" "W, E, W" "S, B, W"
stringr::str_extract(s, 'A|B')
# [1] "A" NA  NA  "B"

If a word match is preferred, use word boundaries \\b:

stringr::str_extract(s, '\\b(A|B)\\b')
# [1] "A" NA  NA  "B"

If the substrings are delimited by ", ", you can use the regex (?<=^|, )(A|B)(?=,|$):

# use the test case from G.Grothendieck
stringr::str_extract(c("A.A, C", "D, B"), '(?<=^|, )(A|B)(?=,|$)')
# [1] NA  "B"
Psidom
  • works perfectly and is very efficient in doing the job, thank you! – Carolin Jun 12 '18 at 15:10
  • Can give the wrong answer if the strings can be substrings of each other, e.g. `s <- c("AA, C", "D, B")`. – G. Grothendieck Jun 12 '18 at 15:38
  • @G.Grothendieck Yes. But I think that depends on what you want the output to be, the shorter match or the longer one. Sorting the patterns by length beforehand should handle it. – Psidom Jun 12 '18 at 15:39
  • In the example in my comment it gives c("A", "B") but the correct answer is c(NA, "B") – G. Grothendieck Jun 12 '18 at 15:41
  • @G.Grothendieck Added the word boundaries. But the case could go pretty complex and unbounded for sure. – Psidom Jun 12 '18 at 15:46
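The idea from the comments of sorting the patterns by length could be sketched roughly as follows. This is only an illustration: the patterns in pats and the strings in s2 are made up, and base R's regexpr(..., perl = TRUE) is swapped in for str_extract so the snippet runs without extra packages. With PCRE-style alternation the first alternative that matches at a position wins, so putting the longer patterns first makes "AA" win over "A".

```r
# Hypothetical patterns, one of which is a substring of another
pats <- c("A", "AA", "B")
s2 <- c("AA, C", "D, B", "P, O, E")

# Sort longest first so the alternation tries "AA" before "A"
pats_sorted <- pats[order(nchar(pats), decreasing = TRUE)]
rx <- paste(pats_sorted, collapse = "|")

# perl = TRUE: alternatives are tried left to right
m <- regexpr(rx, s2, perl = TRUE)      # -1 where nothing matches

res <- rep(NA_character_, length(s2))  # default: no match
res[m > 0] <- regmatches(s2, m)        # regmatches() returns only real matches
res
# [1] "AA" "B"  NA
```

Whether the longer or the shorter match is the right one still depends on the data, as noted in the comments; word boundaries or delimiter-aware lookarounds remain the safer option when elements must match exactly.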
2

Without using a package, and working only with vectors:

vec <- c('A, C, D', 
         'P, O, E', 
         'W, E, W', 
         'S, B, W')

ifelse(grepl('A', vec), 'A', ifelse(grepl('B', vec), 'B', NA))

You can simplify this further but I left it in the expanded form so you can see how it works.
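One possible simplification, shown here as a sketch rather than as what the answer had in mind, is to let regexpr and regmatches do the lookup in a single pass. It returns the first "A" or "B" found in each string, which matches the nested ifelse as long as each observation contains at most one of the two (as stated in the question).

```r
vec <- c('A, C, D', 'P, O, E', 'W, E, W', 'S, B, W')

# regexpr() gives the position of the first "A" or "B" (-1 if none)
m <- regexpr('[AB]', vec)

out <- rep(NA_character_, length(vec))  # default: no match
out[m != -1] <- regmatches(vec, m)      # fill in the matched letter
out
# [1] "A" NA  NA  "B"
```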

Gautam
2

Below we provide strapply and base solutions. The strapply solution is very short, but it can give wrong results if a string to be matched can occur as a substring of another element (for example, A inside AA). That does not happen in the question, so it should work there. The base solution works even in that case because it uses exact matches rather than regular expressions.

1) strapply (gsubfn) Use strapply in gsubfn. Omit simplify=TRUE if you want a list as output. [AB] can be replaced with A|B if need be.

library(gsubfn)

strapply(x, "[AB]", empty = NA, simplify = TRUE)
## [1] "A" NA  NA  "B"

2) base Split each input string on ", " and, for each resulting vector, use Filter to keep only the elements equal to A or B, giving the list L. L may be sufficient for your needs; if not, the last line replaces zero-length elements with NA and simplifies the list to a vector.

L <- lapply(strsplit(x, ", "), Filter, f = function(x) x %in% c("A", "B"))
unlist(replace(L, !lengths(L), NA))
## [1] "A" NA  NA  "B"

Note

x <- c("A, C, D", "P, O, E", "W, E, W", "S, B, W")
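To illustrate why exact matching copes with the substring case, here is a small sketch of the same split-and-compare idea. The helper name match_elements and its arguments are hypothetical, and unlike the code above it keeps only the first hit per observation so that the result always has one entry per input string.

```r
# Hypothetical helper based on the split-and-compare idea above
match_elements <- function(x, targets, sep = ", ") {
  # split each string into its elements and keep those equal to a target
  L <- lapply(strsplit(x, sep, fixed = TRUE),
              function(el) el[el %in% targets])
  # first hit per observation, NA where there is none
  vapply(L, function(el) if (length(el)) el[1] else NA_character_,
         character(1))
}

match_elements(c("AA, C", "D, B", "A, C, D"), c("A", "B"))
# [1] NA  "B" "A"
```

"AA" is not equal to "A", so the first observation correctly yields NA, whereas a plain regular expression such as [AB] would report "A".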
G. Grothendieck
0

In Base R you can loop over the strings to detect and assign them to an output with [ and <- ([<-).

invec <- c(
  'A, C, D',
  'P, O, E',
  'W, E, W',
  'S, B, W')

out <- rep(NA, length(invec))
for(x in c('A', 'B')) out[grep(x, invec)] <- x
out
#[1] "A" NA  NA  "B"
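If the patterns should be treated as literal text rather than regular expressions, the same loop can be wrapped in a small function that passes fixed = TRUE to grep. This is a sketch only: the function name label_matches and the example inputs are made up. As in the loop above, a later pattern overwrites an earlier one when both occur in the same string.

```r
# Hypothetical wrapper around the loop above
label_matches <- function(strings, patterns) {
  out <- rep(NA_character_, length(strings))
  for (p in patterns) {
    # fixed = TRUE treats p as a literal string, not a regular expression
    out[grep(p, strings, fixed = TRUE)] <- p
  }
  out
}

label_matches(c("A.A, C", "AXA, C", "S, B, W"), c("A.A", "B"))
# [1] "A.A" NA    "B"
```

Without fixed = TRUE the pattern "A.A" would also match "AXA", because "." means "any character" in a regular expression.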
IceCreamToucan
0

If you want to end up with a list, you could use this:

library(magrittr)  # provides %>%

x <- list(
  c("A", "C", "D"),
  c("P", "O", "E"),
  c("W", "E", "W"),
  c("S", "B", "W")
)

myFunction <- function(x) {
  # collapse one observation into a single string, e.g. "ACD"
  x1 <- paste0(x, collapse = "")
  # str_extract() already returns NA when nothing matches,
  # so no separate str_detect() check is needed
  stringr::str_extract(x1, "A|B")
}

x %>% purrr::map(~ myFunction(.))
TBT8