R strsplit and filtering based on sub-indices matching in the logical vectors

Question

I want to apply strsplit in such a way that if the same pair of names appears both joined with & (e.g. (NINA & SAM)) and joined with | (e.g. (NINA | SAM)), only the version with & is kept.

Below are two possible cases. The number of parenthesized groups in these vectors (vec1, vec2) may vary among actual cases.

Case 1

> vec1
[1] "((PAUL & SAM) | (PAUL | SAM) | (NINA & SAM) | (NINA | SAM) | (NINA & PAUL) | (NINA | PAUL))"
> vec2
[1] "((PAUL | SAM) & (PAUL & SAM) & (NINA | SAM) & (NINA | PAUL) & (NINA & PAUL) & (NINA & SAM))"

Case 2

> vec1
[1] "((PAUL | SAM) | (PAUL & SAM) | (!NINA & SAM) | (!NINA & PAUL))"
> vec2
[1] "((PAUL | SAM) & (PAUL & SAM) & (!NINA & SAM) & (!NINA & PAUL))"

The expected outputs are:

Case 1

> vec1
[1] "((PAUL & SAM) | (NINA & SAM) | (NINA & PAUL))"
> vec2
[1] "((PAUL & SAM) & (NINA & PAUL) & (NINA & SAM))"

Case 2

> vec1
[1] "((PAUL & SAM) | (!NINA & SAM) | (!NINA & PAUL))"
> vec2
[1] "((PAUL & SAM) & (!NINA & SAM) & (!NINA & PAUL))"

What I have tried so far:

My idea was to first remove the (( and )) from the start and end of the string, then split vec1 on ") | (" and vec2 on ") & (". Then split each resulting element further on " | " or " & " and check whether its two names match those of any other element; if they do, keep the element that has &. Finally, put everything back together. I have limited knowledge of R and was unable to implement what I have in mind. Any help will be much appreciated!
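The splitting step described above could be sketched as follows. This is only a minimal illustration: `split_groups` is a hypothetical helper name, and it assumes every input starts with "((", ends with "))", and uses a single top-level separator.

```r
# Hypothetical helper: break one expression string into its groups.
split_groups <- function(x, sep) {
  inner <- sub("^\\(\\(", "", x)       # drop the leading "(("
  inner <- sub("\\)\\)$", "", inner)   # drop the trailing "))"
  # split on the literal separator, e.g. ") | (" when sep is "|"
  strsplit(inner, paste0(") ", sep, " ("), fixed = TRUE)[[1]]
}

vec1 <- "((PAUL | SAM) | (PAUL & SAM) | (!NINA & SAM) | (!NINA & PAUL))"
groups <- split_groups(vec1, "|")   # one element per parenthesized group

# names inside each group, split on " | " or " & "
name_list <- strsplit(groups, " [|&] ")
```

Here `groups` holds the four group strings without their parentheses, and `name_list` holds the names of each group, ready to be compared across groups.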

Rui Barradas · Answer 1 · 2018-04-27T11:56:29.637

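I believe the following does what you want.
It is not very pretty, and note that it always rejoins the kept groups with " | ", so for vec2 the result uses | between the groups where the expected output uses &.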

keepAmpersand <- function(x) {
    y <- sub("\\(\\(", "(", x)  # reduce the first "((" to "("
    y <- sub("\\)\\)", ")", y)  # reduce the first "))" to ")"

    # wrap each separator, ") | (" or ") & (", in '#' marks,
    # one '#' on each side, giving ")#) | (#(" or ")#) & (#("
    y <- gsub("(\\) \\| \\(|\\) & \\()", ")#\\1#(", y)

    # now use that special pattern, '# five chars #', to split
    y <- unlist(strsplit(y, "#.{5}#"))

    # keep the ones with the ampersand or with just one name
    y <- grep("&|\\([[:alpha:]]+\\)", y, value = TRUE)
    y <- paste0("(", paste(y, collapse = " | "), ")")    # reassemble
    y
}
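To make the marker trick easier to follow, here is a step-by-step trace of the same calls on a tiny input. The input string with the names A and B is made up for illustration and is not one of the question's cases.

```r
x <- "((A & B) | (A | B))"

y <- sub("\\(\\(", "(", x)   # "(A & B) | (A | B))"
y <- sub("\\)\\)", ")", y)   # "(A & B) | (A | B)"

# the separator ") | (" is wrapped in '#' marks
marked <- gsub("(\\) \\| \\(|\\) & \\()", ")#\\1#(", y)
# marked is now "(A & B)#) | (#(A | B)"

# splitting on '#', five characters, '#' leaves the intact groups
parts <- unlist(strsplit(marked, "#.{5}#"))
# parts is c("(A & B)", "(A | B)")

# only groups containing '&' (or a single name) survive the grep
kept <- grep("&|\\([[:alpha:]]+\\)", parts, value = TRUE)
```

The `|` inside "(A | B)" is not touched by the `gsub` call because the pattern only matches an operator sitting between ") " and " (".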

Now apply the function to each of the cases.

Case 1

vec1 <-
"((PAUL & SAM) | (PAUL | SAM) | (NINA & SAM) | (NINA | SAM) | (NINA & PAUL) | (NINA | PAUL))"
vec2 <-
"((PAUL | SAM) & (PAUL & SAM) & (NINA | SAM) & (NINA | PAUL) & (NINA & PAUL) & (NINA & SAM))"


keepAmpersand(vec1)
#[1] "((PAUL & SAM) | (NINA & SAM) | (NINA & PAUL))"

keepAmpersand(vec2)
#[1] "((PAUL & SAM) | (NINA & PAUL) | (NINA & SAM))"

Case 2

vec1 <-
"((PAUL | SAM) | (PAUL & SAM) | (!NINA & SAM) | (!NINA & PAUL))"
vec2 <- 
"((PAUL | SAM) & (PAUL & SAM) & (!NINA & SAM) & (!NINA & PAUL))"


keepAmpersand(vec1)
#[1] "((PAUL & SAM) | (!NINA & SAM) | (!NINA & PAUL))"

keepAmpersand(vec2)
#[1] "((PAUL & SAM) | (!NINA & SAM) | (!NINA & PAUL))"

Case 3: a group with just one name between the parentheses.

vec3 <-
"((PAUL | SAM) & (PAUL & SAM) & (NINA | SAM) & (NINA | PAUL) & (NINA & PAUL) & (NINA))"

keepAmpersand(vec3)
#[1] "((PAUL & SAM) | (NINA & PAUL) | (NINA))"

Thanks a lot for taking the time to help me. The final output is missing the link between the brackets, e.g. `"(PAUL & SAM)" "(!NINA & SAM)"` should be `"(PAUL & SAM)" | "(!NINA & SAM)"` for the Case 2 `vec1`. Secondly, the extra opening and closing brackets are also missing. Can you please comment the code so that I can easily understand what every line is doing? Thanks a lot again! — Newbie, Apr 27 '18 at 09:15
@Newbie See if this is it. Note that it is just `"(PAUL & SAM) | (!NINA & SAM)"` without the quotes near the bar `|`. — Rui Barradas, Apr 27 '18 at 09:42
Thanks for the help. Just one little thing: it does not work when a pair of parentheses contains only one name, e.g. `"((PAUL | SAM) & (PAUL & SAM) & (NINA | SAM) & (NINA | PAUL) & (NINA & PAUL) & (NINA))"`. In this case it does not report the last `(NINA)`. I am sorry I did not put such a constraint in the input data sample, but I do have input like that. — Newbie, Apr 27 '18 at 10:20
I'm sorry if I am causing inconvenience for you, but in Case 3 `(NINA | SAM)` is dropped although it should not be, as there is no other group with these names. Can you please explain it? Thanks! — Newbie, Apr 27 '18 at 13:01
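
The last comment points out that a `|` group should be dropped only when a `&` group with the same names exists. A minimal sketch of such a pair-aware variant follows. It is not part of the original answer: `keepMatchedAmpersand` is a hypothetical name, and the sketch assumes each input is wrapped in "((" and "))", uses one top-level separator throughout, and treats names as matching regardless of their order inside a group.

```r
# Hypothetical pair-aware variant; not the answer's keepAmpersand().
keepMatchedAmpersand <- function(x) {
  inner <- sub("^\\(\\(", "", x)       # drop the leading "(("
  inner <- sub("\\)\\)$", "", inner)   # drop the trailing "))"

  # first top-level separator found, either ") | (" or ") & ("
  sep <- regmatches(inner, regexpr("\\) [|&] \\(", inner))
  if (length(sep) == 0) return(x)      # a single group: nothing to compare

  groups <- strsplit(inner, sep, fixed = TRUE)[[1]]

  # one key per group: its names, sorted and pasted together
  key <- vapply(strsplit(groups, " [|&] "),
                function(n) paste(sort(n), collapse = ","),
                character(1))

  is_or    <- grepl("|", groups, fixed = TRUE)
  and_keys <- key[grepl("&", groups, fixed = TRUE)]

  # drop a '|' group only if a '&' group has the same names
  keep <- !(is_or & key %in% and_keys)

  # rejoin with the original separator and restore the outer brackets
  paste0("((", paste(groups[keep], collapse = sep), "))")
}

vec3 <- "((PAUL | SAM) & (PAUL & SAM) & (NINA | SAM) & (NINA | PAUL) & (NINA & PAUL) & (NINA))"
keepMatchedAmpersand(vec3)
#[1] "((PAUL & SAM) & (NINA | SAM) & (NINA & PAUL) & (NINA))"
```

Because the separator is taken from the input, vec2-style strings are rejoined with `&` and vec1-style strings with `|`. Groups with a single name, and `|` groups without a matching `&` group, are left in place.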
