I'm asking this question more out of curiosity. I was able to achieve my desired results using the filter() function but I'm interested in the explanation for the scenario below.

I wanted to use filter() to filter out values with multiple conditions using the != operator. I first tried using OR "|" but it wasn't correctly filtering out the values. It instead returned all the data back seemingly unfiltered. However, it worked when I used "&" instead (see below).

Ex.

data %>% 
  filter(SampleTypeName != "Grab" & 
         SampleTypeName != "Composite" & 
         SampleTypeName != "Integrated" & 
         SampleTypeName != "Not Applicable")

When I wanted to basically do the opposite, I filtered for values equal to the same set of strings above. I intuitively thought using "&" was also the solution. It instead returned all the data back seemingly unfiltered as well. Turns out, to achieve my desired results, I had to use "|" instead.

Ex.

data %>% 
  filter(SampleTypeName == "Grab" | 
         SampleTypeName == "Composite" | 
         SampleTypeName == "Integrated" |
         SampleTypeName == "Not Applicable")

Why is this the case? I would appreciate both a semi-in-depth explanation and an explanation like I'm five :)

Thanks

neilfws
nps-randy
    I believe the syntax you want is: `data %>% filter(!SampleTypeName %in% c("Grab", "Composite", "Integrated", "Not Applicable"))` – Jon Spring Feb 22 '23 at 23:39

2 Answers

The first thing to know about dplyr::filter is that multiple expressions can be combined by a comma, which is the same as using "&". For example:

data %>% 
  filter(SampleTypeName != "Grab", SampleTypeName != "Composite")

However as indicated in the comment from @jon-spring, the better way to test for multiple string values is to use %in%. So your first example becomes:

data %>% 
  filter(!SampleTypeName %in% c("Grab", "Composite", "Integrated", "Not Applicable"))

For the second example, simply remove the negation:

data %>% 
  filter(SampleTypeName %in% c("Grab", "Composite", "Integrated", "Not Applicable"))
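One caveat when swapping `!=` chains for `%in%`: the two forms treat missing values differently. The sketch below uses a hypothetical toy vector `x` (not the asker's data) and base R only. It shows the logical vectors that filter() would receive; filter() keeps only rows where the condition is TRUE, so an NA condition drops the row.

```r
# Hypothetical toy values standing in for SampleTypeName, including one NA
x <- c("Grab", "Other", NA)
targets <- c("Grab", "Composite")

# Chained != comparisons: any comparison with NA yields NA
cond_neq <- x != "Grab" & x != "Composite"

# %in% never returns NA, so the negation is TRUE for the missing value
cond_in <- !x %in% targets

print(cond_neq)  # FALSE  TRUE    NA
print(cond_in)   # FALSE  TRUE  TRUE
```

So with the `!=` version the NA row would be dropped, while with the `!... %in%` version it would be kept. If the column has no missing values, the two give the same result.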

As to why "|" did not work as expected in the first example: every row is evaluated against the combined condition, and an OR is true whenever at least one of its parts is true. Let's say your variable x can take values of "a" or "b". Now ask the question: "is x not equal to a OR is x not equal to b?" The first part of that question keeps rows where x is "b", and the second part keeps rows where x is "a". In other words, all rows are kept, as you observed.
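To see this concretely, here is a minimal base R sketch using a hypothetical three-value vector `x` in place of SampleTypeName. It compares the condition built with "|" against the one built with "&".

```r
# Hypothetical toy values standing in for SampleTypeName
x <- c("Grab", "Composite", "Other")

# OR of two negations: every value differs from at least one of the strings
keep_or <- x != "Grab" | x != "Composite"

# AND of two negations: only values that differ from both strings
keep_and <- x != "Grab" & x != "Composite"

print(keep_or)   # TRUE TRUE TRUE
print(keep_and)  # FALSE FALSE TRUE
```

Since `keep_or` is TRUE everywhere, passing that condition to filter() returns every row, which matches the "seemingly unfiltered" result in the question.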

neilfws

This is not a quirk of dplyr; it is an application of De Morgan's laws in propositional logic, one of which states that ¬A∧¬B ⇔ ¬(A∨B). In plain English: (not A) and (not B) is the same as not (A or B). In R terms, this means that !(a & b) is equivalent to !a | !b and, conversely, !(a | b) is equivalent to !a & !b. Check out this question for a great, easy-to-understand explanation.

In your specific case, you are combining lots of negations with &, which looks like !A & !B & !C & !D. By De Morgan's law, we know that this is equal to !(A | B | C | D). To get the negation of this, we simply remove the !, so we are left with A | B | C | D, which is why the opposite filter needs | rather than &.
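Both identities can be checked in base R by enumerating every combination of two logical values. This is only a sketch; the vectors `a` and `b` are illustrative and not part of the original question.

```r
# All four combinations of two logical values
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

# identical() is used instead of `==` because in R `!` has lower
# precedence than `==`, so `!(a & b) == (!a | !b)` would not parse
# as a comparison of the two sides.
law_and <- identical(!(a & b), !a | !b)   # not (a and b) vs. (not a) or (not b)
law_or  <- identical(!(a | b), !a & !b)   # not (a or b)  vs. (not a) and (not b)

print(law_and)  # TRUE
print(law_or)   # TRUE
```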

mfg3z0