How to use an "or" conditional in an n_distinct function, in dplyr?

Question

Suppose we start with this data frame:

mydat <- 
  data.frame(
    ID = c(115,115,115,88,88,88,100,100),
    Period = c(1, 2, 3, 1, 2, 3, 1, 2),
    Status_1 = c(1,2,1,1,2,3,2,1),
    Status_2 = c("Open","Open","Terminus","Open","Open","Closed","Open","Open")
  )

> mydat
   ID Period Status_1 Status_2
1 115      1        1     Open
2 115      2        2     Open
3 115      3        1 Terminus
4  88      1        1     Open
5  88      2        2     Open
6  88      3        3   Closed
7 100      1        2     Open
8 100      2        1     Open

Next, we run the following dplyr grouping to sum the number of instances by Period and Status_1, where Status_2 = "Open":

mydat %>%
  group_by(Period,Status_1) %>%
  summarize(StatusCount = n_distinct(ID[Status_2 == "Open"]))

  Period Status_1 StatusCount
   <dbl>    <dbl>       <int>
1      1        1           2
2      1        2           1
3      2        1           1
4      2        2           2
5      3        1           0
6      3        3           0

I've been trying to expand the above n_distinct() function to also include Status_2 = "Terminus" (in addition to the "Open" in the above code). I've tried various iterations of "or" conditions, and summing tricks, with no luck yet. Any ideas how to do this?

The result, including the Status_2 = "Terminus", would look like this:

Period Status_1 StatusCount
   <dbl>    <dbl>       <int>
1      1        1           2
2      1        2           1
3      2        1           1
4      2        2           2
5      3        1           1
6      3        3           0

`n_distinct(ID[Status_2 %in% c("Open", "Terminus")])` – r2evans Jan 10 '22 at 13:15 — r2evans, Jan 10 '22 at 13:15

score 1 · Accepted Answer · answered Jan 10 '22 at 13:12

This might work for you. I added a conditional/logical or in the selection.

mydat %>% 
      group_by(Period,Status_1) %>% 
      summarize(StatusCount = n_distinct(ID[Status_2 == "Open"|Status_2 == "Terminus"])) %>% 
      ungroup()
`summarise()` has grouped output by 'Period'. You can override using the `.groups` argument.
# A tibble: 6 x 3
  Period Status_1 StatusCount
   <dbl>    <dbl>       <int>
1      1        1           2
2      1        2           1
3      2        1           1
4      2        2           2
5      3        1           1
6      3        3           0

Thank you! I should have been able to solve that, I tried your solution (or something close) before posting but I must have made some other mistake or typo — Curious Jorge - user9788072, Jan 10 '22 at 13:30

score 1 · Answer 2 · answered Jan 10 '22 at 13:17

df <- 
  data.frame(
    ID = c(115,115,115,88,88,88,100,100),
    Period = c(1, 2, 3, 1, 2, 3, 1, 2),
    Status_1 = c(1,2,1,1,2,3,2,1),
    Status_2 = c("Open","Open","Terminus","Open","Open","Closed","Open","Open")
  )

library(tidyverse)
df %>%
  group_by(Period, Status_1) %>%
  summarize(StatusCount = n_distinct(ID[Status_2 %in% c("Terminus", "Open")]), .groups = "drop")
#> # A tibble: 6 x 3
#>   Period Status_1 StatusCount
#>    <dbl>    <dbl>       <int>
#> 1      1        1           2
#> 2      1        2           1
#> 3      2        1           1
#> 4      2        2           2
#> 5      3        1           1
#> 6      3        3           0

^{Created on 2022-01-10 by the reprex package (v2.0.1)}

How to use an "or" conditional in an n_distinct function, in dplyr?

2 Answers2