rowSums() to count number of both non-missing and unique values

Question

Let's say I have this dataframe

> df
   mr_daterd mr_daterd_fu1 mr_daterd_fu2
1 2018-03-05    2018-03-05          <NA>
2 2019-05-04          <NA>    2020-03-05
3 2020-01-03    2020-06-06    2021-04-02

Each row represents a patient and the dates represent MRI scans. I want to count the number of MRI scans per row, i.e. rowSums() of non-missing values. However, some patients had several scans on the same date. Therefore, the count should only include unique non-missing values.

For example, using

library(dplyr)

df_new <- df %>%
  mutate(
    n_mri = rowSums(!is.na(select(., contains('mr_daterd'))))
  )

Gives

> df_new
   mr_daterd mr_daterd_fu1 mr_daterd_fu2 n_mri
1 2018-03-05    2018-03-05          <NA>     2
2 2019-05-04          <NA>    2020-03-05     2
3 2020-01-03    2020-06-06    2021-04-02     3

The n_mri for row 1 should be 1, and not 2, because 2018-03-05 is duplicated in mr_daterd and mr_daterd_fu1.

Expected output:

> df_new
   mr_daterd mr_daterd_fu1 mr_daterd_fu2 n_mri
1 2018-03-05    2018-03-05          <NA>     1
2 2019-05-04          <NA>    2020-03-05     2
3 2020-01-03    2020-06-06    2021-04-02     3

Data

df <- structure(list(mr_daterd = structure(c(17595, 18020, 18264), class = "Date"), 
    mr_daterd_fu1 = structure(c(17595, NA, 18419), class = "Date"), 
    mr_daterd_fu2 = structure(c(NA, 18326, 18719), class = "Date")), class = "data.frame", row.names = c(NA, 
-3L))

score 3 · Accepted Answer · edited Jul 16 '22 at 07:17

dplyr solution using n_distinct and c_across.

df %>% 
  rowwise() %>% 
  mutate(n_mri = n_distinct(
    c_across(contains('mr_daterd')), 
    na.rm = TRUE)) %>%
  ungroup()


# A tibble: 3 × 4
  mr_daterd  mr_daterd_fu1 mr_daterd_fu2 n_mri
  <date>     <date>        <date>        <int>
1 2018-03-05 2018-03-05    NA                1
2 2019-05-04 NA            2020-03-05        2
3 2020-01-03 2020-06-06    2021-04-02        3

answered Jul 16 '22 at 07:04 by Adam Quek
edited Jul 16 '22 at 07:17 by Darren Tsai

`length(unique(x))` is equivalent to `n_distinct(x)` – Darren Tsai Jul 16 '22 at 07:06
Thanks for the improvement! I get to learn something new too! – Adam Quek Jul 16 '22 at 07:08
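
To make the comment above concrete, here is a minimal sketch using the first row of the question's data. The vector `x` is made up for illustration. It suggests that the two calls agree, and that in both cases a missing value is counted as a value of its own unless it is dropped explicitly.

```r
library(dplyr)

# First row of the question's data: one date repeated, one missing value
x <- as.Date(c("2018-03-05", "2018-03-05", NA))

length(unique(x))               # 2: NA is counted as a distinct value
n_distinct(x)                   # 2: same count as length(unique(x))

length(unique(x[!is.na(x)]))    # 1: NA dropped before counting
n_distinct(x, na.rm = TRUE)     # 1: dplyr shorthand for the line above
```

This is why the accepted answer passes `na.rm = TRUE`: without it, row 1 would be counted as 2.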

score 2 · Answer 2 · answered Jul 16 '22 at 07:03

With base R, you could use apply():

apply(df, 1, \(x) sum(!is.na(unique(x))))

# [1] 1 2 3

answered Jul 16 '22 at 07:03 by Darren Tsai

How can I integrate this into my `mutate`-pipe? – cmirian Jul 16 '22 at 07:09
@cmirian something like `df %>% mutate(n_mri = apply(., 1, \(x) sum(!is.na(unique(x)))))` – Darren Tsai Jul 16 '22 at 07:10
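
One caveat with `apply(df, 1, ...)` is that it runs over every column, so any non-date column would be counted too. The sketch below restricts the count to the scan-date columns first. The `patient_id` column is a hypothetical addition, not part of the question's data, and `function(x)` is used instead of `\(x)` so the code also runs on R versions before 4.1.

```r
# Data from the question, plus one hypothetical non-date column
df <- data.frame(
  mr_daterd     = as.Date(c("2018-03-05", "2019-05-04", "2020-01-03")),
  mr_daterd_fu1 = as.Date(c("2018-03-05", NA, "2020-06-06")),
  mr_daterd_fu2 = as.Date(c(NA, "2020-03-05", "2021-04-02")),
  patient_id    = c("a", "b", "c")  # hypothetical extra column
)

# Keep only the scan-date columns before counting
date_cols <- grep("mr_daterd", names(df), value = TRUE)

# Per row: drop missing values, then count the distinct dates that remain
df$n_mri <- apply(df[date_cols], 1, function(x) length(unique(x[!is.na(x)])))

df$n_mri
# [1] 1 2 3
```

Inside a pipe, the same idea would presumably be written by passing `select(., contains('mr_daterd'))` to `apply()` instead of `.`.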

PaulS · Answer 3 · 2022-07-16T09:27:51.833

Another possible solution, based on purrr::pmap:

library(tidyverse)

df %>% 
  mutate(n_mri = pmap_int(., ~ n_distinct(c(...), na.rm = TRUE)))

#>    mr_daterd mr_daterd_fu1 mr_daterd_fu2 n_mri
#> 1 2018-03-05    2018-03-05          <NA>     1
#> 2 2019-05-04          <NA>    2020-03-05     2
#> 3 2020-01-03    2020-06-06    2021-04-02     3

answered Jul 16 '22 at 08:47 by PaulS · edited Jul 16 '22 at 09:27

It should be `pmap_int()`, otherwise `n_mri` will be a list-column. – Darren Tsai Jul 16 '22 at 08:57
Thanks, @DarrenTsai, for commenting on my solution! Could you please tell me what the practical difference is between a list-column and a vector of integers? – PaulS Jul 16 '22 at 09:01
Got it now, @DarrenTsai! That is something very subtle: for instance, if we try to subtract a number from a list-column of integers, we get `numeric(0)`, whereas with an integer vector we get the correct values. Thanks a lot for calling my attention to that! – PaulS Jul 16 '22 at 09:27
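
The difference between `pmap()` and `pmap_int()` discussed in these comments can be checked directly. This is a minimal sketch outside of `mutate()`. `count_dates()` is a hypothetical helper written for this illustration, not code from the answer, and it uses base R instead of `n_distinct()` so that only purrr is needed.

```r
library(purrr)

# Data from the question
df <- data.frame(
  mr_daterd     = as.Date(c("2018-03-05", "2019-05-04", "2020-01-03")),
  mr_daterd_fu1 = as.Date(c("2018-03-05", NA, "2020-06-06")),
  mr_daterd_fu2 = as.Date(c(NA, "2020-03-05", "2021-04-02"))
)

# Hypothetical helper: number of distinct non-missing values in one row
count_dates <- function(...) {
  x <- c(...)
  length(unique(x[!is.na(x)]))
}

n_list <- pmap(df, count_dates)      # list with one integer per row
n_int  <- pmap_int(df, count_dates)  # plain integer vector

is.list(n_list)    # TRUE
is.integer(n_int)  # TRUE

n_int - 1L         # element-wise arithmetic works on the integer vector

# The same arithmetic on the list raises an error in a plain R session,
# so the message is captured here instead of stopping the script
tryCatch(n_list - 1L, error = function(e) conditionMessage(e))
```

So a list-column prints much like an integer column, but ordinary arithmetic, sorting and summarising on it need an extra unlisting step.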

score 0 · Answer 4 · answered Jul 16 '22 at 16:49

An option with collapse

library(collapse)
dapply(df, MARGIN = 1, FUN = fndistinct)

# [1] 1 2 3

answered Jul 16 '22 at 16:49 by akrun
