1

Let's say I have this dataframe

> df
   mr_daterd mr_daterd_fu1 mr_daterd_fu2
1 2018-03-05    2018-03-05          <NA>
2 2019-05-04          <NA>    2020-03-05
3 2020-01-03    2020-06-06    2021-04-02

Each row represents a patient and the dates represent MRI scans. I want to count the number of MRI scans per row, i.e. rowSums() of non-missing values. However, some patients had several scans on the same date. Therefore, the count should only include unique non-missing values.

E.g., using

df_new <- df %>%
  mutate(
    n_mri = rowSums(!is.na(select(., contains('mr_daterd'))))
  )

Gives

> df_new
   mr_daterd mr_daterd_fu1 mr_daterd_fu2 n_mri
1 2018-03-05    2018-03-05          <NA>     2
2 2019-05-04          <NA>    2020-03-05     2
3 2020-01-03    2020-06-06    2021-04-02     3

The n_mri for row 1 should be 1, not 2, because 2018-03-05 appears in both mr_daterd and mr_daterd_fu1.

Expected output:

> df_new
   mr_daterd mr_daterd_fu1 mr_daterd_fu2 n_mri
1 2018-03-05    2018-03-05          <NA>     1
2 2019-05-04          <NA>    2020-03-05     2
3 2020-01-03    2020-06-06    2021-04-02     3

Data

df <- structure(list(mr_daterd = structure(c(17595, 18020, 18264), class = "Date"), 
    mr_daterd_fu1 = structure(c(17595, NA, 18419), class = "Date"), 
    mr_daterd_fu2 = structure(c(NA, 18326, 18719), class = "Date")), class = "data.frame", row.names = c(NA, 
-3L))
cmirian

4 Answers

3

dplyr solution using n_distinct and c_across.

library(dplyr)

df %>%
  rowwise() %>%
  mutate(n_mri = n_distinct(
    c_across(contains('mr_daterd')),
    na.rm = TRUE)) %>%
  ungroup()


# A tibble: 3 × 4
  mr_daterd  mr_daterd_fu1 mr_daterd_fu2 n_mri
  <date>     <date>        <date>        <int>
1 2018-03-05 2018-03-05    NA                1
2 2019-05-04 NA            2020-03-05        2
3 2020-01-03 2020-06-06    2021-04-02        3
Darren Tsai
Adam Quek
2

With base R, you could use apply():

apply(df, 1, \(x) sum(!is.na(unique(x))))

# [1] 1 2 3
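
If the real data frame has other columns besides the scan dates, the same idea can be limited to the date columns. This is only a sketch under that assumption: it rebuilds the question's data, picks the columns by the `mr_daterd` prefix, and drops the NAs before counting distinct values. The name `date_cols` is just for illustration.

```r
# Rebuild the example data from the question
df <- data.frame(
  mr_daterd     = as.Date(c("2018-03-05", "2019-05-04", "2020-01-03")),
  mr_daterd_fu1 = as.Date(c("2018-03-05", NA, "2020-06-06")),
  mr_daterd_fu2 = as.Date(c(NA, "2020-03-05", "2021-04-02"))
)

# Select only the scan-date columns by name prefix
date_cols <- grep("^mr_daterd", names(df), value = TRUE)

# apply() coerces each row to character; drop NAs, then count distinct values
n_mri <- apply(df[date_cols], 1, function(x) length(unique(x[!is.na(x)])))

n_mri  # 1, 2, 3 for the three example rows
```

Using `function(x)` instead of `\(x)` also keeps this working on R versions older than 4.1.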
Darren Tsai
1

Another possible solution, based on purrr::pmap:

library(tidyverse)

df %>% 
  mutate(n_mri = pmap_int(., ~ n_distinct(c(...), na.rm = TRUE)))

#>    mr_daterd mr_daterd_fu1 mr_daterd_fu2 n_mri
#> 1 2018-03-05    2018-03-05          <NA>     1
#> 2 2019-05-04          <NA>    2020-03-05     2
#> 3 2020-01-03    2020-06-06    2021-04-02     3
PaulS
  • It should be `pmap_int()`, otherwise `n_mri` will be a list-column. – Darren Tsai Jul 16 '22 at 08:57
  • Thanks, @DarrenTsai, for commenting on my solution! Could you please tell me what the practical difference is between a list-column and a vector of integers? – PaulS Jul 16 '22 at 09:01
  • Got it now, @DarrenTsai! That is something very subtle: for instance, if we try to subtract a number from a list-column of integers, we get `numeric(0)`, while without a list-column, we get the correct values. Thanks a lot for calling my attention to that! – PaulS Jul 16 '22 at 09:27
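
To make the list versus integer-vector point from the comments concrete, here is a rough base R sketch. It does not use purrr: `Map()` stands in for a list-returning mapper and `mapply()` for one that simplifies to an atomic vector, which is only an analogy to `pmap()` and `pmap_int()`. The data frame and the helper `count_distinct()` are made up for illustration.

```r
# Hypothetical two-column example, not the question's data
df <- data.frame(a = c(1, 1, 3), b = c(1, 2, NA))

# Count the distinct non-missing values among the arguments
count_distinct <- function(...) {
  x <- c(...)
  length(unique(x[!is.na(x)]))
}

# Map() returns a list with one element per row
as_list <- Map(count_distinct, df$a, df$b)

# mapply() simplifies the same results to an integer vector
as_int <- mapply(count_distinct, df$a, df$b)

is.list(as_list)    # TRUE
is.integer(as_int)  # TRUE

# Arithmetic works element-wise on the integer vector
as_int - 1L

# The same arithmetic on the bare list raises an error
list_result <- try(as_list - 1L, silent = TRUE)
inherits(list_result, "try-error")  # TRUE
```

So an integer column can go straight into arithmetic, comparisons, or `sum()`, while a list needs unpacking first.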
0

An option with collapse

library(collapse)
dapply(df, MARGIN = 1, FUN = fndistinct)
# [1] 1 2 3
akrun