
I have the following dataframe:

user_id <- c(97, 97, 97, 97, 96, 95, 95, 94, 94)
event_id <- c(42, 15, 43, 12, 44, 32, 38, 10, 11)
plan_id <- c(NA, 38, NA, NA, 30, NA, NA, 30, 25)
treatment_id <- c(NA, 20, NA, NA, NA, 28, 41, 17, 32)
system <- c(1, 1, 1, 1, NA, 2, 2, NA, NA)

df <- data.frame(user_id, event_id, plan_id, treatment_id, system)

I would like to count, for each column, the number of distinct user_id values among the rows where that column is not NA. The output I am hoping for is:

      user_id   event_id    plan_id   treatment_id  system
  1   4         4           3         4             2

I tried to leverage mutate_all, but that was unsuccessful because my data frame is too large. In other functions, I've used the following two lines of code to get the non-NA count and the distinct count for each column:

colSums(!is.na(df))
apply(df, 2, function(x) length(unique(x)))

Optimally, I would like to combine the two with an ifelse to minimize the mutations, as this will ultimately be thrown into a function to be applied with a number of other summary statistics to a list of data frames.
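
For context, here is a minimal sketch of the shape I'm after, built from the two lines above (the function name and the list contents are placeholders, not working code for the distinct-user count itself):

column_stats <- function(d) {
    list(
        non_na   = colSums(!is.na(d)),             # non-NA count per column
        distinct = sapply(d, function(x) length(unique(x)))  # distinct values per column
    )
}

lapply(list(df1 = df, df2 = df), column_stats)     # applied over a list of data frames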

I have tried a brute-force method, where I make the values 1 if not NA and 0 otherwise, and then copy the user_id into each column wherever there is a 1. I can then use the count-distinct line from above to get my output. However, I get the wrong values when copying into the other columns, and the number of adjustments is suboptimal. See code:

binary <- cbind(df$user_id, !is.na(df[, 2:ncol(df)]))
copied <- binary %>% replace(. > 0, binary[., 1])

I'd greatly appreciate your help.

4 Answers


1: Base

sapply(df, function(x){
    length(unique(df$user_id[!is.na(x)]))
})
#     user_id     event_id      plan_id treatment_id       system 
#           4            4            3            3            2 

2: Base

aggregate(user_id ~ ind, unique(na.omit(cbind(stack(df), df[1]))[-1]), length)
#           ind user_id
#1      user_id       4
#2     event_id       4
#3      plan_id       3
#4 treatment_id       3
#5       system       2
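
Unrolled step by step, with intermediate names that are mine rather than part of the answer:

long <- cbind(stack(df), df[1])                # long form: (values, ind) plus user_id
long <- na.omit(long)[-1]                      # drop NA values, then drop the values column
aggregate(user_id ~ ind, unique(long), length) # distinct user_id per original column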

3: tidyverse

library(dplyr)
library(tidyr)

df %>%
    mutate(key = user_id) %>%
    pivot_longer(!key) %>%
    filter(!is.na(value)) %>%
    group_by(name) %>%
    summarise(value = n_distinct(key)) %>%
    pivot_wider()
## A tibble: 1 x 5
#  event_id plan_id system treatment_id user_id
#     <int>   <int>  <int>        <int>   <int>
#1        4       3      2            3       4
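
One caveat: pivot_wider() returns the columns in the alphabetical order produced by group_by(name), not the original order. Appending a select (my addition, not part of the answer above) restores it:

df %>%
    mutate(key = user_id) %>%
    pivot_longer(!key) %>%
    filter(!is.na(value)) %>%
    group_by(name) %>%
    summarise(value = n_distinct(key)) %>%
    pivot_wider() %>%
    select(all_of(names(df)))   # restore the original column order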
– d.b

Thanks @dcarlson, I had misunderstood the question:

   apply(df, 2, function(x){length(unique(df[!is.na(x), 1]))})
– hello_friend
  • This does not match the expected results because it is the non-unique values in each column, not the non-unique user_id's in each column. Also there are only 3 unique ids in treatment_id not 4. This should work: `apply(df, 2, function(x){length(unique(df[!is.na(x), 1]))})` – dcarlson Dec 23 '20 at 23:44
  • @dcarlson Thanks I had misunderstood the question. – hello_friend Dec 23 '20 at 23:55
  • Unfortunately, I get the error: object of type 'closure' is not subsettable. d.b's solution works though – Austin Gilbert Dec 24 '20 at 17:13

A data.table option with uniqueN:

library(data.table)
setDT(df)[, lapply(.SD, function(x) uniqueN(user_id[!is.na(x)]))]
   user_id event_id plan_id treatment_id system
1:       4        4       3            3      2
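
Note that setDT() converts df to a data.table in place. If the original data.frame should stay untouched, as.data.table() makes a copy instead:

library(data.table)
as.data.table(df)[, lapply(.SD, function(x) uniqueN(user_id[!is.na(x)]))]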
– ThomasIsCoding

Using dplyr, you can use summarise with across():

library(dplyr)
df %>% summarise(across(.fns = ~ n_distinct(user_id[!is.na(.x)])))

#  user_id event_id plan_id treatment_id system
#1       4        4       3            3      2
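
Spelling out the column selection is equivalent, and recent dplyr releases expect .cols to be supplied explicitly rather than relying on the everything() default:

df %>% summarise(across(everything(), ~ n_distinct(user_id[!is.na(.x)])))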
– Ronak Shah