6

I want to use dplyr summarise to sum counts by groups. Specifically I want to remove NA values if not all summed values are NA, but if all summed values are NA, I want to display NA. For example:

name <- c("jack", "jack", "mary", "mary", "ellen", "ellen")
number <- c(1,2,1,NA,NA,NA)

df <- data.frame(name,number)

In this case I want the following result:

  • Jack = 3
  • Mary = 1
  • Ellen = NA

However if I set na.rm = F:

df %>% group_by(name) %>% summarise(number = sum(number, na.rm = F))

The result is:

  • Jack = 3
  • Mary = NA
  • Ellen = NA

And if i set na.rm = T:

df %>% group_by(name) %>% summarise(number = sum(number, na.rm = T))

The result is

  • Jack = 3
  • Mary = 1
  • Ellen = 0

How can I solve this so that the cases with numbers and NA's get a number as output, but the cases with only NA's get NA as output.

Tristan Bakx
  • 61
  • 1
  • 3

2 Answers2

6

We can have a if/else condition - if all the values in 'number are NA, then return NA or else get the sum

library(dplyr)
df %>% 
  group_by(name) %>% 
  summarise(number = if(all(is.na(number))) NA_real_ else sum(number, na.rm = TRUE))
akrun
  • 874,273
  • 37
  • 540
  • 662
6

I was struggling with the same thing, so I wrote a solution into the package hablar. Try:

library(hablar)

df %>% group_by(name) %>% 
  summarise(number = sum_(number))

which gives you:

# A tibble: 3 x 2
  name  number
  <fct>  <dbl>
1 ellen    NA 
2 jack      3.
3 mary      1.

not that the only syntax difference is sum_ which is a function that returns NA if all is NA, else removes NA and calcuules sum no-missing values.

davsjob
  • 1,882
  • 15
  • 10