Dplyr summarize with which.max and data with NA's

Question

I am working with a data set of changes over time and need to calculate the time at which the peak change occurs. I am running into a problem because some subjects have missing data (NA's).

Example:

library(dplyr)

Data <- structure(list(Subject = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L), .Label = c("1", "10", "11", "12", "13", "14", "16", 
"17", "18", "19", "2", "20", "21", "22", "23", "24", "25", "26", 
"27", "28", "29", "3", "31", "32", "4", "5", "7", "8", "9"), class = "factor"), 
Close = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 
2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 
1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L
), .Label = c("High Predictability", "Low Predictability"
), class = "factor"), SOA = structure(c(2L, 1L, 2L, 1L, 2L, 
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
1L, 2L, 1L, 2L, 1L), .Label = c("Long SOA", "Short SOA"), class = "factor"), 
Time = c(-66.68, -66.68, -66.68, -66.68, -33.34, -33.34, 
-33.34, -33.34, 0, 0, 0, 0, 33.34, 33.34, 33.34, 33.34, 66.68, 
66.68, 66.68, 66.68, -66.68, -66.68, -66.68, -66.68, -33.34, 
-33.34, -33.34, -33.34, 0, 0, 0, 0, 33.34, 33.34, 33.34, 
33.34, 66.68, 66.68, 66.68, 66.68), Pcent_Chng = c(0.12314, 
0.048254, -0.098007, 0.023216, 0.20327, 0.08338, -0.15157, 
0.030008, 0.26442, 0.12019, -0.22878, 0.035547, 0.31849, 
0.15488, -0.26887, 0.038992, 0.39489, 0.15112, -0.31185, 
0.02144, NA, 0.046474, NA, 0.17541, NA, 0.14975, NA, 0.3555, 
NA, -0.1736, NA, 0.72211, NA, -0.32201, NA, 1.0926, NA, -0.39551, 
0.72211, 1.4406)), class = "data.frame", row.names = c(NA, -40L
), .Names = c("Subject", "Close", "SOA", "Time", "Pcent_Chng"
))

I get an error with the following attempt:

Data %>%
group_by(Subject,Close,SOA) %>%
summarize(Peak_Pcent = max(Pcent_Chng), 
                    Peak_Latency = Time[which.max(Pcent_Chng)])

The error is:

Error in summarise_impl(.data, dots) : 
  Column `Peak_Latency` must be length 1 (a summary value), not 0

This seems to be due to the NA's, which are only in some SOA conditions. Using complete.cases() with my actual data is too aggressive and removes too much data.

Is there a workaround to ignore the NA's?

score 1 · Accepted Answer · answered Nov 09 '17 at 23:29

You have one group with Peak_Pcent all is NA, and the other group only with one Peak_Pcent. I think it is better to filter out the group with Peak_Pcent all is NA, and set na.rm = TRUE when using the max function.

Data %>%
  group_by(Subject,Close,SOA) %>%
  filter(!all(is.na(Pcent_Chng))) %>% # Filter out groups with Pcent_Chng all is NA
  summarize(Peak_Pcent = max(Pcent_Chng, na.rm = TRUE),  # Set na.rm = TRUE
            Peak_Latency = Time[which.max(Pcent_Chng)]) 

# # A tibble: 7 x 5
# # Groups:   Subject, Close [?]
# Subject               Close       SOA Peak_Pcent Peak_Latency
# <fctr>              <fctr>    <fctr>      <dbl>        <dbl>
# 1       1 High Predictability  Long SOA   0.154880        33.34
# 2       1 High Predictability Short SOA   0.394890        66.68
# 3       1  Low Predictability  Long SOA   0.038992        33.34
# 4       1  Low Predictability Short SOA  -0.098007       -66.68
# 5      14 High Predictability  Long SOA   0.149750       -33.34
# 6      14  Low Predictability  Long SOA   1.440600        66.68
# 7      14  Low Predictability Short SOA   0.722110        66.68

score 0 · Answer 2 · answered Nov 09 '17 at 23:22

This should do the trick:

Data %>% 
  group_by(Subject, Close, SOA) %>% 
  mutate(Peak_Pcent = max(Pcent_Chng)) %>% 
  arrange(Subject, Close, SOA) %>% 
  filter(Peak_Pcent == Pcent_Chng)

The output:

# A tibble: 6 x 6
# Groups:   Subject, Close, SOA [6]
  Subject               Close       SOA   Time Pcent_Chng Peak_Pcent
   <fctr>              <fctr>    <fctr>  <dbl>      <dbl>      <dbl>
1       1 High Predictability  Long SOA  33.34   0.154880   0.154880
2       1 High Predictability Short SOA  66.68   0.394890   0.394890
3       1  Low Predictability  Long SOA  33.34   0.038992   0.038992
4       1  Low Predictability Short SOA -66.68  -0.098007  -0.098007
5      14 High Predictability  Long SOA -33.34   0.149750   0.149750
6      14  Low Predictability  Long SOA  66.68   1.440600   1.440600

I added `summarize(Peak_Latency = Time[which.max(Peak_Pcent)])` to the end of the chain and it gives exactly what I need. Thanks! — JLC, Nov 09 '17 at 23:30

Dplyr summarize with which.max and data with NA's

2 Answers2