2

I want to find the non-missing value of each group associated with the largest index value, for many columns.

I have gotten fairly close by using summarize_all with which.max but I am not sure how to remove the NAs from each vector before I find the latest value. I read about using na.rm in summarize_all with functions like mean but not sure how to incorporate similar functionality without a built in function. I have tried na.omit but it doesnt provide the solution I'm looking for.

a <- head(iris, 10)
a$num <- 1:10
a$grp <- c("a","a","a","b","b","c","c","d","d","d")
a[10, "Species"] <- NA
a %>%
  group_by(grp) %>%
  summarize_all(funs(na.omit(.)[which.max(num)]))

grp   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   num
<chr>          <dbl>       <dbl>        <dbl>       <dbl> <fct>   <int>
1 a             4.70        3.20         1.30       0.200 setosa      3
2 b             5.00        3.60         1.40       0.200 setosa      5
3 c             4.60        3.40         1.40       0.300 setosa      7
4 d             4.90        3.10         1.50       0.100 NA         10

I expect all the values in the Species column to be setosa, however the last value is NA.

Hunter Clark
  • 181
  • 13

3 Answers3

1

if you use a data.table approach, you can try:

library (data.table)
a = data.table (a)
a [is.finite (Species), by = grp, .SD [which.max (num) ] ]
JVP
  • 309
  • 1
  • 11
1

Instead of looking at all num, we may look only at those where the corresponding variable is not NA:

a %>%
  group_by(grp) %>%
  summarize_all(funs(na.omit(.)[which.max(num[!is.na(.)])]))
# A tibble: 4 x 7
#   grp   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   num
#   <chr>        <dbl>       <dbl>        <dbl>       <dbl> <fct>   <int>
# 1 a              4.7         3.2          1.3         0.2 setosa      3
# 2 b              5           3.6          1.4         0.2 setosa      5
# 3 c              4.6         3.4          1.4         0.3 setosa      7
# 4 d              4.9         3.1          1.5         0.1 setosa     10
Julius Vainora
  • 47,421
  • 9
  • 90
  • 102
  • 1
    No need for `na.omit(.)`, `summarize_all(funs(.[which.max(num[!is.na(.)])]))` did the job. – A. Suliman Jan 04 '19 at 23:21
  • Thanks @Julius Vainora! very helpful. – Hunter Clark Jan 05 '19 at 16:25
  • I have encountered an interesting issue with this approach. I ran the code above on my data set and I had a couple records that still had NAs. After much research it seems to have something to do with the order of the rows. You can see what I mean with the below example. `a <- head(iris, 15) a$num <- 1:15 a$grp <- c("a","a","a","b","b","c","c","d","d","d","d","d","d","d","d") a[c(12, 13), "Petal.Width"] <- NA a <- a %>% arrange(c(1:11, 14, 13, 12, 15)) #a <- rbind(a, a[12:14, ]) #a[10, "Species"] <- NA a %>% group_by(grp) %>% summarize_all(funs(.[which.max(num[!is.na(.)])]))` – Hunter Clark Jan 08 '19 at 17:10
  • 1
    @HunterClark, I think it was a mistake to drop `na.omit` after all. What about `funs(na.omit(.)[which.max(num[!is.na(.)])])`? Does seem to work. Indeed dropping `na.omit` introduced mess.. – Julius Vainora Jan 08 '19 at 17:13
  • Thanks again @Julius. that seems to work. I dont understand why the issue came up though... – Hunter Clark Jan 08 '19 at 17:15
  • @HunterClark, `num[!is.na(.)]` is a subvector of `num` where the corresponding variable (`.`) is not `NA`. So, this subvector may be shorter than `num`. Then `which.max` returns an index in this potentially shorter subvector. Then doing `.[which.max(num[!is.na(.)])]` was wrong because the corresponding variable wasn't shortened before subsetting. `na.omit(.)` aligns it with `num[!is.na(.)]`. – Julius Vainora Jan 08 '19 at 17:18
  • @Julius, ahhhhhh. thanks much for the clarification. Would you like to adjust your answer or should I edit the main post to update the information? – Hunter Clark Jan 08 '19 at 17:20
  • @HunterClark, everything's great with your question and I've already erased the mistake from my answer. – Julius Vainora Jan 08 '19 at 17:22
1

You could also approach this a little differently and complete the NA case first:

library(tidyverse)

a %>% group_by(grp) %>% 
  fill(Species) %>% 
  filter(num == max(num))

tibble: 4 x 7
# Groups:   grp [4]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   num grp  
         <dbl>       <dbl>        <dbl>       <dbl> <fct>   <int> <chr>
1          4.7         3.2          1.3         0.2 setosa      3 a    
2          5           3.6          1.4         0.2 setosa      5 b    
3          4.6         3.4          1.4         0.3 setosa      7 c    
4          4.9         3.1          1.5         0.1 setosa     10 d 
Mako212
  • 6,787
  • 1
  • 18
  • 37
  • Thanks for the response Mako212. While your answer works great for the example case, in looking into fill it may cause difficulties if the index isn't sorted previously. Because of this I will opt for Julius' answer but I will keep fill in mind for the future. – Hunter Clark Jan 05 '19 at 16:24