Removing na from column during summarize_at and funs before performing function

Question

I have repeating dates, a column for returns, and then many other columns of other variables that I am trying to sort on and then extract the average return of the top quartile of the sorted variable constituents for each date. There are NA's and I want to ignore the NA's for each column individually when sorting.

The initial data looks like this

  date  return  a   b   c   d
    2/1/18  5   3   5   4   9
    2/1/18  6   2   1   NA  7
    2/1/18  5   NA  6   NA  5
    2/1/18  NA  1   NA  2   NA
    2/1/18  NA  NA  NA  1   NA
    2/2/18  NA  NA  2   NA  NA
    2/2/18  4   10  4   6   NA
    2/2/18  7   5   NA  2   NA
    2/2/18  8   7   7   9   NA
    2/3/18  NA  2   NA  NA  NA
    2/3/18  3   NA  6   5   8
    2/3/18  6   5   2   4   4
    2/3/18  5   8   8   1   9
    2/4/18  6   8   6   3   1
    2/4/18  5   2   5   9   10
    2/4/18  7   4   2   10  8

I want the end data to come out as follows

date    high a return   high b return   high c return   high d return
2/1/18  5   5   5   5
2/2/18  4   8   8   NA
2/3/18  8   5   3   5
2/4/18  6   6   7   5

I am trying to switch my code from performing the following a bunch of times for different variables.

High = df[!is.na(df$a),] %>%
    group_by(date) %>% 
    filter(a > quantile(a, .666)) %>%
    summarise(high_return = mean(return))

I have switched to this code but have been unable to replicate the way I dealt with na's in the previous code (i.e. the !is.na).

list <- c("a", "b", "c", "d")
High <- df %>%
    group_by(date) %>%
        summarize_at(vars(one_of(list)),
                     funs(HighReturn = mean(return[na.omit(.) > quantile((.), .666, na.rm = TRUE)]))

This does no produce the same results as doing the columns one by one. I have also tried removing the na.omit, adding na.omit to the right side "." and other combinations. Is there a way to produce !is.na for each column as it passes through the function?

Maurits Evers · Answer 1 · 2019-03-18T03:10:00.520

0

I assume that you made a mistake for date = 2/2/18, where high return c should be 8 instead of 9.

Based on above assumption being correct you can do the following

list %>%
    map(~df %>%
        group_by(date) %>%
        filter(!!sym(.x) > quantile(!!sym(.x), 0.666, na.rm = T)) %>%
        summarise(!!sym(paste0("high_return_", .x)) := mean(return))) %>%
    reduce(full_join)
## A tibble: 4 x 5
#  date   high_return_a high_return_b high_return_c high_return_d
#  <fct>          <dbl>         <dbl>         <dbl>         <dbl>
#1 2/1/18             5             5             5             5
#2 2/2/18             4             8             8            NA
#3 2/3/18             5             5             3             5
#4 2/4/18             6             6             7             5

PS. return is not a good column name in R, as return is the name of an internal R function.

Sample data

df <- read.table(text =
    "date  return  a   b   c   d
    2/1/18  5   3   5   4   9
    2/1/18  6   2   1   NA  7
    2/1/18  5   NA  6   NA  5
    2/1/18  NA  1   NA  2   NA
    2/1/18  NA  NA  NA  1   NA
    2/2/18  NA  NA  2   NA  NA
    2/2/18  4   10  4   6   NA
    2/2/18  7   5   NA  2   NA
    2/2/18  8   7   7   9   NA
    2/3/18  NA  2   NA  NA  NA
    2/3/18  3   NA  6   5   8
    2/3/18  6   5   2   4   4
    2/3/18  5   8   8   1   9
    2/4/18  6   8   6   3   1
    2/4/18  5   2   5   9   10
    2/4/18  7   4   2   10  8", header = T)


list <- c("a", "b", "c", "d")

edited Mar 18 '19 at 03:10

answered Mar 18 '19 at 03:03

Maurits Evers

49,617
4
47
68

Yes, I corrected the 2/2/2018 c column in the data. I will try this solution – Kskiaskd Mar 18 '19 at 03:06
I got error "Error in !sym(.x): invalid argument type" – Kskiaskd Mar 18 '19 at 03:15
@Kskiaskd I can't reproduce. This works on the sample data you provided using `dplyr_0.8.0.1` and `purrr_0.2.5`. Have you tried re-running the code from my answer on the sample data? – Maurits Evers Mar 18 '19 at 03:17
Yes, I have tried with your sample data and code. I will check what versions I am running – Kskiaskd Mar 18 '19 at 03:24
is there a method to doing this without using sim(.x)? – Kskiaskd Mar 18 '19 at 03:26
What do you mean by *"doing this without using sim(.x)"*? I assume you are referring to expressions of the form `!!(sym(..))`? `rlang::sym` is necessary to turn a string input into a symbol; the bang-bang operator then unquotes the symbol. This is how `dplyr`'s NSE (non-standard evaluation) works. – Maurits Evers Mar 18 '19 at 03:31
I have purrr 0.2.2 and dplyr 0.7.1 and cannot update. is there any other solution that does not use !!sym(.x) ? – Kskiaskd Mar 18 '19 at 03:39
@Kskiaskd Try `library(rlang)` or replacing all `sym` with `rlang::sym`; I think in earlier `dplyr` versions `rlang::sym` was not re-exported. – Maurits Evers Mar 18 '19 at 03:43

Ronak Shah · Answer 2 · 2019-03-18T03:58:23.810

Following the logic in your approach , we could do

cols <- c("a", "b", "c", "d")

library(dplyr)

df %>%
  group_by(date) %>%
  summarise_at(vars(cols),
    funs(mean(return[. > quantile(., 0.666, na.rm = TRUE)], na.rm = TRUE)))

#   date       a     b     c     d
#  <fct>  <dbl> <dbl> <dbl> <dbl>
#1 2/1/18     5     5     5     5
#2 2/2/18     4     8     8   NaN
#3 2/3/18     5     5     3     5
#4 2/4/18     6     6     7     5

However, since funs is being soft deprecated from dplyr 0.8.0 in favor of list and if we need to rename the columns as well we could use

df %>%
  group_by(date) %>%
  summarise_at(vars(cols), 
    list( ~ mean(return[. > quantile(., 0.666, na.rm = T)], na.rm = TRUE))) %>%
  rename_at(vars(cols), list(~ paste0("high_return_", .)))

#   date   high_return_a high_return_b high_return_c high_return_d
#   <fct>          <dbl>         <dbl>         <dbl>         <dbl>
#1 2/1/18             5             5             5             5
#2 2/2/18             4             8             8           NaN
#3 2/3/18             5             5             3             5
#4 2/4/18             6             6             7             5

Since you're applying a single function you can even omit the `list` statement and use `purrr`s `formula` syntax by itself, i.e. `summarise_at(vars(cols), ~mean(...))` — Maurits Evers, Mar 18 '19 at 05:09

Removing na from column during summarize_at and funs before performing function

2 Answers2

Sample data