I thought I had the answer to my question here, but when I use it with my larger data set I get different results. I suspect the difference comes from how the na.locf line behaves.

Basically I am converting code where I used mutate_at to the new syntax with mutate(across()).

In the first case below, the data is filled correctly because df_initial is still grouped by index_name. In the second case, I assume I get a different answer because I had to ungroup for mutate(across()) to work.

So here is another example with a larger data set to illustrate the problem.

Reproducible example:

df_initial <- 
structure(list(
  Date = structure(c(18681, 18681, 18681, 18681, 18682, 18682, 18682, 18682,
                     18683, 18683, 18683, 18683, 18684, 18684, 18684, 18684,
                     18685, 18685, 18685, 18685, 18686, 18686, 18686, 18686),
                   class = "Date"),
  index_name = c("INDU Index", "SPX Index", "TPX Index", "MEXBOL Index",
                 "INDU Index", "SPX Index", "TPX Index", "MEXBOL Index",
                 "INDU Index", "SPX Index", "TPX Index", "MEXBOL Index",
                 "INDU Index", "SPX Index", "TPX Index", "MEXBOL Index",
                 "INDU Index", "SPX Index", "TPX Index", "MEXBOL Index",
                 "INDU Index", "SPX Index", "TPX Index", "MEXBOL Index"),
  index_level = c(31537.35, 3881.37, NA, 45268.33,
                  31961.86, 3925.43, 1903.07, 45151.38,
                  31402.01, 3829.34, 1926.23, 44310.27,
                  30932.37, 3811.15, 1864.49, 44592.91,
                  NA, NA, NA, NA, NA, NA, NA, NA),
  totalReturn_daily = c(0.0497, 0.1277, 0, 0.7158,
                        1.3461, 1.1364, -1.8201, -0.1151,
                        -1.7181, -2.4339, 1.2411, -1.8629,
                        -1.4628, -0.4636, -3.2052, 0.6379,
                        0, 0, 0, 0, 0, 0, 0, 0)),
  row.names = c(NA, -24L),
  groups = structure(list(
    index_name = c("INDU Index", "MEXBOL Index", "SPX Index", "TPX Index"),
    .rows = structure(list(c(1L, 5L, 9L, 13L, 17L, 21L),
                           c(4L, 8L, 12L, 16L, 20L, 24L),
                           c(2L, 6L, 10L, 14L, 18L, 22L),
                           c(3L, 7L, 11L, 15L, 19L, 23L)),
                      ptype = integer(0),
                      class = c("vctrs_list_of", "vctrs_vctr", "list"))),
    row.names = c(NA, -4L),
    class = c("tbl_df", "tbl", "data.frame"),
    .drop = TRUE),
  class = c("grouped_df", "tbl_df", "tbl", "data.frame"))

The first approach below gives the correct values, but the second does not. I am trying to get the same answer from approach #2 as I get from approach #1.

# Approach 1: Expected output received here:
library(dplyr)
library(zoo)   # provides na.locf()

df_initial %>%
  mutate_at(vars(-index_name, -totalReturn_daily),
            ~ na.locf(., na.rm = FALSE)) %>%
  filter(index_name == "TPX Index")

# Output
  Date       index_name index_level totalReturn_daily
  <date>     <chr>            <dbl>             <dbl>
1 2021-02-23 TPX Index          NA               0   
2 2021-02-24 TPX Index        1903.             -1.82
3 2021-02-25 TPX Index        1926.              1.24
4 2021-02-26 TPX Index        1864.             -3.21
5 2021-02-27 TPX Index        1864.              0   
6 2021-02-28 TPX Index        1864.              0  

# Approach 2: Did not receive expected output here
df_initial %>%
  ungroup() %>%
  mutate(across(
    .cols = -c(index_name, totalReturn_daily),
    .fns  = ~ na.locf(., na.rm = FALSE)
  )) %>%
  filter(index_name == "TPX Index")

# Output
  Date       index_name index_level totalReturn_daily
  <date>     <chr>            <dbl>             <dbl>
1 2021-02-23 TPX Index        3881.              0   
2 2021-02-24 TPX Index        1903.             -1.82
3 2021-02-25 TPX Index        1926.              1.24
4 2021-02-26 TPX Index        1864.             -3.21
5 2021-02-27 TPX Index       44593.              0   
6 2021-02-28 TPX Index       44593.              0  

Thanks!

DaveM

1 Answer

Both approaches gave the same results for me when the data is grouped by index_name before the fill. Could you try the code below?

library(dplyr)
library(zoo)
df_initial %>%
  group_by(index_name) %>% 
  mutate_at(vars(-index_name, -totalReturn_daily),
            ~ na.locf(., na.rm = FALSE)) %>% 
  dplyr::filter(index_name == "TPX Index") 


df_initial %>%
  group_by(index_name) %>% 
  mutate(across(
    .cols = -c(totalReturn_daily),
    .fns  = ~ na.locf(., na.rm = FALSE)
  )) %>%
  ungroup() %>% 
  dplyr::filter(index_name == "TPX Index")
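
To see why the grouping matters, here is a minimal sketch on made-up data (the `toy` tibble and its `grp` and `value` columns are hypothetical, not part of the question's data). With the groups interleaved row by row, as in df_initial, an ungrouped na.locf fills each NA from the row directly above it, which usually belongs to a different group:

```r
library(dplyr)
library(zoo)

# Hypothetical toy data: two groups interleaved by row, like df_initial
toy <- tibble(
  grp   = c("a", "b", "a", "b", "a", "b"),
  value = c(NA, 10, 1, NA, NA, NA)
)

# Ungrouped: each NA takes the last non-NA value above it, whatever its group
ungrouped <- toy %>%
  mutate(across(value, ~ na.locf(.x, na.rm = FALSE)))

# Grouped: the fill is carried forward within each group only
grouped <- toy %>%
  group_by(grp) %>%
  mutate(across(value, ~ na.locf(.x, na.rm = FALSE))) %>%
  ungroup()

ungrouped$value  # NA 10 1 1 1 1
grouped$value    # NA 10 1 10 1 10
```

In the ungrouped result, group "b" picks up the value 1 from group "a", which looks like the same effect as TPX Index showing 3881 and 44593 in Approach 2.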
jsv
  • Indeed it does. Why does the `-index_name` not need to be in the second version but it does need to be in the first version? – DaveM Mar 04 '21 at 01:03
  • It needn't be there in the first approach either. – jsv Mar 04 '21 at 01:05
  • I think I missed somewhere that a variable used in `group_by` is shielded from `mutate_at` or `mutate(across())`? In other words, I do not need to exclude that variable in the mutate call because it is excluded automatically? – DaveM Mar 04 '21 at 01:10
  • This might help: https://github.com/tidyverse/dplyr/issues/4572 – jsv Mar 04 '21 at 01:12
  • Great, thanks for the reference and for the solution! – DaveM Mar 04 '21 at 01:18
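
The point raised in the comments can be checked with a small sketch on made-up data (the `toy` tibble and its columns are hypothetical). As far as I know, `across()` does not select grouping columns, so `everything()` on a grouped tibble reaches only the non-grouping columns:

```r
library(dplyr)

# Hypothetical toy data: one character grouping column, two integer columns
toy <- tibble(grp = c("a", "a", "b"), x = 1:3, y = 4:6)

# If grp were selected, multiplying a character column by 2L would error.
# It is skipped because it is the grouping column.
result <- toy %>%
  group_by(grp) %>%
  mutate(across(everything(), ~ .x * 2L)) %>%
  ungroup()

result$grp  # "a" "a" "b"
result$x    # 2 4 6
result$y    # 8 10 12
```

This would explain why `-index_name` is not needed in `.cols` once the data is grouped by index_name.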