select non-missing variables in a purrr loop

Question

Consider this example

mydata <- data_frame(ind_1 = c(NA,NA,3,4),
                     ind_2 = c(2,3,4,5),
                     ind_3 = c(5,6,NA,NA),
                     y = c(28,34,25,12),
                     group = c('a','a','b','b'))

> mydata
# A tibble: 4 x 5
  ind_1 ind_2 ind_3     y group
  <dbl> <dbl> <dbl> <dbl> <chr>
1    NA     2     5    28 a    
2    NA     3     6    34 a    
3     3     4    NA    25 b    
4     4     5    NA    12 b

Here I want, for each group, regress y on whatever variable is not missing in that group, and store the corresponding lm object in a list-column.

That is:

for group a, these variables correspond to ind_2 and ind_3
for group b, they correspond to ind_1 and ind_2

I tried the following but this does not work

mydata %>% group_by(group) %>% nest() %>% 
  do(filtered_df <- . %>% select(which(colMeans(is.na(.)) == 0)),
     myreg = lm(y~ names(filtered_df)))

Any ideas? Thanks!

score 8 · Accepted Answer · answered Sep 10 '18 at 20:16

We can use map and mutate. We can either select and model in one step (nestdat1) or in separate steps using two map's if you want to preserve the filtered data (nestdat2):

library(tidyverse)

nestdat1 <- mydata %>%
  group_by(group) %>%
  nest() %>%
  mutate(model = data %>% map(~ select_if(., funs(!any(is.na(.)))) %>%
                                lm(y ~ ., data = .)))

nestdat2 <- mydata %>%
  group_by(group) %>%
  nest() %>%
  mutate(data = data %>% map(~ select_if(., funs(!any(is.na(.))))),
         model = data %>% map(~ lm(y ~ ., data = .)))

Output:

They produce different data columns:

> nestdat1 %>% pull(data)
[[1]]
# A tibble: 2 x 4
  ind_1 ind_2 ind_3     y
  <dbl> <dbl> <dbl> <dbl>
1    NA     2     5    28
2    NA     3     6    34

[[2]]
# A tibble: 2 x 4
  ind_1 ind_2 ind_3     y
  <dbl> <dbl> <dbl> <dbl>
1     3     4    NA    25
2     4     5    NA    12

> nestdat2 %>% pull(data)
[[1]]
# A tibble: 2 x 3
  ind_2 ind_3     y
  <dbl> <dbl> <dbl>
1     2     5    28
2     3     6    34

[[2]]
# A tibble: 2 x 3
  ind_1 ind_2     y
  <dbl> <dbl> <dbl>
1     3     4    25
2     4     5    12

But the same model column:

> nestdat1 %>% pull(model)
[[1]]

Call:
lm(formula = y ~ ., data = .)

Coefficients:
(Intercept)        ind_2        ind_3  
         16            6           NA  

[[2]]

Call:
lm(formula = y ~ ., data = .)

Coefficients:
(Intercept)        ind_1        ind_2  
         64          -13           NA  


> nestdat2 %>% pull(model)
[[1]]

Call:
lm(formula = y ~ ., data = .)

Coefficients:
(Intercept)        ind_2        ind_3  
         16            6           NA  

[[2]]

Call:
lm(formula = y ~ ., data = .)

Coefficients:
(Intercept)        ind_1        ind_2  
         64          -13           NA

just a follow up. how can I manually add a few extra variables to the `select_if`? — ℕʘʘḆḽḘ, Sep 10 '18 at 20:22
say, keep the variables that are non-missing, but also keep the variable `date` — ℕʘʘḆḽḘ, Sep 10 '18 at 20:22
@ℕʘʘḆḽḘ That's more complicated, but feasible I guess with something like: `mydata %>% select_if(sapply(., function(x) !any(is.na(x))) | grepl("date", names(.)))` or `mydata %>% select(matches("date"), names(.)[map_lgl(., ~!any(is.na(.)))])` — acylam, Sep 10 '18 at 20:39
@ℕʘʘḆḽḘ Check this answer: https://stackoverflow.com/questions/48032969/dplyrselect-if-can-use-colnames-and-their-values-at-the-same-time — acylam, Sep 10 '18 at 20:39
haha I saw, I am trying to replace the `sapply(., mean) ` by `map` here... — ℕʘʘḆḽḘ, Sep 10 '18 at 20:45
@ℕʘʘḆḽḘ With `map` it would be `mydata %>% select_if(map_lgl(., ~ !any(is.na(.))) | grepl("date", names(.))) ` — acylam, Sep 10 '18 at 20:47

score 2 · Answer 2 · answered Sep 10 '18 at 22:01

Here's another tidyverse option, assign to mydata$model if you wish to keep it in your tibble :

library(tidyverse)
mydata %>%
  nest(-group) %>%
  pull(data) %>%
  map(~lm(y ~., discard(.,anyNA)))
# [[1]]
# 
# Call:
# lm(formula = y ~ ., data = discard(., anyNA))
# 
# Coefficients:
# (Intercept)        ind_2        ind_3  
#          16            6           NA  
# 
# 
# [[2]]
# 
# Call:
# lm(formula = y ~ ., data = discard(., anyNA))
# 
# Coefficients:
# (Intercept)        ind_1        ind_2  
#          64          -13           NA  
# 
#

select non-missing variables in a purrr loop

2 Answers2