I am struggling with using MICE for a dataset. There is a variable that is definitely contingent on another variable and I can't work out how to get MICE to impute only some of the missing values in one variable (and leave the others as genuinely missing).
For example, I have a dataset of sex, pregnancy status and outcome. Only females can be pregnant, so where pregnancy status is missing and the subject is male, I don't want a value imputed.
But I do want pregnancy status imputed where it is missing for females. All the variables (including sex and outcome) have some missingness.
I have read the advice in the question "'R', 'mice', missing variable imputation - how to only do one column in sparse matrix" and tried to use the 'where' argument of mice().
But when I use 'where', not all of the missing values of sex get imputed.
For example:
library(mice)
library(tidyverse)
library(haven)
library(janitor)
# create some data
sex <- c("m","f","m","f","m",NA,NA,"f","f","m","f","f","m","m","f","m")
preg <- c(NA,"not_preg",NA,NA,NA,NA,"preg","not_preg",NA,"not_preg","preg",NA,NA,NA,NA)
outcome <- c(1,0,1,0,0,NA,NA,0,0,1,0,1,1,0,0)
df <- cbind(sex,preg,outcome) %>% as_tibble() %>% mutate(sex=as_factor(sex)) %>% mutate(preg=as_factor(preg))
# look at what's missing
md.pattern(df)
df %>% tabyl(sex,preg)
df %>% tabyl(preg)
# Try to impute over everything to show mice working
mice_a <- mice(df, m=2, maxit=2, seed=3,method="pmm")
df_imp_a <- complete(mice_a, action="long", include = FALSE)
df_imp_a %>% filter(.imp==1) %>% tabyl(sex,preg) # this has imputed that some men are pregnant (understandable, but not what I want!)
df_imp_a %>% filter(.imp==1) %>% tabyl(sex) #but everyone has a sex imputed
df_imp_a %>% filter(.imp==1) %>% tabyl(preg)
# Try to use the 'where' option
# b. Using it with a 'blank' where as proof of principle
grid_b <- is.na(df) #this is just default
mice_b <- mice(df, m=2, maxit=2, seed=3,method="pmm",where=grid_b)
df_imp_b <- complete(mice_b, action="long", include = FALSE)
df_imp_b %>% filter(.imp==1) %>% tabyl(sex,preg) #same problem of pregnant men (obviously, haven't changed anything yet)
df_imp_b %>% filter(.imp==1) %>% tabyl(sex) # but at least everyone has a sex imputed
df_imp_b %>% filter(.imp==1) %>% tabyl(preg)
# c. Making a proper grid of data that I do and don't want imputed
grid_c <- df %>%
  mutate(preg = case_when(
    sex == "f" & is.na(preg) ~ TRUE,
    TRUE ~ FALSE
  )) %>%
  mutate(sex = is.na(sex)) %>%
  mutate(outcome = is.na(outcome))
grid_c
grid_c %>% tabyl(preg) # so we are looking for 4 imputed values of 'preg' (so I've done it right -- there are 4 females with unknown pregnancy status)
mice_c <- mice(df,m=2,maxit=2,seed=3,method="pmm",where=grid_c)
df_imp_c <- complete(mice_c,action="long",include=FALSE)
df_imp_c %>% filter(.imp==1) %>% tabyl(sex,preg) # now I have no pregnant men -- which is good!
df_imp_c %>% filter(.imp==1) %>% tabyl(sex) # but I am missing sex for one person??
df_imp_c %>% filter(.imp==1) %>% tabyl(preg) # all the pregnancy data that I wanted has been imputed -- only 7 NAs (for the 7 men)
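One check that might help narrow this down: list the rows where sex is flagged for imputation but preg is missing and not flagged. Those rows would still have a missing predictor when sex is imputed. Whether that is actually why one sex value stays NA is an assumption here, not something confirmed. The sketch below uses only base R and a hypothetical toy data frame (`toy`, `where_toy` and `problem_rows` are made-up names, not the data above), with the same rule as grid_c.

```r
# Hypothetical toy data, only to illustrate the check (not the data above)
toy <- data.frame(
  sex  = factor(c("m", "f", NA, NA, "f")),
  preg = factor(c(NA, NA, NA, "preg", "not_preg"))
)

# Start from the default grid: impute every missing cell
where_toy <- is.na(toy)

# Same rule as grid_c: impute preg only when it is missing and sex is
# recorded as female (rows with unknown sex end up FALSE)
where_toy[, "preg"] <- is.na(toy$preg) & !is.na(toy$sex) & toy$sex == "f"

# Rows where sex should be imputed, but preg is missing and will stay missing
problem_rows <- unname(which(
  where_toy[, "sex"] & is.na(toy$preg) & !where_toy[, "preg"]
))

print(where_toy)
print(problem_rows)
```

In this toy example only the row with both sex and preg missing is listed. Running the same check on df and grid_c would show whether the person left without a sex is in that situation.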
How do I manage to tell mice that I only want some rows within a certain column imputed and not all of them? But that I do want all rows of another column imputed? Why is the 'where' option not behaving like I thought it would?
All help very much appreciated! Thank you.