
I am struggling with using MICE for a dataset. There is a variable that is definitely contingent on another variable and I can't work out how to get MICE to impute only some of the missing values in one variable (and leave the others as genuinely missing).

For example, I have a dataset of sex, pregnancy status and outcome. Only females can be pregnant, so where 'pregnant' is missing but the subject is male, then I don't want to impute a value there.

But I do want to impute a value to pregnancy where this is missing for females. All the variables (including sex and outcome) have some missingness.

I have read the advice in the question "'R', 'mice', missing variable imputation - how to only do one column in sparse matrix" and tried to use the 'where' option in mice.

But when I use 'where', not all of the missing values of sex get imputed.

For example;

library(mice)
library(tidyverse)
library(haven)
library(janitor)

# create some data
sex <- c("m","f","m","f","m",NA,NA,"f","f","m","f","f","m","m","f","m")
preg <- c(NA,"not_preg",NA,NA,NA,NA,"preg","not_preg",NA,"not_preg","preg",NA,NA,NA,NA) 
outcome <- c(1,0,1,0,0,NA,NA,0,0,1,0,1,1,0,0)
df <- cbind(sex,preg,outcome) %>% as_tibble() %>% mutate(sex=as_factor(sex)) %>% mutate(preg=as_factor(preg))

# look at what's missing
md.pattern(df)
df %>% tabyl(sex,preg)
df %>% tabyl(preg)

# Try to impute over everything to show mice working
mice_a <- mice(df, m=2, maxit=2, seed=3,method="pmm")
df_imp_a <- complete(mice_a, action="long", include = FALSE)

df_imp_a %>% filter(.imp==1) %>% tabyl(sex,preg)  # this has imputed that some men are pregnant (understandably, but not what I want!)
df_imp_a %>% filter(.imp==1) %>% tabyl(sex) #but everyone has a sex imputed
df_imp_a %>% filter(.imp==1) %>% tabyl(preg)

# Try to use the 'where' option

# b. Using it with a 'blank' where as proof of principle

grid_b <- is.na(df) #this is just default
mice_b <- mice(df, m=2, maxit=2, seed=3,method="pmm",where=grid_b)
df_imp_b <- complete(mice_b, action="long", include = FALSE)
df_imp_b %>% filter(.imp==1) %>% tabyl(sex,preg) #same problem of pregnant men (obviously, haven't changed anything yet)
df_imp_b %>% filter(.imp==1) %>% tabyl(sex) # but at least everyone has a sex imputed
df_imp_b %>% filter(.imp==1) %>% tabyl(preg)

# c. Making a proper grid of data that I do and don't want imputed

grid_c <- df %>%
  mutate(preg=case_when(
    sex=="f" & is.na(preg)==TRUE ~ TRUE,
    TRUE ~ FALSE
  )) %>%
  mutate(sex=is.na(sex)) %>%
  mutate(outcome=is.na(outcome))

grid_c
grid_c %>% tabyl(preg) # so we are looking for 4 imputed values of 'preg' (so I've done it right -- there are 4 females with unknown pregnancy status)

mice_c <- mice(df,m=2,maxit=2,seed=3,method="pmm",where=grid_c)
df_imp_c <- complete(mice_c,action="long",include=FALSE)

df_imp_c %>% filter(.imp==1) %>% tabyl(sex,preg) # now I have no pregnant men -- which is good!
df_imp_c %>% filter(.imp==1) %>% tabyl(sex) # but I am missing sex for one person??
df_imp_c %>% filter(.imp==1) %>% tabyl(preg) # have imputed all the pregnancy data that I wanted through -- only 7 NAs (for the 7 men)

How do I manage to tell mice that I only want some rows within a certain column imputed and not all of them? But that I do want all rows of another column imputed? Why is the 'where' option not behaving like I thought it would?

All help very much appreciated! Thank you.

slamballais
richardb

2 Answers


I had a similar problem and did not want to impute cells in over 70 columns if age was younger than 15. The following short code was very helpful.

Add where = miss.infor.data to your mice() call.

#copy your dataset
    df2 <- df 

# Overwrite columns 248 to 320 with a placeholder value (100) for those under age 15, so these cells are no longer NA
    df2[df2$age < 15, 248:320] <- 100 

# create the logical mask: cells holding the placeholder 100 are not NA, so they are FALSE and will not be imputed via the where option
    miss.infor.data <- as.data.frame(lapply(df2, is.na))
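
A variation on the same idea, as a minimal sketch: instead of writing a placeholder value into a copy of the data, the mask can be built from is.na() and the unwanted cells switched off directly. The data frame toy and its columns age and score are hypothetical and only illustrate the indexing.

```r
# Hypothetical toy data: 'score' should not be imputed for anyone under 15.
toy <- data.frame(
  age   = c(10, 20, 12, 30),
  score = c(NA, NA, 5, 7)
)

# Default mask: TRUE marks a missing cell, i.e. a cell eligible for imputation.
where_mask <- is.na(toy)

# Switch off imputation of 'score' for rows with age < 15.
# which() drops rows where age itself is NA, so they keep their default.
where_mask[which(toy$age < 15), "score"] <- FALSE

where_mask
```

The resulting logical matrix has the same dimensions as the data, so it could then be passed on as mice(toy, where = where_mask).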

Sem

After a lot of experimenting, it seems that mice has a problem with the fact that you don't allow imputation of preg where sex is NA. It seems to work if you set up grid_c as follows:

grid_c <- df %>%
  mutate(preg=case_when(
    (sex=="f"|is.na(sex)) & is.na(preg)==TRUE ~ TRUE,
    TRUE ~ FALSE
  )) %>%
  mutate(sex=is.na(sex)) %>%
  mutate(outcome=is.na(outcome))

Note the change in (sex=="f"|is.na(sex)).

The downside is that you get some men who are not_preg. While technically correct, you probably want them set to NA, too. You could either do that in post-processing, or avoid the problem before imputation by adding another category to preg that codes the inability to be pregnant for men (instead of NA).
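
Both options could look roughly like this. It is only a sketch on short toy vectors, not on the data from the question; the level name not_applicable and the data frame completed are assumptions made for illustration.

```r
# Option 1: before imputation, give known males an explicit category.
sex  <- factor(c("m", "f", "m", "f", NA))
preg <- factor(c(NA, "preg", NA, NA, NA))

# Add the new level first, otherwise the assignment would produce NA.
levels(preg) <- c(levels(preg), "not_applicable")
preg[which(sex == "m")] <- "not_applicable"

table(sex, preg, useNA = "ifany")

# Option 2: after imputation, blank out preg for males again.
# 'completed' stands in for one completed data set from complete().
completed <- data.frame(
  sex  = c("m", "f", "m"),
  preg = c("not_preg", "preg", "not_preg")
)
completed$preg[completed$sex == "m"] <- NA
```

With option 1 the males no longer count as missing on preg, so the default where mask leaves them alone.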

Now, if you do the above, you will run into a second problem: Now you have remaining NAs in outcome. This seems to be because your test data does not contain enough information to impute all missing values. Note that your 6th row contains no information at all, so mice has no data on that person to feed into the pmm algorithm.
You can solve this by including further variables that may contain information about the missing values (and that are not NA for the rows in question). If you don't have such data, exclude that person, because you don't really have them in your sample at all.
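
A quick way to find such rows before imputing, sketched on a hypothetical three-row data frame (toy is not the data from the question):

```r
# Hypothetical toy data: the second row has no observed value at all.
toy <- data.frame(
  sex     = c("m", NA, "f"),
  preg    = c(NA, NA, "preg"),
  outcome = c(1, NA, 0)
)

# A row is empty if it has zero non-missing cells.
empty_rows <- rowSums(!is.na(toy)) == 0
which(empty_rows)

# Drop the empty rows before handing the data to mice().
toy_clean <- toy[!empty_rows, ]
```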

Last, if you go on testing the procedure with example data like yours, make it a longer dataset. While experimenting with it, I got further warnings (in mice_c$loggedEvents), which are due to low number of cases for some combinations of your categorical variables.

benimwolfspelz