R Imputation with Ordered Categorical

Question

DATA=data.frame(x1 = c(sample(c(letters[1:5], NA), 1000, r = T)),
                       x2 = runif(1000),
                       x3 = runif(1000),
                       x4 = sample(letters[20:23], 1000, r = T))

library(mice)
DATAIMPUTE <- complete(mice(DATA,m=5,maxit=50,meth='pmm',seed=500))

I have 'DATA' and wish to impute the NA values in 'x1', which is an ordered categorical variable with levels a, b, c, d, e. I wish to impute using 'x2' and 'x4', but how do you specify which variables to use for the imputation? And before that: when I create DATAIMPUTE, I still have NA values in x1. Can you please kindly assist?

jay.sf · Accepted Answer · 2021-02-05T06:49:56.797

Your "categorical" variables appear to be in character format. You may want to coerce them into factors before imputing; otherwise mice() will ignore them. Do:

DATA[sapply(DATA, is.character)] <- lapply(DATA[sapply(DATA, is.character)], as.factor)

str(DATA)
# 'data.frame': 1000 obs. of  4 variables:
#  $ x1: Factor w/ 5 levels "a","b","c","d",..: 2 2 NA NA 3 3 4 NA NA 4 ...
#  $ x2: num  0.932 0.87 0.886 0.925 0.984 ...
#  $ x3: num  0.292 0.734 0.764 0.943 0.806 ...
#  $ x4: Factor w/ 4 levels "t","u","v","w": 1 3 1 3 4 3 1 4 3 2 ... 
head(DATA)
#     x1        x2        x3 x4
# 1    b 0.9315629 0.2916144  t
# 2    b 0.8695138 0.7338165  v
# 3 <NA> 0.8863894 0.7642693  t
# 4 <NA> 0.9248280 0.9427943  v
# 5    c 0.9844646 0.8062173  w
# 6    c 0.6200558 0.7354498  v
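
Because the question describes x1 as ordered (a < b < c < d < e), it may be worth encoding that order explicitly; as.factor() on its own produces an unordered factor. A minimal base R sketch, using a small made-up vector as a stand-in for DATA$x1 (the values are illustrative, not taken from the question):

```r
# Toy stand-in for DATA$x1 (illustrative values only)
x1 <- c("b", "b", NA, NA, "c", "e", "a", "d")

# Declare the level order explicitly; NA values stay NA
x1_ord <- factor(x1, levels = letters[1:5], ordered = TRUE)

is.ordered(x1_ord)   # TRUE
levels(x1_ord)       # "a" "b" "c" "d" "e"
sum(is.na(x1_ord))   # 2
```

For an ordered factor, mice() chooses "polr" as its default method; for an unordered factor with more than two levels the default is "polyreg".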

Also, it might be a better idea to use a proportional odds model ("polr") for ordered categorical data instead of predictive mean matching ("pmm").

library(mice)
IMP <- mice(DATA, m=5, maxit=50, method=c("polr", "", "", ""), seed=500)
DATAIMPUTE <- complete(IMP)
head(DATAIMPUTE)
#   x1        x2        x3 x4
# 1  b 0.9315629 0.2916144  t
# 2  b 0.8695138 0.7338165  v
# 3  a 0.8863894 0.7642693  t
# 4  a 0.9248280 0.9427943  v
# 5  c 0.9844646 0.8062173  w
# 6  c 0.6200558 0.7354498  v

Important note: You seem to misunderstand the method if you think the complete() function gives you a valid imputed dataset. By default it uses action=1 and returns only the first completed data set, which is no multiple imputation at all! You probably should consult a statistician and read the documentation more thoroughly. There's a nice answer around that briefly summarizes the most important point.
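
To make the note above concrete, here is a rough sketch of the usual multiple-imputation workflow: keep all m completed data sets, fit the analysis model on each of them, and pool the results. The analysis model (a linear regression of x2 on x1 and x4) is purely illustrative and not part of the question, and the sample size and maxit are reduced here only to keep the run short.

```r
library(mice)

set.seed(74)
# Same structure as the question's data, shrunk to 200 rows;
# stringsAsFactors = TRUE turns x1 and x4 into factors right away
DATA <- data.frame(x1 = sample(c(letters[1:5], NA), 200, replace = TRUE),
                   x2 = runif(200),
                   x3 = runif(200),
                   x4 = sample(letters[20:23], 200, replace = TRUE),
                   stringsAsFactors = TRUE)

IMP <- mice(DATA, m = 5, maxit = 5, method = c("polr", "", "", ""),
            seed = 500, printFlag = FALSE)

# All five completed data sets as a list, not just the first one
ALL <- complete(IMP, action = "all")
length(ALL)   # 5

# Fit the (illustrative) analysis model on every completed data set ...
FIT <- with(IMP, lm(x2 ~ x1 + x4))
# ... and combine the five fits using Rubin's rules
summary(pool(FIT))
```

A single data frame from complete(IMP) is fine for a quick look at the imputed values, but inference should be based on the pooled fit.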

Data:

set.seed(74)
DATA <- data.frame(x1 = sample(c(letters[1:5], NA), 1000, replace = TRUE),
                   x2 = runif(1000),
                   x3 = runif(1000),
                   x4 = sample(letters[20:23], 1000, replace = TRUE))

thanks so much! is it possible to impute multiple variables at the same time, but using different variables to impute each of them? also may i ask why in meth= there is "","","","" is that because those columns are not imputed? — bvowe, Feb 06 '21 at 11:51
@bvowe I wrote `method=c("polr", "", "", "")` to emphasize that there's just the first variable imputed, you can define for each variable the appropriate method. To specify which variables are predicted by which, use the `predictorMatrix=` argument, see `?mice::mice`. — jay.sf, Feb 06 '21 at 12:01
thank you but how do i recover the imputed data set with no missing values? — bvowe, Feb 08 '21 at 19:50
@bvowe Please read [my related answer there](https://stackoverflow.com/a/66059183/6574038). — jay.sf, Feb 08 '21 at 22:53
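
The `predictorMatrix=` argument mentioned in the comments could be used roughly as follows to impute x1 from x2 and x4 only, as the question asks. This is a sketch on the same kind of simulated data (shrunk to 200 rows, with a shorter maxit). In the matrix, each row is a variable to be imputed, each column is a candidate predictor, and a 1 means "use this predictor".

```r
library(mice)

set.seed(74)
DATA <- data.frame(x1 = sample(c(letters[1:5], NA), 200, replace = TRUE),
                   x2 = runif(200),
                   x3 = runif(200),
                   x4 = sample(letters[20:23], 200, replace = TRUE),
                   stringsAsFactors = TRUE)

# Start from the default matrix (every variable predicts every other one)
PRED <- make.predictorMatrix(DATA)

# Row "x1": switch x3 off, keep x2 and x4 as predictors
PRED["x1", ] <- c(x1 = 0, x2 = 1, x3 = 0, x4 = 1)

IMP <- mice(DATA, m = 5, maxit = 5, method = c("polr", "", "", ""),
            predictorMatrix = PRED, seed = 500, printFlag = FALSE)

IMP$predictorMatrix["x1", ]   # which predictors were used for x1
anyNA(complete(IMP, 1)$x1)    # FALSE: no NA left in the first completed set
```

When several variables have missing values, each gets its own row in the matrix and its own entry in `method=`, so different predictor sets can be chosen per variable.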

polkas · Answer 2 · 2021-02-04T22:41:57.550

Solution with miceFast and data.table:

DATA=data.frame(x1 = c(sample(c(letters[1:5], NA), 1000, r = T)),
                x2 = runif(1000),
                x3 = runif(1000),
                x4 = sample(letters[20:23], 1000, r = T))

library(miceFast)
library(data.table)
setDT(DATA)

DATA[, x1_imp := fill_NA(x = .SD, model = "lda", 1, c(2,4))]
DATA
#>         x1        x2        x3 x4 x1_imp
#>    1:    c 0.5008937 0.5243911  v      c
#>    2:    e 0.5962688 0.0934651  w      e
#>    3:    d 0.7137371 0.8820708  u      d
#>    4:    b 0.6072431 0.5465608  v      b
#>    5:    c 0.6145810 0.5505094  v      c
#>   ---                                   
#>  996:    c 0.4813571 0.2091526  w      c
#>  997:    a 0.1862372 0.8401363  t      a
#>  998:    a 0.4391520 0.2364032  u      a
#>  999:    b 0.4673802 0.8268595  v      b
#> 1000: <NA> 0.1752227 0.9582994  t      d

Created on 2021-02-04 by the reprex package (v0.3.0)

.SD is a data.table shortcut for the whole data.frame. 1 is the index passed to the posit_y argument (the dependent variable), and c(2,4) are the indices of the predictors (posit_x). Note that I used an lda model, in contrast to the pmm you used in mice.
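
To illustrate what .SD and the two index arguments refer to, here is a small data.table-only sketch. The toy table is made up for illustration, and no imputation is run; it only shows which columns the positions 1 and c(2, 4) pick out.

```r
library(data.table)

# Toy table with the same column layout as DATA (values are illustrative)
DT <- data.table(x1 = c("a", NA, "c"),
                 x2 = c(0.1, 0.5, 0.9),
                 x3 = c(0.3, 0.2, 0.8),
                 x4 = c("t", "u", "v"))

# Without `by` or `.SDcols`, .SD holds every column of DT
DT[, names(.SD)]            # "x1" "x2" "x3" "x4"

# Position 1 is therefore the dependent variable ...
DT[, names(.SD)[1]]         # "x1"
# ... and positions c(2, 4) are the predictors
DT[, names(.SD)[c(2, 4)]]   # "x2" "x4"
```

So the call in the answer imputes x1 from x2 and x4, which matches the predictors requested in the question.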

thank you so much is there any way you can provide some more information for example what does .SD mean and why is there ", 1, " also is fill_NA the same as MICE? — bvowe, Feb 06 '21 at 11:50
