DATA <- data.frame(x1 = sample(c(letters[1:5], NA), 1000, replace = TRUE),
                   x2 = runif(1000),
                   x3 = runif(1000),
                   x4 = sample(letters[20:23], 1000, replace = TRUE))

library(mice)
DATAIMPUTE <- complete(mice(DATA,m=5,maxit=50,meth='pmm',seed=500))

I have 'DATA' and want to impute the NA values in 'x1', which is an ordered categorical variable with levels a, b, c, d, e. I want to impute using only 'x2' and 'x4'. How do I specify which variables are used for the imputation? Before that, though: when I create DATAIMPUTE as shown above, I still have NA values in 'x1'. Can you please assist?

bvowe

2 Answers

2

Your "categorical" variables appear to be in character format. You may want to coerce them into factors before imputing; otherwise mice() will ignore them. Do:

DATA[sapply(DATA, is.character)] <- lapply(DATA[sapply(DATA, is.character)], as.factor)

str(DATA)
# 'data.frame': 1000 obs. of  4 variables:
#  $ x1: Factor w/ 5 levels "a","b","c","d",..: 2 2 NA NA 3 3 4 NA NA 4 ...
#  $ x2: num  0.932 0.87 0.886 0.925 0.984 ...
#  $ x3: num  0.292 0.734 0.764 0.943 0.806 ...
#  $ x4: Factor w/ 4 levels "t","u","v","w": 1 3 1 3 4 3 1 4 3 2 ... 
head(DATA)
#     x1        x2        x3 x4
# 1    b 0.9315629 0.2916144  t
# 2    b 0.8695138 0.7338165  v
# 3 <NA> 0.8863894 0.7642693  t
# 4 <NA> 0.9248280 0.9427943  v
# 5    c 0.9844646 0.8062173  w
# 6    c 0.6200558 0.7354498  v

Also, it might be a better idea to use a proportional odds model ("polr") for ordered categorical data instead of predictive mean matching ("pmm").

library(mice)
IMP <- mice(DATA, m=5, maxit=50, method=c("polr", "", "", ""), seed=500)
DATAIMPUTE <- complete(IMP)
head(DATAIMPUTE)
#   x1        x2        x3 x4
# 1  b 0.9315629 0.2916144  t
# 2  b 0.8695138 0.7338165  v
# 3  a 0.8863894 0.7642693  t
# 4  a 0.9248280 0.9427943  v
# 5  c 0.9844646 0.8062173  w
# 6  c 0.6200558 0.7354498  v

Important note: You seem to misunderstand the method if you think the complete() function gives you a valid imputed dataset. By default it uses action=1 and returns only the first of the m completed data sets, which is not multiple imputation at all! You should probably consult a statistician and read the documentation more thoroughly. There's a nice answer around that briefly summarizes the most important point.
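
To illustrate the note above, here is a minimal sketch of the usual workflow: fit the analysis model on every completed data set with with() and combine the estimates with pool(). The analysis model lm(x2 ~ x1), the reduced maxit=5, and the explicit ordered factor are assumptions made only for this illustration; they are not taken from the question.

```r
library(mice)

set.seed(74)
DATA <- data.frame(x1 = sample(c(letters[1:5], NA), 1000, replace = TRUE),
                   x2 = runif(1000),
                   x3 = runif(1000),
                   x4 = sample(letters[20:23], 1000, replace = TRUE))

# Assumption: x1 is declared as an ordered factor, x4 as a plain factor.
DATA$x1 <- factor(DATA$x1, levels = letters[1:5], ordered = TRUE)
DATA$x4 <- factor(DATA$x4)

# Impute x1 only; maxit is kept small here so the sketch runs quickly.
IMP <- mice(DATA, m = 5, maxit = 5, method = c("polr", "", "", ""),
            seed = 500, printFlag = FALSE)

# All five completed data sets as a list, not just the first one.
all_sets <- complete(IMP, action = "all")

# Hypothetical analysis model, fitted once per completed data set ...
fit <- with(IMP, lm(x2 ~ x1))

# ... and pooled into a single set of estimates.
pooled <- pool(fit)
summary(pooled)
```

A single completed data set from complete(IMP) is fine for inspecting the imputations, but inference should come from the pooled object.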


Data:

set.seed(74)
DATA <- data.frame(x1=sample(c(letters[1:5], NA), 1000, replace=TRUE),
                   x2=runif(1000),
                   x3=runif(1000),
                   x4=sample(letters[20:23], 1000, replace=TRUE))
jay.sf
  • thanks so much! is it possible to impute multiple variables at the same time, but using different variables to impute each of them? also may i ask why in meth= there is "","","","" is that because those columns are not imputed? – bvowe Feb 06 '21 at 11:51
  • @bvowe I wrote `method=c("polr", "", "", "")` to emphasize that only the first variable is imputed; you can define the appropriate method for each variable. To specify which variables are predicted by which, use the `predictorMatrix=` argument, see `?mice::mice`. – jay.sf Feb 06 '21 at 12:01
  • thank you but how do i recover the imputed data set with no missing values? – bvowe Feb 08 '21 at 19:50
  • @bvowe Please read [my related answer there](https://stackoverflow.com/a/66059183/6574038). – jay.sf Feb 08 '21 at 22:53
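
As a sketch of the `predictorMatrix=` suggestion in the comments above: rows of the matrix are the variables being imputed, columns are the candidate predictors, and a 1 means "use this column to predict this row". The example below restricts the predictors of x1 to x2 and x4. The name `pred`, the reduced maxit=5, and the ordered-factor conversion are assumptions for illustration only.

```r
library(mice)

set.seed(74)
DATA <- data.frame(
  x1 = factor(sample(c(letters[1:5], NA), 1000, replace = TRUE),
              levels = letters[1:5], ordered = TRUE),
  x2 = runif(1000),
  x3 = runif(1000),
  x4 = factor(sample(letters[20:23], 1000, replace = TRUE))
)

# Start from the default matrix (every variable predicts every other one).
pred <- make.predictorMatrix(DATA)

# For the x1 row, switch everything off, then enable only x2 and x4.
pred["x1", ] <- 0
pred["x1", c("x2", "x4")] <- 1

IMP <- mice(DATA, m = 5, maxit = 5, method = c("polr", "", "", ""),
            predictorMatrix = pred, seed = 500, printFlag = FALSE)

# Check which predictors were actually used for x1.
IMP$predictorMatrix["x1", ]
```

To impute several variables with different predictor sets, give each of them a non-empty entry in `method=` and edit the corresponding row of `pred` in the same way.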
1

Solution with miceFast and data.table:

DATA <- data.frame(x1 = sample(c(letters[1:5], NA), 1000, replace = TRUE),
                   x2 = runif(1000),
                   x3 = runif(1000),
                   x4 = sample(letters[20:23], 1000, replace = TRUE))

library(miceFast)
library(data.table)
setDT(DATA)

DATA[, x1_imp := fill_NA(x = .SD, model = "lda", 1, c(2,4))]
DATA
#>         x1        x2        x3 x4 x1_imp
#>    1:    c 0.5008937 0.5243911  v      c
#>    2:    e 0.5962688 0.0934651  w      e
#>    3:    d 0.7137371 0.8820708  u      d
#>    4:    b 0.6072431 0.5465608  v      b
#>    5:    c 0.6145810 0.5505094  v      c
#>   ---                                   
#>  996:    c 0.4813571 0.2091526  w      c
#>  997:    a 0.1862372 0.8401363  t      a
#>  998:    a 0.4391520 0.2364032  u      a
#>  999:    b 0.4673802 0.8268595  v      b
#> 1000: <NA> 0.1752227 0.9582994  t      d

Created on 2021-02-04 by the reprex package (v0.3.0)

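.SD is data.table's shorthand for the subset of data in the current group; since no by= is used here, it is the whole table. The 1 is the column index passed to the posit_y argument (the dependent variable, here x1), and c(2,4) are the column indices passed to posit_x (the predictors x2 and x4). Note that I used an lda model, in contrast to the pmm you wanted to use in mice.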

polkas
  • thank you so much is there any way you can provide some more information for example what does .SD mean and why is there ", 1, " also is fill_NA the same as MICE? – bvowe Feb 06 '21 at 11:50