Why do I have more observations than in the original dataset after imputation using mice? Original data (impute): 27727 observations vs. imputed data (impdat): 138635 observations

Question

library(mice)

data <- read.csv("Documents/ABA/dataset.csv")
df <- subset(data, select=c(k7, n3, n2a, d1a1x, k17, bmgc23g, m1a_corruption_pos, 
                            j30_permit_pos, bmge1, lcu, j30_instability_pos, 
                            bmgc25))

# keep only the variables to be imputed
impute <- df[c("k7","k17","d1a1x","bmgc23g", "m1a_corruption_pos", 
               "j30_permit_pos", "bmge1", "lcu", "j30_instability_pos",
               "bmgc25")]

tempData <- mice(impute, m=5, maxit=10, method="pmm", seed=500)

impdat <- complete(tempData, action="long", include=FALSE)

May I know what is wrong, or how it can be fixed?

score 0 · Answer 1 · answered Dec 01 '21 at 06:01

This is correct! First, you used mice(., m=5) (the default) to impute your data set five times. Then, using complete(., action="long"), you stacked all five imputed data sets in long format. To distinguish the individual imputations, two variables are added: .imp, which identifies the imputation (1 to 5), and .id, which holds the original row names.

library(mice)
imp <- mice(nhanes, m=5)

nhanes_imp <- complete(imp, action='long')
nhanes_imp
#     .imp .id age  bmi hyp chl
# 1      1   1   1 29.6   1 187
# 2      1   2   2 22.7   1 187
# 3      1   3   1 29.6   1 187
# [...]
# 26     2   1   1 22.7   1 118
# 27     2   2   2 22.7   1 187
# 28     2   3   1 30.1   1 187
# [...]
# 51     3   1   1 27.2   1 131
# 52     3   2   2 22.7   1 187
# 53     3   3   1 24.9   1 187
# [...]
# 76     4   1   1 22.0   1 113
# 77     4   2   2 22.7   1 187
# 78     4   3   1 22.0   1 187
# [...]
# 101    5   1   1 35.3   1 187
# 102    5   2   2 22.7   1 187
# 103    5   3   1 35.3   1 187
# [...]

Naturally, your imputed data set has five times as many rows as your initial one (in your case, 5 × 27727 = 138635).

nrow(nhanes_imp) / nrow(nhanes)
# [1] 5
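To see the stacking more concretely, here is a minimal sketch that counts the rows per imputation and pulls out one completed data set. It assumes the built-in nhanes data as a stand-in for your own data; the seed value and the object names rows_per_imp and second are only illustrative.

```r
library(mice)

# Assumption: nhanes stands in for the real data; the seed is arbitrary.
imp <- mice(nhanes, m = 5, seed = 500, printFlag = FALSE)
nhanes_long <- complete(imp, action = "long")

# Each imputation contributes one full copy of the original rows.
rows_per_imp <- table(nhanes_long$.imp)
print(rows_per_imp)

# Extract a single completed data set, here the second imputation.
second <- subset(nhanes_long, .imp == 2)
print(nrow(second) == nrow(nhanes))
```

Any single block selected by .imp has the same number of rows as the original data; only the stacked object is five times as long.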

You should never use complete() without action='long' (see my older answer there): the default, action=1, returns only the first imputed data set and ignores the other four.

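Continue by pooling your calculations. For OLS, for instance, fit the model on each imputed data set using with(), then combine the five fits with pool(), which comes with mice. pool() applies Rubin's rules: it averages the coefficient estimates over the five imputations and combines the within- and between-imputation variance into the standard errors.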

fit <- with(data=imp, expr=lm(bmi ~ hyp + chl))
summary(pool(fit))
#          term    estimate  std.error  statistic       df      p.value
# 1 (Intercept) 21.38468643 4.58030244  4.6688372 16.64367 0.0002323604
# 2         hyp -1.89607759 2.18239135 -0.8688073 19.00235 0.3957936019
# 3         chl  0.03942668 0.02449571  1.6095343 15.72940 0.1273825300
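What pool() does for the estimates and standard errors can be reproduced by hand with Rubin's rules. The following is a sketch under the assumption that the per-imputation fits are available in fit$analyses; the seed and the names q_bar, u_bar, b, and total are illustrative and not part of the mice API.

```r
library(mice)

# Assumption: nhanes as example data; the seed is arbitrary.
imp <- mice(nhanes, m = 5, seed = 500, printFlag = FALSE)
fit <- with(imp, lm(bmi ~ hyp + chl))

m <- length(fit$analyses)

# Coefficients and their sampling variances, one column per imputation.
est    <- sapply(fit$analyses, coef)
within <- sapply(fit$analyses, function(f) diag(vcov(f)))

q_bar <- rowMeans(est)            # pooled estimate: mean over imputations
u_bar <- rowMeans(within)         # within-imputation variance
b     <- apply(est, 1, var)       # between-imputation variance
total <- u_bar + (1 + 1 / m) * b  # total variance

manual <- data.frame(estimate = q_bar, std.error = sqrt(total))
pooled <- summary(pool(fit))

print(manual)
print(pooled[, c("term", "estimate", "std.error")])
```

The two printed tables should agree. The between-imputation term b is exactly what gets lost when a single lm() is run on the stacked data.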

If we mistakenly run OLS on the stacked data without pooling, the number of observations is blown up to five times its actual size. Hence the degrees of freedom are too large (122 here, versus roughly 16 to 19 in the pooled fit), the standard errors are underestimated, and the p-values come out too small:

summary(lm(bmi ~ hyp + chl, nhanes_imp))
# Call:
# lm(formula = bmi ~ hyp + chl, data = nhanes_imp)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -6.9010 -2.7027  0.3682  3.0993  8.4682 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 21.165549   1.794706  11.793  < 2e-16 ***
# hyp         -1.920889   0.907041  -2.118   0.0362 *  
# chl          0.040573   0.009444   4.296 3.51e-05 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 3.934 on 122 degrees of freedom
# Multiple R-squared:  0.1346,  Adjusted R-squared:  0.1205 
# F-statistic: 9.492 on 2 and 122 DF,  p-value: 0.0001475
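If only the long-format data frame is at hand, one possible way to still get pooled results is to fit the model separately within each .imp block and pass the list of fits to pool() via as.mira(). This is a sketch, again assuming nhanes and an arbitrary seed; the names fits, pooled_long, and pooled_mids are illustrative.

```r
library(mice)

# Assumption: nhanes as example data; the seed is arbitrary.
imp <- mice(nhanes, m = 5, seed = 500, printFlag = FALSE)
nhanes_long <- complete(imp, action = "long")

# Fit the same model once per imputation instead of once on the stack.
fits <- unname(lapply(
  split(nhanes_long, nhanes_long$.imp),
  function(d) lm(bmi ~ hyp + chl, data = d)
))

# Wrap the list of fits so that pool() can combine them.
pooled_long <- summary(pool(as.mira(fits)))

# Reference: the usual route through with() on the mids object.
pooled_mids <- summary(pool(with(imp, lm(bmi ~ hyp + chl))))

print(pooled_long)
print(pooled_mids)
```

Both routes should give the same pooled estimates and standard errors, since each works from the same five per-imputation fits.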
