This is correct! First, you used mice(., m=5) (the default) to impute your data set five times. Using complete(., action='long'), you combined all five imputations in long format. To tell the individual imputations apart, two variables are added: .imp, which identifies the imputation, and .id, which holds the initial row names.
library(mice)
imp <- mice(nhanes, m=5)  # m=5 is the default; the output below shows five imputations
nhanes_imp <- complete(imp, action='long')
nhanes_imp
# .imp .id age bmi hyp chl
# 1 1 1 1 29.6 1 187
# 2 1 2 2 22.7 1 187
# 3 1 3 1 29.6 1 187
# [...]
# 26 2 1 1 22.7 1 118
# 27 2 2 2 22.7 1 187
# 28 2 3 1 30.1 1 187
# [...]
# 51 3 1 1 27.2 1 131
# 52 3 2 2 22.7 1 187
# 53 3 3 1 24.9 1 187
# [...]
# 76 4 1 1 22.0 1 113
# 77 4 2 2 22.7 1 187
# 78 4 3 1 22.0 1 187
# [...]
# 101 5 1 1 35.3 1 187
# 102 5 2 2 22.7 1 187
# 103 5 3 1 35.3 1 187
# [...]
Naturally, your imputed data set has five times as many rows as your initial one.
nrow(nhanes_imp) / nrow(nhanes)
# [1] 5
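To make the role of .imp concrete, here is a minimal sketch that counts the rows per imputation and pulls out a single completed data set. The seed and printFlag=FALSE are my additions for reproducibility and quiet output, and the object names rows_per_imp and second are only illustrative.

```r
library(mice)

# Assumption: seed and printFlag are added here only for reproducibility
imp <- mice(nhanes, m = 5, printFlag = FALSE, seed = 123)
nhanes_imp <- complete(imp, action = 'long')

# Each value of .imp should cover one full copy of the original rows
rows_per_imp <- table(nhanes_imp$.imp)
rows_per_imp

# Extract one completed data set, here the second imputation
second <- subset(nhanes_imp, .imp == 2)
nrow(second)
```

Each imputation block has as many rows as nhanes, which is why the long data set is five times as large.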
You should never use complete without action='long', since by default it returns only the first imputed data set (see my older answer there).
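A quick check of that default behaviour, as a sketch (the seed and printFlag=FALSE are again my additions, and first_only is an illustrative name):

```r
library(mice)

# Assumption: seed and printFlag are added here only for reproducibility
imp <- mice(nhanes, m = 5, printFlag = FALSE, seed = 123)

# Without an action argument, complete() hands back a single data set
first_only <- complete(imp)
nrow(first_only)

# It is the same object as explicitly requesting the first imputation
identical(first_only, complete(imp, action = 1))
```

The other four imputations are silently ignored in that case, so any analysis on first_only discards the between-imputation variability.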
Continue by pooling your calculations. For instance, for OLS you may use the pool() function, which comes with mice. It fits lm on each of the five imputed data sets and combines the results using Rubin's rules, so the pooled standard errors also reflect the variation between imputations.
fit <- with(data=imp, expr=lm(bmi ~ hyp + chl))
summary(pool(fit))
# term estimate std.error statistic df p.value
# 1 (Intercept) 21.38468643 4.58030244 4.6688372 16.64367 0.0002323604
# 2 hyp -1.89607759 2.18239135 -0.8688073 19.00235 0.3957936019
# 3 chl 0.03942668 0.02449571 1.6095343 15.72940 0.1273825300
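To see roughly what pool() does for the estimates and standard errors, here is a hand-rolled sketch of Rubin's rules applied to the individual fits. It is not the mice implementation, only the textbook formulas; the seed, printFlag=FALSE, and all helper names (est, u, qbar, ubar, b, total, manual) are my own, so your numbers will differ from the output above.

```r
library(mice)

# Assumption: seed and printFlag are added here only for reproducibility
imp <- mice(nhanes, m = 5, printFlag = FALSE, seed = 123)
fit <- with(data = imp, expr = lm(bmi ~ hyp + chl))

m   <- length(fit$analyses)                               # number of imputations
est <- sapply(fit$analyses, coef)                         # coefficients, one column per fit
u   <- sapply(fit$analyses, function(f) diag(vcov(f)))    # squared standard errors per fit

qbar  <- rowMeans(est)            # pooled estimate: mean over imputations
ubar  <- rowMeans(u)              # within-imputation variance
b     <- apply(est, 1, var)       # between-imputation variance
total <- ubar + (1 + 1/m) * b     # total variance

manual <- data.frame(estimate = qbar, std.error = sqrt(total))
pooled <- summary(pool(fit))

manual
pooled[, c("term", "estimate", "std.error")]
```

The two tables should agree. The between-imputation term b is exactly what gets lost when the stacked data are treated as one large sample.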
If we mistakenly run OLS on the stacked data without pooling, the number of observations is blown up to five times its actual size. Hence the degrees of freedom are too large, and the variance and all statistics depending on it are underestimated:
summary(lm(bmi ~ hyp + chl, nhanes_imp))
# Call:
# lm(formula = bmi ~ hyp + chl, data = nhanes_imp)
#
# Residuals:
# Min 1Q Median 3Q Max
# -6.9010 -2.7027 0.3682 3.0993 8.4682
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 21.165549 1.794706 11.793 < 2e-16 ***
# hyp -1.920889 0.907041 -2.118 0.0362 *
# chl 0.040573 0.009444 4.296 3.51e-05 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 3.934 on 122 degrees of freedom
# Multiple R-squared: 0.1346, Adjusted R-squared: 0.1205
# F-statistic: 9.492 on 2 and 122 DF, p-value: 0.0001475
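For a side-by-side view, the following sketch puts the pooled and the naive standard errors in one table. The seed, printFlag=FALSE, and the names pooled, naive, and comparison are my additions, so the figures will not match the output above exactly.

```r
library(mice)

# Assumption: seed and printFlag are added here only for reproducibility
imp <- mice(nhanes, m = 5, printFlag = FALSE, seed = 123)

# Correct: fit per imputation, then pool
pooled <- summary(pool(with(data = imp, expr = lm(bmi ~ hyp + chl))))

# Incorrect: one fit on the stacked long data
naive <- summary(lm(bmi ~ hyp + chl, data = complete(imp, action = 'long')))

comparison <- data.frame(
  term      = pooled$term,
  se_pooled = pooled$std.error,
  se_naive  = coef(naive)[, "Std. Error"]
)
comparison

# Residual degrees of freedom of the naive fit: 5 * 25 rows minus 3 coefficients
naive$df[2]
```

The naive residual degrees of freedom come from 125 stacked rows rather than the 25 actually observed, which is the source of the overly small standard errors.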