
I performed multiple imputation with the Amelia package using the following code:

binary <- c("Gender", "Diabetes")
exclude.from.IMPUTATION <- c("Serial.ID")
NPvars <- c("age", "HDEF", "BMI")  # skewed (non-parametric) variables

a.out <- Amelia::amelia(x = for.imp.data, m = 10,
                        idvars = exclude.from.IMPUTATION,
                        noms = binary, logs = NPvars)
summary(a.out)

## save imputed datasets ##
Amelia::write.amelia(obj=a.out, file.stem = "impdata", format = "csv")

This produced 10 separate output CSV files.


I know that I can use any one of them for descriptive analysis, as shown in prior questions, but:

  1. Why should we do MULTIPLE imputation if we are only going to use a SINGLE one of these files?

  2. Some authors report using Rubin's Rules to summarize results across imputations; please advise on how to do that.


Mohamed Rahouma
    Your different data sets express the uncertainty of the imputation. You shouldn't use just one of them, that would be wrong. You need to pool your regressions by taking the within and between variances into account. Write code using Rubin's rules which you can find in: _Rubin, Donald B. 1987. Multiple Imputation for Nonresponse in Surveys. New York: Wiley._ on page 76. If you like it automated, in the `mice` package the process is implemented for `lm()` by the [`mice::pool()` function](https://stackoverflow.com/a/54273071/6574038). (Maybe it's also implemented in `Amelia` - I don't know.) – jay.sf Aug 29 '19 at 13:41
  • This question seems to be more about statistics than programming. Such questions are better asked at [stat.se] where statistical questions are on topic. – MrFlick Aug 29 '19 at 16:14

1 Answer


You do not use just one of these datasets. As you correctly suspected, that would render the whole process of multiple imputation useless.

As jay.sf said, the different datasets express the uncertainty of the imputation. The missing data is ultimately lost; we can only estimate what the real data might have looked like. With multiple imputation we generate several such estimates, which together allow statements like: the missing value most likely lies between ... and ... .

When you generate descriptive statistics, you generate them for each imputed dataset separately. Taking the mean as an example, you could report the lowest and the highest mean across the imputed datasets as additional information, or the mean of these means together with their standard deviation. This way your readers know how much uncertainty comes with the imputation.
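For example, a minimal sketch of this idea, assuming the imputed datasets are still available in a.out$imputations and using the BMI variable from your question as a stand-in for any numeric column:

imp.means <- sapply(a.out$imputations, function(d) mean(d$BMI))  # one mean per imputed dataset
range(imp.means)   # lowest and highest mean across the imputations
mean(imp.means)    # mean of the means
sd(imp.means)      # spread of the means = imputation uncertainty for this statistic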

You can also use your imputed datasets to quantify the uncertainty in the output of linear models. You do this by applying Rubin's Rules (RR) to pool parameter estimates such as mean differences, regression coefficients and standard errors, and to derive confidence intervals and p-values (see also https://bookdown.org/mwheymans/bookmi/rubins-rules.html).
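In Amelia itself, Rubin's Rules for combining point estimates and standard errors are implemented in Amelia::mi.meld(). A minimal sketch, fitting the same model on every imputed dataset (the formula HDEF ~ age + BMI is just a placeholder built from your variable names):

b.out <- NULL
se.out <- NULL
for (i in 1:a.out$m) {
  fit <- lm(HDEF ~ age + BMI, data = a.out$imputations[[i]])
  b.out <- rbind(b.out, coef(fit))                             # coefficient estimates
  se.out <- rbind(se.out, coef(summary(fit))[, "Std. Error"])  # their standard errors
}
pooled <- Amelia::mi.meld(q = b.out, se = se.out)  # combine across imputations with Rubin's Rules
pooled$q.mi   # pooled coefficients
pooled$se.mi  # pooled standard errors

Alternatively, as jay.sf noted in the comments, the mice package automates this pooling for lm() fits via mice::pool().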

Steffen Moritz