
I'm struggling with an imputation using mice. The main objective is to impute NAs (if possible, by group). As the sample is a bit large to simply post here, it is downloadable: https://drive.google.com/open?id=1InGJ_M7r5jwQZZRdXBO1MEbKB48gafbP

My questions are:

  1. How big of an issue is correlated data in general? What can I do to still impute the data? The data is part of an empirical research question and I don't yet know which variables to include, thus it'd be best to keep as many as possible for the time being.

  2. What methods would be more suitable than "cart" & "pmm"? I'd rather not simply impute the mean/median.

  3. Can I somehow impute the data by "ID"?

  4. Any tips for debugging?

Here is my code:

# Start
require(mice)
require(Hmisc)
require(corrplot)
# setwd(...)
# test.df <- read.csv(...)
str(test.df)

Check for correlation: the first two columns contain identifiers and Year, so there is no need to look into them.

test.df.rcorr <- rcorr(as.matrix(test.df[, -c(1:2)]))
test.df.coeff <- test.df.rcorr$r
corrplot(test.df.coeff)

As can be seen, there is some strong correlation in the data. For a simple example, I omit all columns with strong correlation.

#Simple example

test.df2<-test.df[,-c(4,7,10,11)]
test.df2
sum(is.na(test.df2))

Now, let's impute test.df2 without specifying the method:

imputation.df2<-mice(test.df2, m=1, seed=123456)
imputation.df2$method
test.df2.imp<-mice::complete(imputation.df2)

Warning message:
Number of logged events: 1 


sum(is.na(test.df2.imp))

As can be seen, all the NAs are imputed, and the only method used is "pmm".
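
A quick way to sanity-check these imputations is mice's built-in lattice plots, which compare the distributions of observed and imputed values per variable:

# Observed values are drawn in blue, imputed values in red; roughly
# overlapping distributions suggest plausible imputations.
densityplot(imputation.df2)          # densities of observed vs. imputed values
stripplot(imputation.df2, pch = 20)  # individual data points per imputation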

Using the full data set, I get the following error message almost immediately:

imputation.df<-mice(test.df,m=1,seed = 66666)

 iter imp variable
  1   1  x1Error in solve.default(xtx + diag(pen)) : 
  system is computationally singular: reciprocal condition number = 1.49712e-16

Is this merely due to the correlation in the data?

Finally, my code for imputation by ID, which runs a little longer before showing this error:

test123 <- lapply(split(test.df, test.df$ID), function(x) mice::complete(mice(x, m = 1, seed = 987654)))
Error in edit.setup(data, setup, ...) : nothing left to impute
In addition: There were 19 warnings (use warnings() to see them)
Called from: edit.setup(data, setup, ...)

I know this is a long question, and I'm grateful for every little tip or hint!

Thanks a bunch!

Juan
  • Excuse my ignorance, but what do you mean by 'impute'? By definition, it means: 'to lay the responsibility or blame for (something) often falsely or unjustly'. Maybe you mean amputate, as in subsetting? – Hart Radev Nov 13 '19 at 10:33
  • Fitting values for the NAs in the data. This can be done by replacing the NAs with the data mean, for example. https://en.wikipedia.org/wiki/Imputation_(statistics) – Juan Nov 13 '19 at 10:35
  • I see, thanks for the explanation. I would suggest asking the theoretical/reasoning part of your questions on https://stats.stackexchange.com/, as they have more statistical expertise. – Hart Radev Nov 13 '19 at 10:44
  • I don't get why you'd want to exclude correlated columns. If you want to impute values, information from a strongly correlated column would appear to be most useful. Or are you referring to auto-correlation? Then I'd suggest using the Amelia package, which can include auto-correlation in the imputation model. – Roland Nov 13 '19 at 11:14
  • The reason for excluding is merely derived from the error I am getting, and I read that it might be an issue for mice. I did not look into Amelia yet; will do so right now! – Juan Nov 13 '19 at 11:17
  • You might want to try single imputation packages (if you don't seem to need multiple imputed values anyway). They are often way easier to use, e.g. look at the packages missForest, VIM, imputeR. – Steffen Moritz Nov 15 '19 at 17:02
  • Another comment: the problem is indeed related to the strongly correlated variables. See also https://stats.stackexchange.com/questions/76488/error-system-is-computationally-singular-when-running-a-glm. This is only an issue with the default algorithm you are using mice with (glm). If you want to continue to use mice, you can also just set the method parameter to another algorithm. – Steffen Moritz Nov 15 '19 at 17:06
  • @stats0007, while single imputation is easier to use, it usually produces downward-biased standard errors (good for parameter estimates, bad for hypothesis tests or other analyses that use the SE/variance). Graham (2009) has a nice article, "Missing Data Analysis: Making It Work in the Real World", that discusses the advantages and disadvantages of multiple and single imputation. – Niek Nov 21 '19 at 14:44

1 Answer


I think the problem arises because you are dealing with longitudinal data and mice is treating the observations as independent. Longitudinal data is clustered by ID and one way to deal with this is by using a multilevel (i.e. mixed) model as your imputation model. mice has numerous options to deal with this kind of data, which you can specify in your predictor matrix and imputation method.

library(mice)
setwd("X:/My Downloads")

test.df <- read.csv("Impute.csv")

You need to specify that ID is your grouping or class variable. Unfortunately mice can only handle integer values for this variable, so you need to change it to an integer (you can always change this back after imputation).

test.df$ID <- as.integer(test.df$ID)

You can get your predictor matrix and imputation method easily with a dry run of mice (i.e. imputation with 0 iterations).

ini <- mice(test.df, maxit = 0)

pred1 <- ini$predictorMatrix
pred1[, "ID"] <- -2   # set ID as the class (grouping) variable for the 2l.* methods
pred1[, "year"] <- 2  # set year as a random effect; slopes differ between individuals

A value of 1 in the predictor matrix indicates that the column variable is used as a fixed-effect predictor to impute the target (row) variable, and a 0 means that it is not used. A -2 indicates that the variable is a class variable (your ID), and a value of 2 indicates that the variable is used as a random effect.

For the details you need to read up on multilevel modeling, but basically you can use year as a fixed effect to specify that each individual shows the same general growth (the same effect of year, for each individual, on any other variable), or as a random effect to model the more complicated assumption that individuals differ in growth. You can look at your data to see if the simple model fits your observed data sufficiently well or if a more complicated model is necessary (i.e. do individuals grow at roughly the same rate or not).
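
To see what these settings produce, you can print a row of the predictor matrix. The sketch below assumes the data columns are named x1 through x11 (as in the output further down):

pred1["x1", ]  # inspect the row for one target variable
# e.g.  ID  year  x1  x2 ... x11
#       -2     2   0   1 ...   1
# -2 = class variable, 2 = random effect, 1 = fixed-effect predictor,
#  0 = not used (a variable never predicts itself)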

Next, change your method to a mixed model. You have two general options: 2l.pan assumes the variance is homogeneous within class, while 2l.norm allows heterogeneous variance. Again, you need to read up and check your data (e.g. run a mixed model and see if the residuals are roughly homogeneous). 2l.pan is the simpler model.

https://www.rdocumentation.org/packages/mice/versions/3.6.0/topics/mice.impute.2l.pan
https://www.rdocumentation.org/packages/mice/versions/3.6.0/topics/mice.impute.2l.norm
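
One way to eyeball that homogeneity assumption is to fit a quick mixed model on the complete cases and look at the residual spread per group. The sketch below assumes the lme4 package and uses x1 and x2 as placeholder variable names:

library(lme4)
cc <- na.omit(test.df)                      # complete cases, just for this check
fit <- lmer(x1 ~ x2 + year + (1 | ID), data = cc)
plot(fit)                                   # residuals vs. fitted values
boxplot(resid(fit) ~ cc$ID)                 # residual spread per ID;
                                            # similar spreads favour 2l.pan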

# 2l.norm: mixed model with heterogeneous within-group variance;
# 2l.pan: homogeneous within-group variance
# Work on the method vector
meth1 <- ini$method
meth1[which(meth1 == "pmm")] <- "2l.pan"

imputation.df <- mice(test.df, m = 5, seed = 66666, method = meth1, predictorMatrix = pred1)

The higher correlation between observations within an individual is taken into account with this method. Total variance is split into variance at the ID or person level and variance at the year or observation level.
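
If you want to quantify that split, the intraclass correlation from a simple random-intercept model gives the share of total variance sitting at the ID level. Again a sketch assuming lme4, with x1 as a placeholder outcome:

# Intraclass correlation: ID-level variance / total variance
vc <- as.data.frame(VarCorr(lmer(x1 ~ 1 + (1 | ID), data = na.omit(test.df))))
icc <- vc$vcov[1] / sum(vc$vcov)
icc  # near 0: observations nearly independent; near 1: strongly clustered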

Notice that I also changed the number of datasets from m = 1 to m = 5. mice is meant for computing multiple imputations, resulting in multiple datasets. Each dataset will be slightly different, and the variance between imputations is used to reflect uncertainty about the true value underlying the missing data. If you only impute one dataset, you don't get this advantage.
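
The standard workflow for using those five datasets is to fit your analysis model in each one and pool the results with Rubin's rules. A minimal sketch with a placeholder analysis model (x1 regressed on x2):

fits <- with(imputation.df, lm(x1 ~ x2))  # fit the model in each imputed dataset
pooled <- pool(fits)                      # pool estimates across imputations
summary(pooled)                           # pooled coefficients and standard errors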

Since the imputation models are more complicated, they take longer to run, but the error no longer occurs and your imputation method represents your data structure better (hopefully leading to more accurate imputations).

 iter imp variable
  1   1  x1  x2  x3  x4  x5  x6  x7  x8  x9  x10  x11
  1   2  x1  x2  x3  x4  x5  x6  x7  x8  x9  x10  x11
  1   3  x1  x2  x3  x4  x5  x6  x7  x8  x9  x10  x11
  1   4  x1  x2  x3  x4  x5  x6  x7  x8  x9  x10  x11
  1   5  x1  x2  x3  x4  x5  x6  x7  x8  x9  x10  x11
  2   1  x1  x2  x3  x4  x5  x6  x7  x8  x9  x10  x11
  2   2  x1  x2  x3  x4  x5

For multilevel modelling I'd suggest the book Multilevel Analysis by Snijders and Bosker. The mice manual also contains some information: https://www.jstatsoft.org/article/view/v045i03

Niek
  • Hey Niek, nice answer. @Juan got the error from the ID and year. But good research to actually include that in mice. – StupidWolf Nov 20 '19 at 16:27
  • @Niek, will take a look at your answer in a bit. Thanks! – Juan Nov 20 '19 at 16:46
  • @StupidWolf thx :). Makes sense that the error disappears when you remove the variables that describe the longitudinal structure of the data. Treating the observations as independent would yield downward-biased standard errors in the imputation process and any subsequent analyses (see McCoach & Adleson, 2010). This problem is exacerbated when doing single imputation instead of multiple imputation (see Graham, 2009). The main problem is that both approaches will lead to inflated type 1 error rates in subsequent hypothesis tests. – Niek Nov 21 '19 at 14:40
  • @Juan, final sidenote :), since you have very few missing observations (at worst 1.3% in x11), maybe deletion is not such a bad idea. It is easier than imputation, and bias and power issues are probably not that bad. I definitely think using multilevel multiple imputation would give the best estimates, but it might be a bit overkill for the problem you're trying to solve. – Niek Nov 21 '19 at 15:19