
I have a data set that I cannot post for confidentiality reasons. I am attempting to impute missing values using aregImpute in Hmisc. However, I get an error like this:

Error in aregImpute(): a bootstrap resample had too few unique values of the following variables:... "VARIABLE" has the following levels with < 5 observations: ... Consider using the group parameter to balance bootstrap samples

Can someone provide an example of aregImpute that uses the group argument, so I can see how it is meant to be used?

I don't really follow the documentation: https://www.rdocumentation.org/packages/Hmisc/versions/4.1-0/topics/aregImpute

sm925
sma

1 Answer


You should be able to say group=data$variablename where variablename is the name of one of the variables that is discrete. But in many cases it is better to collapse categories at the outset.
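A minimal sketch of what this might look like, using a made-up data frame `d` with one rare factor level (all variable names and counts here are illustrative, not from the original question):

```r
library(Hmisc)

set.seed(1)
# Hypothetical data: a numeric outcome, a numeric predictor with missing
# values, and a factor whose level "C" is rare enough to trigger the error.
d <- data.frame(
  y    = rnorm(200),
  x1   = rnorm(200),
  race = factor(sample(c("A", "B", "C"), 200, replace = TRUE,
                       prob = c(0.48, 0.48, 0.04)))
)
d$x1[sample(200, 30)] <- NA   # introduce missingness in x1

# group takes a vector the same length as the number of rows; bootstrap
# resamples are then balanced so each resample contains all levels of it.
imp <- aregImpute(~ y + x1 + race, data = d, n.impute = 5,
                  group = d$race)
```

The key point is that `group` is a single vector (one value per row), not a list of variable names.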

Frank Harrell
  • This does not work. How can I change this for multiple variables (as I have multiple variables that return this error)? Also, when I collapse the categories for all variables with > 3 levels, I go from 288 to 800 covariates. Because I only have 700 samples, I feel this isn't a good idea. From a machine learning perspective (e.g. using linear regression or decision trees with this data to predict an outcome), is it OK to collapse the variables when the sample-to-covariate ratio is so skewed? – sma Apr 03 '18 at 19:29
  • What is the number of predictors, sample size, distribution of the outcome variable you are predicting, and total number of candidate parameters for predicting the outcome? And note that collapsing levels doesn't mean removing the covariate altogether. The `group` method only works for one variable or an `interaction()` of a very few variables. – Frank Harrell Apr 04 '18 at 12:06
  • I thought that collapsing meant taking the discrete variables and decomposing them into binary indicators for each level. For example, a variable X that takes values 1-7 would be collapsed into X1, X2, ..., X7, indicating whether or not X takes each value. Collapsing like that yields ~800 covariates. I actually have 288 predictors, 780 samples, not sure about the distribution of the outcome variable (continuous value from 0-10), and technically 288 candidate predictors. Sorry if I was unable to answer the questions well, I am rather new to this. – sma Apr 04 '18 at 17:29
  • I have this error occur for several variables. Should I remove the variables that, based on knowledge from domain experts, do not seem like strong predictors? I began removing the variables with this error one by one just to see if aregImpute runs, but keep finding more variables that produce it. – sma Apr 04 '18 at 17:33
  • Collapsing means collapsing infrequent levels. This won't necessarily result in binary variables. Bigger picture: your sample size is too small by a factor > 5 for your approach to work. I would give up on this problem or use heavy unsupervised learning (data reduction blinded to Y) first. – Frank Harrell Apr 04 '18 at 20:04
  • Unfamiliar with PCA/SVD. Do I use the principal component scores as my new training data, or a reconstruction of the data from the principal components, to predict my outcome? – sma Apr 04 '18 at 20:36
  • Also, I am confused about how to interpret a reduced data set. The values are projected down into a smaller-dimensional subspace; if you were to run a model on that reduced data set, how would interpretation of the data and results be affected? – sma Apr 04 '18 at 20:52
  • Just as with factor analysis, you'd be interpreting effects of concepts or themes. Often data reduction is the only way to bring stability to an analysis, and the stability enhances rather than hurts interpretation. – Frank Harrell Apr 05 '18 at 12:56
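The data-reduction idea from the comments above can be sketched in base R: fit PCA on the predictors alone (blinded to Y), then use the leading component scores, not a reconstruction, as the new training data. The counts and the number of retained components here are illustrative stand-ins for the real data:

```r
set.seed(1)
n <- 780; p <- 288                 # sample/predictor counts from the thread
X <- matrix(rnorm(n * p), n, p)    # stand-in for the real predictor matrix
y <- rnorm(n)                      # stand-in outcome (continuous 0-10 in practice)

# PCA on the predictors only -- the outcome y is never used here.
pc <- prcomp(X, center = TRUE, scale. = TRUE)

k <- 20                            # number of components to keep (a choice to tune)
scores <- pc$x[, 1:k]              # component scores = the reduced training data

# Model the outcome on the component scores; effects are then interpreted
# at the level of components ("concepts or themes"), per the last comment.
fit <- lm(y ~ scores)
```

Each coefficient now describes the effect of a component rather than a single original covariate, which is the interpretation shift discussed in the final two comments.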