
I have a data set with six variables and about 5000 observations (rows). The variables are:

year, day_of_week, age, gender, race, state

Variables year and age are integers; the rest are factors. This data has some missing values. To replace them, I'm using mice as follows:

  library(mice)
  impData <- mice(obs_subset, m = 3)

The call to mice() starts running and prints its progress until it reaches the state variable, then fails with:

Error in augment(y, ry, x, wy) : 
  Maximum number of categories (50) exceeded

Based on my research, it looks as though there are too many possible values for the state variable for the mice function to accommodate. Sure enough, there are 51 possible values (50 states and PR). If I exclude state, mice() runs fine.

My question is: how do I impute categorical data with high cardinality? Variables with low cardinality such as Male/Female/NB or Mon/Tue/Wed are easier. So what's the strategy for imputing when the number of possible values goes up, maybe way up? Thank you.

JRomeo
  • Did you try different imputation methods for `state`? I guess the default for categorical data with more than two values is `polyreg`. Maybe random forest (`rf`) or classification trees (`cart`) will work? They are time-consuming, though. – benimwolfspelz Nov 23 '20 at 14:16
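
The suggestion in this comment could be tried roughly as follows. This is a minimal sketch under stated assumptions, not a verified fix for the original data: `toy` is a made-up stand-in for `obs_subset` (which is not shown in the question), `state_levels` is a hypothetical 51-level set built from R's built-in `state.abb` plus "PR", and the seed values are arbitrary. It uses `make.method()` to get mice's default method for each column and then switches only `state` to `cart`, so the other variables keep their defaults.

```r
library(mice)

set.seed(1)
n <- 1000

# Hypothetical 51-level factor: the 50 state abbreviations plus PR.
state_levels <- c(state.abb, "PR")

# Made-up stand-in for obs_subset; the real data is not shown in the question.
toy <- data.frame(
  year   = sample(2015:2020, n, replace = TRUE),
  age    = sample(18:90, n, replace = TRUE),
  gender = factor(sample(c("F", "M", "NB"), n, replace = TRUE)),
  state  = factor(sample(rep(state_levels, length.out = n)),
                  levels = state_levels)
)

# Knock out some values so there is something to impute.
toy$state[sample(n, 100)] <- NA
toy$age[sample(n, 50)] <- NA

# Start from the default method per column, then override only state.
meth <- make.method(toy)
meth["state"] <- "cart"   # classification tree instead of the polyreg default

imp <- mice(toy, m = 3, method = meth, printFlag = FALSE, seed = 123)

# Extract the first completed data set for inspection.
completed <- complete(imp, 1)
```

Swapping `"cart"` for `"rf"` follows the same pattern, but it needs an additional random forest package installed. Either way, the imputed `state` values are worth checking against the observed distribution before relying on them.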
