I have a data set with six variables and about 5000 observations (rows). The variables are:
year, day_of_week, age, gender, race, state
Variables year and age are integers; the rest are factors. The data has some missing values. To replace them, I'm using mice as follows:
library(mice)
impData <- mice(obs_subset, m = 3)
The call to mice() starts running and prints status output until it reaches the state variable, then fails with:
Error in augment(y, ry, x, wy) :
Maximum number of categories (50) exceeded
Based on my research, it looks as though the state variable has too many possible values for the mice function to accommodate. Sure enough, there are 51 possible values (50 states and PR). If I exclude state, mice() runs fine.
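For reference, the level counts can be checked with something like the sketch below. It uses a hypothetical toy data frame (toy) in place of obs_subset, since the real data isn't shown here, and it assumes the 51 state levels are the 50 abbreviations in R's built-in state.abb plus "PR". The names toy, n_levels and over_limit are made up for illustration.

```r
# Hypothetical stand-in for obs_subset; the real data is not shown.
set.seed(1)
n <- 200
state_levels  <- c(state.abb, "PR")        # 50 states + PR = 51 levels
gender_levels <- c("Male", "Female", "NB")

toy <- data.frame(
  year   = sample(2010:2020, n, replace = TRUE),
  gender = factor(sample(gender_levels, n, replace = TRUE),
                  levels = gender_levels),
  state  = factor(sample(state_levels, n, replace = TRUE),
                  levels = state_levels)
)

# Number of levels per factor column (NA for non-factor columns)
n_levels <- sapply(toy, function(x) {
  if (is.factor(x)) nlevels(x) else NA_integer_
})
print(n_levels)

# Factor columns above the 50-category limit named in the error message
over_limit <- names(n_levels)[!is.na(n_levels) & n_levels > 50]
print(over_limit)
```

On this toy data only state is flagged, which matches what I see with the real data: every other factor is well under the limit.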
My question is: how do I impute categorical data with high cardinality? Variables with low cardinality, such as Male/Female/NB or Mon/Tue/Wed, are easy enough. So what's the strategy for imputing when the number of possible values goes up, maybe way up? Thank you.