I've got a survey in which each respondent is asked a random 8 of 20 policy questions. I want to use MDS (multidimensional scaling) to reduce these questions to a single dimension that captures each respondent's ideology. However, because each person answers only a few questions, I can't build a dissimilarity matrix between respondents: very few were asked the same 8 questions. I also can't remove rows with NA, because every row has 12 NAs. I see two options:
1. Fit a regression for each of the 20 questions, using the questions asked of every participant (age, gender, etc.) as predictors, and impute the NAs from those models.
2. Use some sort of MDS method that doesn't require a complete matrix.
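To see how much overlap the design leaves for option 2, here is a minimal sketch on simulated data. The matrix `q_mat`, the sample size, and the coin-flip answers are all assumptions for illustration, not the real survey. It relies on the documented behaviour of base R's `dist()`, which skips NA columns pair by pair, rescales the sum, and returns NA only for pairs with no shared column.

```r
set.seed(42)
n <- 100            # hypothetical number of respondents
n_q <- 20           # policy questions in total
n_asked <- 8        # questions shown to each respondent

# Simulated 0/1 answers; each row gets a random 8 of the 20 questions
q_mat <- matrix(NA_real_, nrow = n, ncol = n_q)
for (i in seq_len(n)) {
  q_mat[i, sample(n_q, n_asked)] <- rbinom(n_asked, 1, 0.5)
}

# shared[i, j] = number of questions answered by both respondent i and j
shared <- tcrossprod(!is.na(q_mat))

# Manhattan distance over the shared questions only. dist() scales the sum
# up to all 20 columns, so dividing by 20 gives the share of co-answered
# questions on which two respondents disagree.
d_prop <- as.matrix(dist(q_mat, method = "manhattan")) / n_q

# Fraction of respondent pairs with no shared question (distance is NA)
mean(is.na(d_prop[upper.tri(d_prop)]))
```

`cmdscale()` does not accept NA entries, so any pairs with no shared question would still have to be handled before scaling.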
So far, I've been working with the first option, but the fitted models aren't always good. Since the policy questions are yes/no, I fit a binomial glm for each question:
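# Impute the NAs in one 0/1 question vector x with predicted probabilities
complete <- function(x) {
  q_and_predictors <- data.frame(question = x, predictors)
  # With the default na.action, glm() drops rows where question is NA
  logistic_reg <- glm(question ~ ., data = q_and_predictors, family = "binomial")
  # type = "response" returns probabilities directly (inverse logit of the link)
  predictions <- predict(logistic_reg, newdata = predictors, type = "response")
  # Observed answers stay 0/1; missing ones become fitted probabilities
  ifelse(is.na(x), predictions, x)
}
# apply() coerces questions to a matrix, so the result is a matrix
complete_questions <- apply(questions, 2, complete)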
The questions dataframe contains all the policy questions, and the predictors dataframe contains all non-policy questions.
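For anyone who wants to reproduce the setup, here is a toy version of the two data frames. Everything in it is assumed for illustration: the sample size, the two predictors, the column names `q1` to `q20`, and the purely random answers. Only the shape (20 yes/no questions, 8 asked per respondent) follows the description above.

```r
set.seed(1)
n <- 200  # hypothetical number of respondents

# Hypothetical non-policy questions asked of every participant
predictors <- data.frame(
  age    = sample(18:80, n, replace = TRUE),
  gender = factor(sample(c("F", "M"), n, replace = TRUE))
)

# 20 yes/no policy questions coded 0/1; each respondent sees a random 8
q_mat <- matrix(NA_integer_, nrow = n, ncol = 20,
                dimnames = list(NULL, paste0("q", 1:20)))
for (i in seq_len(n)) {
  asked <- sample(20, 8)
  q_mat[i, asked] <- rbinom(8, 1, 0.5)
}
questions <- as.data.frame(q_mat)

# Count of NAs per row, tabulated across respondents
table(rowSums(is.na(questions)))
```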
I computed McFadden's R^2 for each logistic model; some were very good (> 0.35), but others were poor (< 0.1). Ideally, I'd like to find a way to either impute the missing values with greater accuracy, or use an MDS algorithm that works with missing values.
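For reference, one common way to get McFadden's R^2 from a fitted binomial glm is sketched below. The helper name `mcfadden_r2` and the toy data are assumptions for illustration. The deviance shortcut relies on the outcome being ungrouped 0/1, where the saturated log-likelihood is zero, so the deviance equals minus twice the log-likelihood.

```r
set.seed(1)
n <- 500
# Toy data: one predictor and a 0/1 answer that depends on it
toy <- data.frame(age = rnorm(n))
toy$answer <- rbinom(n, 1, plogis(1.5 * toy$age))

fit <- glm(answer ~ age, data = toy, family = binomial)

# McFadden's R^2 = 1 - logLik(model) / logLik(intercept-only model).
# For ungrouped 0/1 outcomes this equals 1 - deviance / null deviance.
mcfadden_r2 <- function(model) {
  1 - model$deviance / model$null.deviance
}
r2 <- mcfadden_r2(fit)

# Cross-check against the log-likelihood definition
null_fit <- glm(answer ~ 1, data = toy, family = binomial)
r2_ll <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null_fit))
```

Because `glm()` stores both deviances for the rows it actually used, the helper works even when the question has NAs that were dropped during fitting.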