I have categorical answers collected from job applications, which I transformed into dummy variables. In total I have around 60 columns and 2000 rows, plus a target column that describes the status (hired, rejected, hired by another company...).

My goal is to output a score that is high for hired people and low for rejected people, based on these variables.

I first want to try a clustering method to see whether it helps structure the data better.

Do we need to standardize binary variables? What if there is a mix of binary and continuous variables? I first tried PCA, but the principal components don't explain much variability (around 3-4%). What can I deduce from this?

I then thought about applying a clustering method to identify similar profiles from the answers. From what I've read, k-means is not an appropriate method for binary variables. I would like to know whether hierarchical clustering with Hamming or Jaccard distances would be appropriate for the kind of data I have.
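To make the clustering idea concrete, here is a minimal sketch of what I have in mind, using SciPy. The data is a random stand-in with the same shape as mine, and the average linkage and the cut at 5 clusters are arbitrary assumptions on my part, not choices I have validated.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Synthetic stand-in for my data: 2000 rows x 60 binary dummy columns.
X = rng.integers(0, 2, size=(2000, 60)).astype(bool)

# Condensed vector of pairwise Hamming distances
# (metric="jaccard" is the alternative I am considering).
dist = pdist(X, metric="hamming")

# Average linkage accepts arbitrary dissimilarities;
# Ward linkage is only well defined for Euclidean distances.
Z = linkage(dist, method="average")

# Cut the dendrogram into at most 5 clusters (5 is an arbitrary choice).
labels = fcluster(Z, t=5, criterion="maxclust")
```

Swapping `"hamming"` for `"jaccard"` changes only the `pdist` call, so both distances can be compared on the same pipeline.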

Finally, to achieve my original goal, would linear regression be a good solution?
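For reference, this is roughly what I mean by using linear regression as a score. It is only a sketch on random stand-in data, and it assumes I keep just the hired and rejected rows and code them as 1 and 0, which drops the other statuses.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in: 2000 rows x 60 dummy columns.
X = rng.integers(0, 2, size=(2000, 60)).astype(float)
# Assumed target coding: 1 = hired, 0 = rejected (other statuses dropped).
y = rng.integers(0, 2, size=2000).astype(float)

# Add an intercept column and fit ordinary least squares.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# The fitted value is the score; it is not constrained to lie in [0, 1].
score = A @ coef
```

A higher `score` would mean "more like the hired group", but since the fitted values are unbounded I am unsure whether this is a sensible way to build the score.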