I have about 8000 features for predicting a two-level response variable, i.e. the output can belong to class 1 or 0. The 8000 features consist of about 3000 features with 0-1 values and about 5000 features that are basically words from text data and their tf-idf scores.
I am building a linear SVM model on this to predict my output variable and am getting decent results: accuracy, recall and precision are all around 60-70%.
I am looking for help with the following:
Standardization: Do the 0-1 values need to be standardized? Do the tf-idf scores need to be standardized even if I use sublinear_tf=True?
Dimension reduction: So far I have tried f_classif with the SelectPercentile function of sklearn. Are there any other dimension reduction techniques you can suggest? I have gone through the sklearn dimension reduction page, which also covers chi2, but that isn't giving me good results. Can PCA be applied if the data is a mix of 0-1 columns and tf-idf score columns?
Remove collinearity: How can I remove highly correlated independent variables?
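For concreteness, the setup described above might look roughly like the following minimal sketch. The data here is synthetic and much smaller than the real feature set, and the variable names, the 20% percentile, and the default LinearSVC settings are assumptions for illustration, not the actual code or parameters in use.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_samples = 400

# Synthetic stand-in for the ~3000 binary (0-1) columns (assumption: 30 here).
binary = rng.integers(0, 2, size=(n_samples, 30))
X_binary = sparse.csr_matrix(binary.astype(float))

# Synthetic stand-in for the ~5000 tf-idf columns (assumption: 50 sparse
# columns with values in [0, 1), not real tf-idf scores).
X_tfidf = sparse.random(n_samples, 50, density=0.1, format="csr", random_state=0)

# Two-level response: follows the first binary column, with 20% of labels flipped.
flip = rng.random(n_samples) < 0.2
y = np.where(flip, 1 - binary[:, 0], binary[:, 0])

# Combine both feature groups into one sparse matrix.
X = sparse.hstack([X_binary, X_tfidf], format="csr")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Univariate selection with f_classif, followed by a linear SVM.
model = make_pipeline(SelectPercentile(f_classif, percentile=20), LinearSVC())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
n_kept = int(model.named_steps["selectpercentile"].get_support().sum())
print(f"kept {n_kept} of {X.shape[1]} features, test accuracy {accuracy:.2f}")
```

Putting the selection step inside the pipeline means it is fitted on the training split only, so the reported test accuracy is not influenced by the held-out labels.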
I am fairly new to python and machine learning, so any help would be appreciated.