I have about 8000 features and a two-level response variable, i.e. the output belongs to class 1 or 0. The 8000 features consist of about 3000 features with 0-1 values and about 5000 features that are words from text data with their tf-idf scores.

I am building a linear SVM model on this to predict my output variable and am getting decent results: accuracy, recall and precision are all around 60-70%.

I am looking for help with the following:

  1. Standardization: do the 0-1 values need to be standardized? Do tf-idf scores need to be standardized even if I use sublinear_tf=True?

  2. Dimension reduction: so far I have tried f_classif with sklearn's SelectPercentile. Can any other dimension reduction techniques be suggested? I have gone through the sklearn page on dimension reduction, which also covers chi2, but that isn't giving me good results. Can PCA be applied if the data is a mix of 0-1 columns and tf-idf score columns?

  3. Remove collinearity: how can I remove highly correlated independent variables?

I am fairly new to Python and machine learning, so any help would be appreciated.

Martin Evans

1 Answer

(edited to include additional questions)

1 - I would centre and scale your variables for a linear model. I don't know whether it's strictly necessary for SVMs but, if I recall correctly, distance-based models tend to do better when the variables are on similar ranges. I don't think there's any harm in doing it anyway (vs. unscaled/uncentred). Someone may correct me, as I don't do much by way of text analysis.
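
One caveat if the feature matrix is sparse, as tf-idf output usually is: centring turns every zero into a non-zero value, and sklearn's StandardScaler refuses to centre sparse input. A rough sketch of scaling without centring is below. The data is synthetic and the column counts are made up for illustration; MaxAbsScaler is one option here, not the only one.

```python
import numpy as np
from scipy import sparse
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (hypothetical sizes):
# 30 binary columns plus 50 non-negative "tf-idf-like" columns.
X_binary = sparse.csr_matrix(rng.integers(0, 2, size=(200, 30)).astype(float))
X_tfidf = sparse.random(200, 50, density=0.1, format="csr", random_state=0)
X = sparse.hstack([X_binary, X_tfidf], format="csr")
y = rng.integers(0, 2, size=200)

# MaxAbsScaler divides each column by its largest absolute value.
# It does not centre, so zeros stay zeros and the matrix stays sparse.
model = make_pipeline(MaxAbsScaler(), LinearSVC(C=1.0, max_iter=5000))
model.fit(X, y)

# Inspect the scaled features produced by the fitted scaler step.
X_scaled = model.named_steps["maxabsscaler"].transform(X)
predictions = model.predict(X)
```

Putting the scaler inside the pipeline means it is fitted on the training folds only when the pipeline is cross-validated.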

2 - (original answer) Could you try fitting a random forest model, then inspecting the feature importance scores and discarding the features with low importance? With so many features I'd worry about memory issues, but if your machine can handle it...?
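
In sklearn, the random forest idea could look roughly like the sketch below. The toy data, the number of trees and the "median" cut-off are all assumptions chosen for illustration; SelectFromModel simply keeps the features whose importance is at or above the threshold.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)

# Toy data: 40 features, of which only the first three carry signal.
X = rng.random((300, 40))
y = (X[:, 0] + X[:, 1] + X[:, 2] > 1.5).astype(int)

forest = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)

# Keep features whose importance is at least the median importance,
# i.e. discard roughly the less important half.
selector = SelectFromModel(forest, threshold="median")
X_reduced = selector.fit_transform(X, y)

# Column indices (in the original matrix) of the features that were kept.
kept = selector.get_support(indices=True)
```

The reduced matrix can then be passed to the SVM; the threshold is worth tuning by cross-validation rather than fixing in advance.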

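Another good approach here would be to use ridge/lasso logistic regression. The lasso (L1) penalty by its very nature drives the coefficients of redundant variables to exactly zero, which effectively discards them, while the ridge (L2) penalty shrinks the coefficients of correlated variables without removing them. Either can help with your question 3 (correlated variables).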
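
A minimal sketch of the lasso variant in sklearn, on synthetic data. The value C=0.1 and the 1e-5 threshold are assumptions for the example, not recommendations; smaller C means a stronger penalty and fewer surviving features.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: 40 features, only columns 0 and 1 determine the class.
X = rng.normal(size=(300, 40))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# L1-penalised logistic regression; liblinear is a solver that supports L1.
lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)

# Keep only features whose coefficient is not (numerically) zero.
selector = SelectFromModel(lasso_logit, threshold=1e-5)
X_reduced = selector.fit_transform(X, y)

# Column indices (in the original matrix) of the features that were kept.
kept = selector.get_support(indices=True)
```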

I appreciate you're new to this, but both of the models above cope well with correlated / non-significant variables, so you may want to use them on the way to finalising an SVM.

3 - There's no magic bullet that I know of. The above may help. I predominantly use R, and within that there's a package called Boruta which is good for this step. There may be a Python equivalent?
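
For what it's worth, a crude alternative to the model-based approaches is a plain correlation filter. The sketch below is not Boruta; it is a simple greedy filter, and the drop_correlated helper and the 0.95 threshold are made up for illustration. It assumes a dense array with no constant columns, and the full correlation matrix gets large with thousands of features.

```python
import numpy as np

def drop_correlated(X, threshold=0.95):
    """Return the indices of the columns to keep.

    Columns are visited left to right; a column is dropped if its absolute
    Pearson correlation with any already kept column exceeds the threshold.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(corr.shape[0]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))

# Column 3 is a near-copy of column 0, so the filter should drop it.
noisy_copy = base[:, [0]] + 0.01 * rng.normal(size=(500, 1))
X = np.hstack([base, noisy_copy])

kept = drop_correlated(X)
X_reduced = X[:, kept]
```

Which column of a correlated pair survives depends only on column order here, so this says nothing about which one is more useful for prediction.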

Jon