I am building a random forest multi-class classifier. There are hundreds of households, each with 200+ features, and based on these features I have to classify each household into one of the classes {1,2,3,4,5,6}.

The problem I am facing is that I cannot improve the accuracy of the model no matter how much I try. I have used RandomizedSearchCV and also GridSearchCV, but I can only achieve an accuracy of around 68%.

Some points to note:

  1. The sample points are unbalanced. This is the order of the classes by decreasing frequency: {1,4,2,5,6,3}. I have used class_weight = "balanced" but it does not improve the accuracy.
  2. I have tried numbers of estimators ranging from 50 to 450.
  3. I have also calculated the F1 score rather than going only by accuracy to compare the models.

What else do you suggest to improve the accuracy/F1 score? I have been stuck on this problem for a long time. Any help will be highly appreciated.
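For reference, a minimal sketch of the kind of setup described above, assuming scikit-learn. The data here is a synthetic stand-in with the rough shape mentioned, and the parameter ranges are illustrative:

    from scipy.stats import randint
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    # Synthetic stand-in for the household data (replace with your own X, y)
    X, y = make_classification(n_samples=600, n_features=200, n_informative=40,
                               n_classes=6, random_state=42)

    rf = RandomForestClassifier(class_weight="balanced", random_state=42)
    param_dist = {
        "n_estimators": randint(50, 450),  # the 50-450 range from the question
        "max_depth": [None, 10, 20, 30],
    }
    search = RandomizedSearchCV(rf, param_distributions=param_dist, n_iter=20,
                                scoring="f1_macro",  # macro F1 weighs all classes equally
                                cv=5, n_jobs=-1, random_state=42)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)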

code_crusher
  • Have you tried reducing the number of features (some might be highly correlated and not give any new information)? Have you normalized your data? – Oct 08 '18 at 14:03
  • I read that since it is a Random Forest model, it is not necessary to normalize the data. All the features are necessary but I will still try to remove some of them to see if it helps. – code_crusher Oct 08 '18 at 14:06
  • Are your features quantitative, qualitative, or both? How do you encode them? Do you use LabelEncoder or one-hot encoding? – Gabriel M Oct 08 '18 at 14:58
  • The names of the classes were already 1,2,3,4,5,6, so I am not using any kind of encoding. My features are both qualitative and quantitative. – code_crusher Oct 08 '18 at 15:18
  • Also, all the features are numerical. – code_crusher Oct 08 '18 at 15:19

2 Answers

You can check whether the features are on different scales. If they are, it is suggested to apply some type of normalization; this step is essential for many linear models to perform well. You can take a quick look at the distribution of each numeric feature to decide what type of normalization to use.
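As a rough illustration (pandas and scikit-learn assumed; the DataFrame here is synthetic stand-in data, not the asker's):

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in features on very different scales
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"income": rng.lognormal(10, 1, 500),
                       "members": rng.integers(1, 10, 500)})

    # Compare scales across features
    print(df.describe().T[["mean", "std", "min", "max"]])

    # Histograms reveal skew; heavily skewed features may benefit from np.log1p
    df.hist(bins=30, figsize=(8, 4))

    # Standardization brings everything to zero mean and unit variance
    X_scaled = StandardScaler().fit_transform(df)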

Sotirios

Try tuning the parameters below:

n_estimators

This is the number of trees to build before taking the majority vote or averaging the predictions. A higher number of trees gives you better performance but makes your code slower.

max_features

This is the maximum number of features Random Forest is allowed to try in an individual tree. There are multiple options available in Python for setting the maximum features.

min_samples_leaf

A leaf is the end node of a decision tree. A smaller leaf makes the model more prone to capturing noise in the training data. You can start with some minimum value like 75 and gradually increase it, then see at which value your accuracy is highest.
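A minimal tuning sketch over these three parameters, assuming scikit-learn; the grid values and the stand-in data are illustrative, not recommendations:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Synthetic stand-in data (replace with your own X, y)
    X, y = make_classification(n_samples=600, n_features=200, n_informative=40,
                               n_classes=6, random_state=42)

    param_grid = {
        "n_estimators": [100, 200, 400],
        "max_features": ["sqrt", "log2", 0.3],  # a rule or a fraction of features
        "min_samples_leaf": [1, 25, 75],
    }
    grid = GridSearchCV(RandomForestClassifier(class_weight="balanced",
                                               random_state=42),
                        param_grid, scoring="f1_macro", cv=5, n_jobs=-1)
    grid.fit(X, y)
    print(grid.best_params_)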

Otherwise:

  1. You can try XGBoost, LightGBM, or AdaBoost; they often perform better than Random Forest (see the sketch after this list).

  2. Try not removing missing values; complex ensemble models such as RF and GBM handle them well, and you may have lost some useful information by doing so, especially if a large percentage of your data is missing in some features.

  3. Try increasing n_estimators and max_depth; maybe your trees are not deep enough to capture all the properties of the data.
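A hedged sketch of point 1 (which also illustrates point 2, since LightGBM handles missing values natively); LightGBM is just one example, and the hyperparameter values and stand-in data are illustrative:

    from lightgbm import LGBMClassifier
    from sklearn.datasets import make_classification
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in data (replace with your own X, y; NaNs are allowed)
    X, y = make_classification(n_samples=600, n_features=200, n_informative=40,
                               n_classes=6, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.2, random_state=42)

    clf = LGBMClassifier(n_estimators=300, class_weight="balanced",
                         random_state=42)
    clf.fit(X_train, y_train)
    print(f1_score(y_test, clf.predict(X_test), average="macro"))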