
I need to write a custom random-selection module for scikit-learn, i.e. custom random selection of features ("max_features") and of a subsample of the training data ("subsample"), to be used with sklearn.ensemble.RandomForestClassifier and GradientBoostingClassifier. Can someone point me to an example, documentation, or discussion of this? The idea is to stratify on one column of the training data (not the dependent variable Y) when bagging in RandomForestClassifier.

dgomzi

1 Answer


It seems like you have two main options here (a sketch of each follows below):

  1. Iterate through the learners manually. It will be slow, but you can feed each estimator the stratified subsample yourself.

  2. Weight the samples by the inverse of the group proportion in that column. For example, if the column looks like [a, a, b, b, b], the sample weights would be [5/2, 5/2, 5/3, 5/3, 5/3], so that the total contribution to the loss is equal for each value of that variable. You pass the weights in via model.fit(X, y, sample_weight=sample_weight).
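For option 1, here is a rough sketch of what "iterating manually" could look like. Note it does not use RandomForestClassifier itself: it fits one DecisionTreeClassifier per stratified bootstrap sample drawn with sklearn.utils.resample(..., stratify=...) and combines them by majority vote. The names fit_stratified_bagging, predict_majority, and strat_col are made up for illustration, and X, y, strat_col are assumed to be NumPy arrays you already have.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def fit_stratified_bagging(X, y, strat_col, n_estimators=100,
                           subsample=0.8, max_features="sqrt",
                           random_state=0):
    """Fit one tree per stratified bootstrap sample of the training data."""
    rng = np.random.RandomState(random_state)
    n_rows = int(subsample * len(X))
    trees = []
    for _ in range(n_estimators):
        # resample(..., stratify=strat_col) draws a bootstrap-style sample
        # whose group proportions follow the stratification column
        idx = resample(np.arange(len(X)), n_samples=n_rows,
                       stratify=strat_col, random_state=rng)
        tree = DecisionTreeClassifier(max_features=max_features,
                                      random_state=rng)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_majority(trees, X):
    # Majority vote across the individually fitted trees
    # (assumes integer class labels).
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.apply_along_axis(
        lambda col: np.bincount(col).argmax(), 0, votes)
```

You lose the optimized Cython training loop of the built-in ensemble this way, which is why it will be noticeably slower, but you get full control over how each subsample is drawn.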
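For option 2, a minimal sketch with toy data, assuming the stratification column is available as a separate array (called strat_col here): each row is weighted by the inverse frequency of its strat_col value, matching the [a, a, b, b, b] -> [5/2, 5/2, 5/3, 5/3, 5/3] example above, and the weights are passed to fit via sample_weight.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy data: strat_col is the (hypothetical) column to balance on;
# it is NOT the target y.
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = rng.randint(0, 2, size=200)
strat_col = rng.choice(["a", "b"], size=200, p=[0.4, 0.6])

# Inverse-frequency weights: every value of strat_col gets the same
# total weight, so each group contributes equally to the loss.
values, counts = np.unique(strat_col, return_counts=True)
inv_freq = {v: len(strat_col) / c for v, c in zip(values, counts)}
sample_weight = np.array([inv_freq[v] for v in strat_col])

model = GradientBoostingClassifier(subsample=0.8, max_features="sqrt")
model.fit(X, y, sample_weight=sample_weight)
```

The same sample_weight argument is accepted by RandomForestClassifier.fit as well.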

Aaron