If we look at the source code, `RandomForestClassifier` is a subclass of `ForestClassifier`, which in turn is a subclass of `BaseForest`, and the `fit()` method is actually defined in the `BaseForest` class. As OP pointed out, the interaction between `class_weight` and `sample_weight` determines the sample weights used to fit each decision tree of the random forest.
If we inspect the `_validate_y_class_weight()`, `fit()` and `_parallel_build_trees()` methods, we can better understand the interaction between the `class_weight`, `sample_weight` and `bootstrap` parameters. In particular,
- if `class_weight` is passed to the `RandomForestClassifier()` constructor but no `sample_weight` is passed to `fit()`, the class weights (expanded to one weight per sample) are used as the sample weights
- if both `sample_weight` and `class_weight` are passed, they are multiplied together to determine the final sample weights used to train each individual decision tree
- if `class_weight=None`, then `sample_weight` alone determines the final sample weights (if it is also `None`, which is the default, all samples are weighted equally).
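One way to check the second bullet empirically is to fit two forests with the same `random_state`: one given `class_weight` plus `sample_weight`, the other given only their product as `sample_weight`. This is a sketch under assumptions: the synthetic dataset, the weight values and the variable names are all made up for illustration, and the two models are only expected to agree because the rules above say the weights end up identical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic, imbalanced data (assumed for illustration only)
X, y = make_classification(n_samples=200, n_features=5,
                           weights=[0.8, 0.2], random_state=0)
rng = np.random.RandomState(0)
sample_weight = rng.uniform(0.5, 2.0, size=y.shape[0])
class_weight = {0: 1.0, 1: 4.0}

# Model A: class_weight in the constructor, sample_weight in fit()
rf_a = RandomForestClassifier(n_estimators=20, class_weight=class_weight,
                              random_state=0)
rf_a.fit(X, y, sample_weight=sample_weight)

# Model B: no class_weight, but the product passed as sample_weight
combined = sample_weight * compute_sample_weight(class_weight, y)
rf_b = RandomForestClassifier(n_estimators=20, random_state=0)
rf_b.fit(X, y, sample_weight=combined)

# Compare the predicted probabilities of the two forests
same = np.allclose(rf_a.predict_proba(X), rf_b.predict_proba(X))
print(same)
```

If the two forests disagree on your scikit-learn version, that would suggest the weighting logic has changed relative to what is described here.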
The relevant part in the source code may be summarized as follows.
    from sklearn.utils import compute_sample_weight

    if class_weight == "balanced_subsample" and not bootstrap:
        # without bootstrapping there is no subsample, so fall back to "balanced"
        expanded_class_weight = compute_sample_weight("balanced", y)
    elif class_weight is not None and class_weight != "balanced_subsample":
        # a dict or "balanced" is expanded here whether or not bootstrap is used
        expanded_class_weight = compute_sample_weight(class_weight, y)
    else:
        # None, or "balanced_subsample" with bootstrap (handled per tree later)
        expanded_class_weight = None

    if expanded_class_weight is not None:
        if sample_weight is not None:
            sample_weight = sample_weight * expanded_class_weight
        else:
            sample_weight = expanded_class_weight
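To make the expansion step concrete, here is a small worked example of how class weights turn into per-sample weights and then combine with `sample_weight`. The toy labels, the weight values and the variable names are assumptions made for illustration; only `compute_sample_weight` comes from scikit-learn.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy data (assumed): three samples of class 0 and one sample of class 1
y = np.array([0, 0, 0, 1])
sample_weight = np.array([1.0, 2.0, 1.0, 0.5])

# A dict class_weight is expanded to one weight per sample: 1, 1, 1, 3
expanded = compute_sample_weight({0: 1.0, 1: 3.0}, y)

# When both are given, the final weights are the elementwise product:
# 1, 2, 1, 1.5
final = sample_weight * expanded

# "balanced" uses n_samples / (n_classes * class_count),
# i.e. 4 / (2 * 3) for class 0 and 4 / (2 * 1) for class 1
balanced = compute_sample_weight("balanced", y)

print(expanded, final, balanced)
```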
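With `bootstrap=True`, the observations used to train each individual tree are randomly selected with replacement. This is done not by subsetting the data but via the `sample_weight` argument passed when each tree is fit: each observation's weight is multiplied by the number of times it was drawn, so observations that were not drawn get zero weight. The relevant (abridged) code in `_parallel_build_trees()` looks like the following.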
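    import numpy as np
    from sklearn.utils import check_random_state, compute_sample_weight
    if bootstrap:
        # work on a per-tree copy so the caller's weights are not modified
        if sample_weight is None:
            curr_sample_weight = np.ones((X.shape[0],), dtype=np.float64)
        else:
            curr_sample_weight = sample_weight.copy()
        # draw n_samples_bootstrap row indices with replacement
        indices = check_random_state(tree.random_state).randint(0, X.shape[0], n_samples_bootstrap)
        # number of times each row was drawn (0 for rows left out of this tree)
        sample_counts = np.bincount(indices, minlength=X.shape[0])
        curr_sample_weight *= sample_counts
        if class_weight == "balanced_subsample":
            curr_sample_weight *= compute_sample_weight("balanced", y, indices=indices)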
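The bootstrap weighting can also be tried outside the forest. The following is a minimal standalone sketch, not the library's own code: the toy labels, the seed and the variable names are assumptions, and it simply replays the draw, count and multiply steps for a single hypothetical tree.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy setup (assumed): 10 samples, imbalanced labels, unit starting weights
y = np.array([0] * 8 + [1] * 2)
n_samples = y.shape[0]
curr_sample_weight = np.ones(n_samples, dtype=np.float64)

# Draw a bootstrap sample of row indices with replacement
rng = np.random.RandomState(0)
indices = rng.randint(0, n_samples, n_samples)

# Count how often each row was drawn; rows never drawn get a count of 0
sample_counts = np.bincount(indices, minlength=n_samples)
curr_sample_weight *= sample_counts

# "balanced_subsample": rebalance using class frequencies within the
# bootstrap sample rather than within the full dataset
curr_sample_weight *= compute_sample_weight("balanced", y, indices=indices)

print(sample_counts)
print(curr_sample_weight)
```

Rows with a count of zero end up with zero weight, which is how a tree effectively ignores the observations left out of its bootstrap sample.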