
Here is the help for `sklearn.ensemble.RandomForestClassifier.fit()`. It is not clear whether there can be a problem when X and y are sorted by label. My preliminary test (a sketch of that kind of check is shown below) suggests that it does not matter whether X and y are sorted.
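
For reference, a minimal sketch of such a check (the dataset, seed, and split are illustrative choices, not the original test): fit one forest on training data sorted by label and one on shuffled training data, then compare test accuracy.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Training set sorted by label
order = np.argsort(y_train, kind="stable")
X_sorted, y_sorted = X_train[order], y_train[order]

# Training set shuffled (keeps X and y aligned)
X_shuf, y_shuf = shuffle(X_train, y_train, random_state=0)

clf_sorted = RandomForestClassifier(random_state=0).fit(X_sorted, y_sorted)
clf_shuf = RandomForestClassifier(random_state=0).fit(X_shuf, y_shuf)

print("accuracy (sorted):  ", clf_sorted.score(X_test, y_test))
print("accuracy (shuffled):", clf_shuf.score(X_test, y_test))
```

The two scores come out essentially the same, although they need not be bit-for-bit identical: with a fixed `random_state` the bootstrap indices point to different rows once the row order changes.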

Is my conclusion correct?

Help on class RandomForestClassifier in module sklearn.ensemble._forest:

class RandomForestClassifier(ForestClassifier)
...
 |      Build a forest of trees from the training set (X, y).
 |
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix} of shape (n_samples, n_features)
 |          The training input samples. Internally, its dtype will be converted
 |          to ``dtype=np.float32``. If a sparse matrix is provided, it will be
 |          converted into a sparse ``csc_matrix``.
 |
 |      y : array-like of shape (n_samples,) or (n_samples, n_outputs)
 |          The target values (class labels in classification, real numbers in
 |          regression).
  • Destroying such orderings by shuffling the data is *always* a good idea, for very general reasons that have nothing to do with RF in particular; see my own answers in [Data shuffling for Image Classification](https://stackoverflow.com/a/61218707/4685471), [Confusion Matrix : Shuffle vs Non-Shuffle](https://stackoverflow.com/a/61227567/4685471), and [Cross_val_score is not working with roc_auc and multiclass](https://stackoverflow.com/a/55309222/4685471) – desertnaut Jun 30 '21 at 18:17
  • I am particularly interested in RF. I would like to understand whether RF requires randomly ordered input to work properly, or whether sorted input works equally well. None of the links that you mentioned prove that `sklearn.ensemble.RandomForestClassifier.fit` requires shuffling of the input. – user1424739 Jun 30 '21 at 18:44
  • Because in itself it does not, as already implied; but there are dozens of other reasons why you should do so. And that's why shuffling is routinely included in almost all ML pipelines, regardless of the algorithm used. – desertnaut Jun 30 '21 at 19:51
  • On the other hand, what *exactly* are you asking here? What is the reason to believe that the algorithm *may* have issues when the data are ordered by label? There is nothing in your post to even remotely suggest that this might be the case, so what kind of confirmation are you after? – desertnaut Jun 30 '21 at 20:57
  • Why did this question get closed? I think this is a perfectly clear question to ask, and there is already an answer. It makes no sense to close it as unclear. – user1424739 Jul 01 '21 at 00:21

1 Answer


It does not matter in the case of RandomForestClassifier.

Random forest is an ensemble of weak learners (decision trees) that perform majority voting.

As we need different trees that take their decisions based on different samples and features, the algorithm uses bootstrapping (argument `bootstrap=True` in `RandomForestClassifier`), which performs random sampling with replacement. In addition to the bootstrap samples, random subsets of features are drawn when training the individual trees.
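
As a toy illustration of that idea (this is not scikit-learn's internal code; the shapes and names are made up), drawing a bootstrap sample and a random feature subset looks roughly like this:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 8

# Bootstrap sample: draw n_samples row indices *with replacement*,
# so the original row order of the data is irrelevant to which rows a tree sees.
row_idx = rng.integers(0, n_samples, size=n_samples)

# Feature subsampling: e.g. sqrt(n_features) candidate features per split,
# roughly what the default max_features="sqrt" does for classification.
n_candidates = int(np.sqrt(n_features))
feature_idx = rng.choice(n_features, size=n_candidates, replace=False)

print("rows drawn for this tree   :", np.sort(row_idx)[:10], "...")
print("candidate features at split:", feature_idx)
```

Because the row indices are drawn at random with replacement from the whole training set, sorting the training data by label does not change the distribution of rows each tree is built on.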

Bootstrapping is essential to random forest. Without it, all trees would be more or less similar and based on the same features, which would defeat the whole purpose of the majority voting.

Therefore we can say that the order of the samples does not matter. However, as desertnaut said in their comment, it is always better to shuffle the data to avoid other potential problems.
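
If you do shuffle, a minimal sketch (dataset and seed are illustrative) is to use `sklearn.utils.shuffle`, which keeps X and y aligned:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle

# load_iris returns rows ordered by class; shuffle before splitting/fitting
X, y = load_iris(return_X_y=True)
X, y = shuffle(X, y, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X, y)
```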

Note: the StatQuest videos on the subject are really good for understanding how it works in depth.

– Antoine Dubuis