The DecisionTreeClassifier()
function is apparently documented here:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
So this function has many arguments. But in Python, function arguments may have default values. Here, all arguments have default values, so you can even call the function with an empty argument list, like this:
clf = tree.DecisionTreeClassifier()
The parameter of interest, random_state
is documented like this:
random_state: int, RandomState instance or None, default=None
So your call is equivalent to, among many other things:
clf = tree.DecisionTreeClassifier(random_state=None)
The None
value tells the library that you don't want to bother with providing a seed (that is, an initial state) to the underlying pseudo-random number generator. Hence, the library has to come up with some seed.
Typically, it will take the current time value, with microsecond precision if possible, and apply some hash function. So at every call you will get a different initial state, and so a different sequence of pseudo-random numbers. Hence, a different tree.
You might want to try forcing the seed. For example:
clf = tree.DecisionTreeClassifier(random_state=42)
and see if your problem persists.
Now, regarding why does the decision tree require pseudo-random numbers, this is discussed for example here:
According to scikit-learn’s “best” and “random” implementation [4], both the “best” splitter and the “random” splitter uses Fisher-Yates-based algorithm to compute a permutation of the features array.
The Fisher-Yates algorithm is the most common way to compute a random permutation. Also, if stopped before completion, it can be used to extract a random subset of the data sample, for example if you need a random 10% of the sample to be excluded from the data fitting and set aside for a later cross-validation step.
Side note: in some circumstances, non-reproducibility can become a pain point, for example if you want to study the influence of an external parameter, say some global Y values bias. In that case, you don't want uncontrolled changes in the random numbers to blur the effects of your parameter changes. Hence the need for the API to provide some way to control the seed value.