
I am following a course on Udemy about data science with Python. The course focuses on the output of the algorithm rather than on the algorithm itself. In particular I am building a decision tree. Every time I run the algorithm in Python, even on the same samples, it gives me a slightly different decision tree. I asked the tutors and they told me "The decision tree does not guarantee the same results each run because of its nature." Can someone explain why in more detail, or maybe recommend a good book about it?

I built the decision tree on my data by importing:

import numpy as np
import pandas as pd
from sklearn import tree

and then running:

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X,y)

where X is my feature data and y is my target data.
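
To illustrate, here is a minimal sketch (using the iris toy data just as an example, not my actual data) of how one can check whether two consecutive fits on the same samples produce the same tree:

from sklearn import tree
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Fit twice on exactly the same samples, without specifying a seed.
clf1 = tree.DecisionTreeClassifier().fit(X, y)
clf2 = tree.DecisionTreeClassifier().fit(X, y)

# export_text gives a printable description of each learned tree,
# so the two structures can be compared directly.
print(tree.export_text(clf1) == tree.export_text(clf2))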

Thank you

User_O
  • If you want reproducibility, you need to control the *seeding* process of the underlying pseudo-random number generator. As you don't give any details, we can't tell you more. If you don't explicitly control the seed through your API, the system will typically pick some hash of the current time for you. For example, see point 8 here: [https://www.tutorialspoint.com/scikit_learn/scikit_learn_decision_trees.htm](https://www.tutorialspoint.com/scikit_learn/scikit_learn_decision_trees.htm) – jpmarinier Mar 01 '22 at 12:34
  • @jpmarinier I updated the question. I cannot find documentation that explains step by step how the algorithm works, i.e. where tree.DecisionTreeClassifier has a random part. – User_O Mar 01 '22 at 12:49
  • [Here is the source](https://github.com/scikit-learn/scikit-learn/blob/7e1e6d09b/sklearn/tree/_classes.py#L639) for the sklearn decision tree classifier. It says in part 'The features are always randomly permuted at each split, even if ``splitter`` is set to `"best"`' – John Coleman Mar 01 '22 at 13:12
  • @MardyOwens - You have to look at the full documentation of your function [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). See my answer for further details. Hope it helps. – jpmarinier Mar 01 '22 at 13:21

1 Answer


The DecisionTreeClassifier() function is apparently documented here:

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

This function has many arguments, but in Python, function arguments may have default values. Here, every argument has a default, so you can even call the function with an empty argument list, like this:

clf = tree.DecisionTreeClassifier()
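
As a quick illustration (not required, just to see those defaults), the standard get_params() method lists every argument with its current value, including random_state:

from sklearn import tree

clf = tree.DecisionTreeClassifier()

# get_params() lists every constructor argument with its current
# (here: default) value; note that random_state defaults to None.
for name, value in clf.get_params().items():
    print(name, "=", value)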

The parameter of interest, random_state, is documented like this:

random_state: int, RandomState instance or None, default=None

So your call is equivalent to, among the other implied defaults:

clf = tree.DecisionTreeClassifier(random_state=None)

The None value tells the library that you don't want to bother with providing a seed (that is, an initial state) to the underlying pseudo-random number generator. Hence, the library has to come up with some seed.

Typically, it will take the current time value, with microsecond precision if possible, and apply some hash function. So at every call you will get a different initial state, and so a different sequence of pseudo-random numbers. Hence, a different tree.
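
You can see the same behaviour with NumPy's own generator (used here only as an analogy, not as what scikit-learn does internally): an unseeded generator starts from a fresh, unpredictable state at every program run, while a seeded one always reproduces the same sequence.

import numpy as np

# Unseeded: the initial state comes from OS entropy / the current time,
# so each new program run produces a different sequence.
rng_unseeded = np.random.RandomState()
print(rng_unseeded.randint(0, 100, size=5))

# Seeded: the same seed always reproduces the same sequence, run after run.
rng_seeded = np.random.RandomState(42)
print(rng_seeded.randint(0, 100, size=5))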

You might want to try forcing the seed. For example:

clf = tree.DecisionTreeClassifier(random_state=42)

and see if your problem persists.
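
For instance, one quick way to convince yourself (a sketch, using the iris toy data) is to fit twice with the same fixed seed and compare the exported tree structures:

from sklearn import tree
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Two independent fits with the same fixed seed...
clf_a = tree.DecisionTreeClassifier(random_state=42).fit(X, y)
clf_b = tree.DecisionTreeClassifier(random_state=42).fit(X, y)

# ...should yield exactly the same tree structure.
print(tree.export_text(clf_a) == tree.export_text(clf_b))   # expected: True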

Now, regarding why the decision tree requires pseudo-random numbers at all, this is discussed, for example, here:

According to scikit-learn’s “best” and “random” implementation [4], both the “best” splitter and the “random” splitter uses Fisher-Yates-based algorithm to compute a permutation of the features array.

The Fisher-Yates algorithm is the most common way to compute a random permutation. Also, if stopped before completion, it can be used to extract a random subset of the data sample, for example if you need a random 10% of the sample to be excluded from the data fitting and set aside for a later cross-validation step.
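
For illustration only (this is not scikit-learn's actual code), a plain-Python Fisher-Yates shuffle looks like this; stopping the loop early leaves a uniformly random subset in the last positions:

import random

def fisher_yates_shuffle(items):
    # Walk backwards; at each step pick a random index j in [0, i]
    # and swap positions i and j.
    for i in range(len(items) - 1, 0, -1):
        j = random.randint(0, i)   # random.randint includes both endpoints
        items[i], items[j] = items[j], items[i]
    return items

features = ["sepal length", "sepal width", "petal length", "petal width"]
print(fisher_yates_shuffle(features[:]))   # a random permutation of the features

# Stopping after k iterations instead of running to completion leaves a
# uniformly random k-element subset in the last k positions, which is how
# you can carve out e.g. a random 10% hold-out sample.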

Side note: in some circumstances, non-reproducibility can become a pain point, for example if you want to study the influence of an external parameter, say a global bias added to the Y values. In that case, you don't want uncontrolled changes in the random numbers to blur the effects of your parameter changes. Hence the need for the API to provide some way to control the seed value.
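
As a hypothetical sketch of that situation (here using max_depth as a stand-in for whatever external parameter you are studying), fixing random_state ensures that any difference between the runs comes from the parameter alone:

from sklearn import tree
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Sweep the external parameter while keeping the seed fixed, so the
# only thing that changes between runs is the parameter itself.
for max_depth in (2, 3, 4, None):
    clf = tree.DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    clf.fit(X, y)
    print(max_depth, clf.score(X, y))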

jpmarinier