
The Python API documentation says little more than that the seed= parameter is passed to numpy.random.seed:

seed (int) – Seed used to generate the folds (passed to numpy.random.seed).

But what features of xgboost use numpy.random.seed?

  • Running xgboost with all default settings still produces the same performance even when altering the seed.
  • I have already been able to verify colsample_bytree does so; different seeds yield different performance.
  • I have been told it is also used by subsample and the other colsample_* features, which seems plausible since any form of sampling requires randomness.

What other features of xgboost rely on numpy.random.seed?

gosuto

2 Answers


Boosted trees are grown sequentially, with tree growth within one iteration distributed among threads. To avoid overfitting, randomness is introduced through the following parameters:

  • colsample_bytree
  • colsample_bylevel
  • colsample_bynode
  • subsample (note the *sample* pattern)
  • shuffle, when creating the folds for cross-validation
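To illustrate why the seed has no visible effect with default settings, here is a minimal sketch in plain Python. It is not xgboost's implementation; sample_rows is a hypothetical helper that mimics seeded row subsampling. With the sampling fraction at 1.0 (the default for subsample and the colsample_* parameters) every row is kept, so nothing random is left for the seed to influence.

```python
import random


def sample_rows(n_rows, fraction, seed):
    """Hypothetical helper: pick round(fraction * n_rows) row indices
    using a generator seeded with `seed`."""
    rng = random.Random(seed)
    k = round(fraction * n_rows)
    return sorted(rng.sample(range(n_rows), k))


# fraction = 1.0: all rows are selected, whatever the seed is.
full_a = sample_rows(10, 1.0, seed=0)
full_b = sample_rows(10, 1.0, seed=1)
print(full_a == full_b)  # True

# fraction < 1.0: the subset depends on the seed, and the same seed
# reproduces the same subset.
half_a = sample_rows(10, 0.5, seed=0)
half_b = sample_rows(10, 0.5, seed=0)
print(half_a == half_b)  # True
```

Once the fraction drops below 1.0, a different seed will usually select a different subset, which is consistent with the observation in the question that colsample_bytree makes the results seed-dependent.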

In addition, you may encounter non-determinism, not controlled by random state, in the following places:

  • [GPU] Histogram building is not deterministic due to the non-associative aspect of floating point summation.
  • Using the gblinear booster with the shotgun updater is nondeterministic, as it uses the Hogwild algorithm.
  • When using the GPU ranking objective, the result is not deterministic due to the non-associative aspect of floating point summation.
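The floating point issue mentioned above can be reproduced without a GPU. The following is only a small illustration of non-associative summation in plain Python, not of xgboost itself: the same three numbers summed in a different grouping give a slightly different result, which is what happens when parallel threads accumulate partial sums in a varying order.

```python
values = [0.1, 0.2, 0.3]

# Same numbers, different grouping of the additions.
left = (values[0] + values[1]) + values[2]
right = values[0] + (values[1] + values[2])

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

No seed can remove this kind of variation, because it comes from the order of operations rather than from a random number generator.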

Re: the comment asking how I know this.

To work this out, it helps to:

  1. Be aware of how trees are grown: Demystify Modern Gradient Boosting Trees (its references may also be helpful).

  2. Scan the full text of the documentation for the terms of interest: random, sample, deterministic, determinism, etc.

  3. Lastly (firstly?), understand why sampling is needed, including similar cases in counterparts like bagged trees (Random Forests by Leo Breiman) and neural networks (Deep Learning with Python by François Chollet, chapter on overfitting).

Sergey Bushmanov

Well, if you want the exhaustive list, you could look at the source on GitHub. Searching the repository by keyword gives good insight.

search for 'rand' - 15 results

search for 'seed' and python filter - 20 results

manju-dev
  • This is a start. The problem is that the Python code is just a wrapper around the core, which by the looks of it is written in C++. Doing it this way would require vast knowledge of the source code. I was hoping for a less elaborate solution. – gosuto Dec 31 '20 at 18:25
  • I have not used this library, but typically the library functions that use randomness have some arguments to control that behaviour. If the function you are using doesn't have any such arguments listed, then I think it is safe to assume it doesn't use randomness. You can test this quickly by calling your function with the same input multiple times and checking whether it returns the same output. – manju-dev Jan 01 '21 at 05:52
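The check suggested in the last comment can be sketched generically. This is a hedged illustration in plain Python: is_repeatable, seeded_draw and unseeded_draw are hypothetical names, and the two draw functions merely stand in for a call such as training a model with a fixed input and comparing its predictions.

```python
import random


def is_repeatable(fn, runs=5):
    """Return True if fn() gives the same result on every one of `runs` calls."""
    first = fn()
    return all(fn() == first for _ in range(runs - 1))


def seeded_draw():
    # Stand-in for a call whose randomness is fully controlled by a seed.
    return random.Random(42).random()


def unseeded_draw():
    # Stand-in for a call that draws from an uncontrolled random state.
    return random.random()


print(is_repeatable(seeded_draw))    # True
print(is_repeatable(unseeded_draw))  # almost certainly False
```

A repeatable result with a fixed seed and a changing result across seeds suggests the feature depends on the seed. A result that changes even with a fixed seed points to non-determinism that the seed does not control.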