
I have been going through the CatBoost algorithm and it is hard for me to see the point of using symmetric trees. In this regard, I found the following in their GitHub:

An important part of the algorithm is that it uses symmetric trees and builds them level by level. A symmetric tree is a tree where the nodes of each level use the same split. This allows encoding the path to a leaf with an index. For example, consider a tree of depth 2. The split on the first level is f1<2, and the split on the second level is f2<4. Then the object with f1=5, f2=0 will end up in the leaf with number 01b.
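To make the quoted example concrete, here is a minimal sketch (my own illustration, not CatBoost code) of how the path through a depth-2 symmetric tree becomes a binary leaf index, using the splits f1<2 and f2<4 from the example:

```python
# Sketch of the leaf-index encoding from the CatBoost docs example:
# level-1 split is f1 < 2, level-2 split is f2 < 4.
# Each level contributes one bit; the first level is the high-order bit.

def leaf_index(f1: float, f2: float) -> int:
    bit1 = 1 if f1 < 2 else 0   # first-level split
    bit2 = 1 if f2 < 4 else 0   # second-level split
    return (bit1 << 1) | bit2   # pack the two bits into an index

# The object from the example: f1=5 (so f1<2 is False, bit 0),
# f2=0 (so f2<4 is True, bit 1) -> binary 01 -> leaf index 1.
print(leaf_index(5, 0))  # -> 1
```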

They say it helps the model to be less prone to overfitting and to have much quicker inference, but intuitively it seems to me that you need twice the depth to explore the same number of splits.

So, can anybody explain what the actual advantage of using this type of tree is?

Many thanks.


Since this is the first result in a Google search, I will provide the answer.

Typical decision trees are a series of if/else decisions. Assume you can make one such decision per processor cycle: this is fast, but 100% sequential. To reach a leaf you need O(m) sequential decisions, where m is the maximal height of the tree.

In CatBoost's symmetric trees, every node on a given level uses the same split, i.e. the same feature and the same threshold. To determine whether you go left or right at a level, you only need to know that level's feature and threshold, which are identical for all nodes on the level. This way, you can vectorize the decisions: build a vector of the per-level thresholds, gather the corresponding feature values of the object you are predicting for, and compare the two vectors element-wise. If you have a vector processor, i.e. one that can perform multiple comparisons in parallel (which is very common nowadays), you need a single processor cycle to make all the decisions.
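A rough sketch of this idea in NumPy (my own illustration, not CatBoost's actual implementation; the per-level features, thresholds, and leaf values below are made-up): one element-wise comparison replaces d sequential if/else branches for a tree of depth d.

```python
import numpy as np

depth = 3
# Hypothetical symmetric tree: one (feature, threshold) pair per level.
split_features = np.array([0, 2, 1])            # feature index used at each level
split_thresholds = np.array([2.0, 4.0, 1.5])    # threshold used at each level
leaf_values = np.arange(2 ** depth, dtype=float)  # 2^depth leaves, dummy values

def predict(x: np.ndarray) -> float:
    # Gather the feature each level looks at and compare all levels at once.
    bits = x[split_features] < split_thresholds   # boolean vector, one bit per level
    # Pack the bits into a leaf index (level 0 is the high-order bit).
    idx = int(bits @ (1 << np.arange(depth - 1, -1, -1)))
    return leaf_values[idx]

x = np.array([5.0, 0.0, 3.0])
# bits: x[0]<2.0 -> False, x[2]<4.0 -> True, x[1]<1.5 -> True => 0b011 == 3
print(predict(x))  # -> 3.0
```

Note there is no branching at all in `predict`: the comparison and the bit-packing are both straight-line vector operations, which is exactly what makes this friendly to SIMD hardware.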

The difference, as you can see, boils down to vectorization: you can go directly from the root to a leaf in one step of element-wise vector comparison, instead of a sequence of if/else decisions.
