How to get sample split probability values from tree-based models - esp via the h2o framework

Question

Following the proposed tree interpreter approach (http://blog.datadive.net/interpreting-random-forests/) one can explain a tree-based model prediction using info from the decision path.

I've built tree models with H2O and exported them as PMML to do so. However, only the terminal nodes contain probability scores; the branching nodes, which the tree interpreter approach needs, do not.

I've tested with packages from R (rpart, randomForest) and Python (sklearn), but it seems they tend not to store the split info in the resulting model. So far only BigML seems to produce the needed PMML structure.

Do you know which other libraries I can try? Is there a workaround to compute the sample split values myself and then generate a corresponding PMML file?

Thanks K

score 0 · Answer 1 · answered Jun 07 '23 at 05:39

The PMML converter can only use information that was made available by the original ML framework. If partial predictions are not available for intermediate tree levels, it is because the original ML framework did not store this information in the model dump file (the in-memory representation and the dumped representation are sometimes different).

Information about intermediate tree levels is typically omitted because it is useless for most application scenarios: a prediction is computed exclusively from terminal nodes.

Probability distribution-type trees and their ensembles come in two sub-types. Some models present this information in absolute terms (per-class record counts), whereas others present it in relative terms (class probabilities).

If you are working with a model that has record counts available for terminal nodes, you can reconstruct record counts for all higher levels by aggregating them: the counts of a branching node are the sum of the counts of its children. Continue the aggregation up to the root of the tree, where the sum of the calculated record counts will equal the size of the training dataset.
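The aggregation described above can be sketched as follows. This is a minimal illustration, not an H2O or PMML API: the `Node` class, the `fill_counts` and `probabilities` helpers, and the example counts are all hypothetical, and it assumes you have already parsed the tree into memory with per-class record counts on every terminal node.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    # Hypothetical in-memory tree node (an assumption for this sketch).
    name: str
    children: List["Node"] = field(default_factory=list)
    counts: Optional[Dict[str, float]] = None  # known for terminal nodes only


def fill_counts(node: Node) -> Dict[str, float]:
    """Fill in per-class record counts bottom-up; return this node's counts."""
    if not node.children:
        if node.counts is None:
            raise ValueError(f"terminal node {node.name!r} has no record counts")
        return node.counts
    total: Dict[str, float] = {}
    for child in node.children:
        # A branching node's counts are the sum of its children's counts.
        for label, n in fill_counts(child).items():
            total[label] = total.get(label, 0) + n
    node.counts = total
    return total


def probabilities(node: Node) -> Dict[str, float]:
    """Convert a node's record counts into class probabilities."""
    n = sum(node.counts.values())
    return {label: c / n for label, c in node.counts.items()}


# Made-up example tree: two levels, three terminal nodes, 100 records in total.
root = Node("root", [
    Node("left", [
        Node("leaf1", counts={"yes": 30, "no": 10}),
        Node("leaf2", counts={"yes": 5, "no": 15}),
    ]),
    Node("leaf3", counts={"yes": 10, "no": 30}),
])
fill_counts(root)
print(root.counts, probabilities(root.children[0]))
```

Once every branching node carries counts, the class probabilities needed by the tree interpreter approach follow by normalizing them. If the model only stores class probabilities on terminal nodes, this does not work as-is, because averaging probabilities correctly requires knowing how many records reached each node.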
