
I have an XGBoost binary classifier model trained in Python.

I would like to produce outputs from this model for new input data in a different scripting environment (MQL4), using pure mathematical operations and without the XGBoost library's `.predict`.

Can anyone help with the formula and/or algorithm?

Gursel Karacor
  • A courageous idea. Yes, **`xgboost`** is a "sexy"-tagged engine with still-growing popularity these days. So, **what have you tried so far** and **how did your model integration work in MQL4?** You might have already noticed that MQL4 is **not** a scripting environment, but a compiled language whose code execution relies on dynamically linked libraries. – user3666197 Dec 02 '16 at 12:51
  • I have an ANN model working in MQL4, but could not integrate XGBoost yet. I have some expertise with ANNs, but am not that good at XGBoost. I believe this could be done using the four arithmetical operations and some additional math functions, which all platforms have, just as I did with ANNs. – Gursel Karacor Dec 02 '16 at 13:34
  • There are no doubts about model execution ( a form of `aClassXgbGUESS = aTrainedXgbMODEL.predict( aFeatureVEC );` on the MQL4 side ), however there are many design-side issues in transferring `aTrainedXgbMODEL` from the python xgboost-native environment into the MQL4 code-execution environment ( not to speak of the model running its online-learning extensions ). So a production-grade system is quite demanding from the distributed-architecture point of view, not from the *(cit.:)* "four arithmetical operations" or "the formula and/or algorithm" perspective. – user3666197 Dec 02 '16 at 14:14
  • Is this a serious Programme? In other words, **how many man-months** of the ( arch + { prototype | rc | prod }-{ design + dev + test } + release + doc ) efforts **does your integration phase have budgeted?** – user3666197 Dec 02 '16 at 14:17

2 Answers


After some reverse engineering, I found out how. Once the model is trained, dump it into a text file first:

import xgboost as xgb

# param, dtrain and watchlist are set up as usual for xgb.train()
num_round = 3
bst = xgb.train(param, dtrain, num_round, watchlist)
bst.dump_model('D:/Python/classifyproduct.raw.txt')  # plain-text dump of every tree

Then, for each booster (tree), walk down the tree with the input feature values until you reach a leaf and take that leaf's value (a raw log-odds score, not a probability). Sum the leaf values over all boosters and, for the `binary:logistic` objective in our case, feed the sum into the logistic function:

1/(1+exp(-sum))

This is the output probability of the trained xgboost model for the given input feature set. (Note: with the default `base_score` of 0.5 the sum needs no offset, since logit(0.5) = 0; a different `base_score` would add its log-odds to the sum.) As an example, the dumped text file of my sample model with two inputs (a and b) was:

booster[0]:
0:[b<-1] yes=1,no=2,missing=1
1:[a<0] yes=3,no=4,missing=3
    3:[a<-2] yes=7,no=8,missing=7
        7:leaf=0.522581
        8:[b<-3] yes=13,no=14,missing=13
            13:leaf=0.428571
            14:leaf=-0.333333
    4:leaf=-0.54
2:[a<2] yes=5,no=6,missing=5
    5:[a<-8] yes=9,no=10,missing=9
        9:leaf=-0.12
        10:leaf=-0.56129
    6:[b<2] yes=11,no=12,missing=11
        11:leaf=-0.495652
        12:[a<4] yes=15,no=16,missing=15
            15:[b<7] yes=17,no=18,missing=17
                17:leaf=-0.333333
                18:leaf=0.333333
            16:leaf=0.456
booster[1]:
0:[b<-1] yes=1,no=2,missing=1
1:[a<0] yes=3,no=4,missing=3
    3:[b<-3] yes=7,no=8,missing=7
        7:leaf=0.418665
        8:[a<-3] yes=13,no=14,missing=13
            13:leaf=0.334676
            14:leaf=-0.282568
    4:leaf=-0.424174
2:[a<2] yes=5,no=6,missing=5
    5:[b<0] yes=9,no=10,missing=9
        9:leaf=-0.048659
        10:leaf=-0.445149
    6:[b<2] yes=11,no=12,missing=11
        11:leaf=-0.394495
        12:[a<5] yes=15,no=16,missing=15
            15:[b<7] yes=17,no=18,missing=17
                17:leaf=-0.330064
                18:leaf=0.333063
            16:leaf=0.392826
booster[2]:
0:[b<-1] yes=1,no=2,missing=1
1:[a<0] yes=3,no=4,missing=3
    3:[b<-3] yes=7,no=8,missing=7
        7:leaf=0.356906
        8:[a<-3] yes=13,no=14,missing=13
            13:leaf=0.289085
            14:leaf=-0.245992
    4:leaf=-0.363819
2:[a<4] yes=5,no=6,missing=5
    5:[a<2] yes=9,no=10,missing=9
        9:[b<0] yes=15,no=16,missing=15
            15:leaf=-0.0403689
            16:leaf=-0.381402
        10:[b<7] yes=17,no=18,missing=17
            17:leaf=-0.307704
            18:leaf=0.239974
    6:[b<2] yes=11,no=12,missing=11
        11:leaf=-0.308265
        12:leaf=0.302142
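
Each dump line is either a split node, `[feature<threshold] yes=..,no=..,missing=..` (the ids of the child nodes to follow when the comparison is true, false, or the feature is missing), or a leaf, `leaf=value`. As a minimal sketch (my own illustration, not part of the original answer; the file path and feature names `a`/`b` are reused from this example), the dump can be parsed and evaluated with nothing but comparisons and `exp` in Python, and the `predict` loop ports mechanically to MQL4:

import math
import re

def parse_dump(text):
    # One dict {node_id: node} per booster[i] section of the dump
    trees, nodes = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('booster['):
            nodes = {}
            trees.append(nodes)
            continue
        node_id, rest = line.split(':', 1)
        if rest.startswith('['):
            # split node: [feature<threshold] yes=..,no=..,missing=..
            m = re.match(r'\[(\w+)<(-?[\d.]+)\] yes=(\d+),no=(\d+),missing=(\d+)', rest)
            feat, thr, yes, no, miss = m.groups()
            nodes[int(node_id)] = ('split', feat, float(thr), int(yes), int(no), int(miss))
        else:
            # leaf node: leaf=value
            nodes[int(node_id)] = ('leaf', float(rest.split('=')[1]))
    return trees

def predict(trees, x):
    # Walk every tree down to a leaf, sum the leaf values, apply the sigmoid
    total = 0.0
    for nodes in trees:
        node = nodes[0]
        while node[0] == 'split':
            _, feat, thr, yes, no, miss = node
            v = x.get(feat)
            node = nodes[miss] if v is None else nodes[yes if v < thr else no]
        total += node[1]
    return 1.0 / (1.0 + math.exp(-total))

trees = parse_dump(open('D:/Python/classifyproduct.raw.txt').read())
print(predict(trees, {'a': 4, 'b': 9}))  # 0.748608... for the dump above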

I have 2 features as inputs. Let us say we have [4, 9] as an input, i.e. a = 4 and b = 9. Tracing each booster with this input leads to the following leaf values:

booster0 : 0.456
booster1 : 0.333063
booster2 : 0.302142
sum = 1.091205
1/(1+exp(-sum)) = 0.748608563

And that's it.
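
For completeness, here is the same computation written out as plain if/else chains in Python, the shape of code one would hand-port to MQL4 (my sketch, not the answer author's code; the `missing` branches are dropped, which is safe here because every split sends missing down the `yes` child and the example inputs are never missing):

import math

def booster0(a, b):
    if b < -1:
        if a < 0:
            if a < -2:
                return 0.522581
            return 0.428571 if b < -3 else -0.333333
        return -0.54
    if a < 2:
        return -0.12 if a < -8 else -0.56129
    if b < 2:
        return -0.495652
    if a < 4:
        return -0.333333 if b < 7 else 0.333333
    return 0.456

def booster1(a, b):
    if b < -1:
        if a < 0:
            if b < -3:
                return 0.418665
            return 0.334676 if a < -3 else -0.282568
        return -0.424174
    if a < 2:
        return -0.048659 if b < 0 else -0.445149
    if b < 2:
        return -0.394495
    if a < 5:
        return -0.330064 if b < 7 else 0.333063
    return 0.392826

def booster2(a, b):
    if b < -1:
        if a < 0:
            if b < -3:
                return 0.356906
            return 0.289085 if a < -3 else -0.245992
        return -0.363819
    if a < 4:
        if a < 2:
            return -0.0403689 if b < 0 else -0.381402
        return -0.307704 if b < 7 else 0.239974
    if b < 2:
        return -0.308265
    return 0.302142

def predict(a, b):
    s = booster0(a, b) + booster1(a, b) + booster2(a, b)
    return 1.0 / (1.0 + math.exp(-s))

print(predict(4, 9))  # 0.748608563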

Gursel Karacor

I know this is an old thread, but in case someone looks here: there is a module called m2cgen (model-to-code generator) which can generate pure native code from a trained model, including xgboost (using the gbtree booster).

https://github.com/BayesWitnesses/m2cgen
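
A rough sketch of how it might be used (check the repo's README for the current API; the training data here is a random placeholder, and `export_to_c` is one of several documented target languages):

import m2cgen as m2c
import numpy as np
from xgboost import XGBClassifier

# Placeholder training data with 2 features and a binary label
X_train = np.random.rand(100, 2)
y_train = (X_train[:, 0] > 0.5).astype(int)

model = XGBClassifier(n_estimators=3).fit(X_train, y_train)

# Emit dependency-free C (close enough to MQL4 to port by hand);
# other targets such as export_to_python / export_to_java also exist
c_code = m2c.export_to_c(model)
print(c_code)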

eafpres
  • Yes, you can check my response and sample code in my answer to a similar post: https://stackoverflow.com/a/59511766/7164176 – Gursel Karacor Jan 18 '20 at 10:00