I train GBM models with H2O and want to use them in my backend (not Java). To do so, I download the MOJO, convert it to ONNX, and run it in my apps.

To run inference, I need to know how categorical columns are transformed into their one-hot encoded versions. I was able to find this in the POJO:

    static final void fill(String[] sa) {
      sa[0] = "Age";
      sa[1] = "Fare";
      sa[2] = "Pclass.1";
      sa[3] = "Pclass.2";
      sa[4] = "Pclass.3";
      sa[5] = "Pclass.missing(NA)";
      sa[6] = "Sex.female";
      sa[7] = "Sex.male";
      sa[8] = "Sex.missing(NA)";
    }
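
For illustration, here is a minimal Python sketch of how that listing could be turned into an input vector. It is not H2O's own code: the helper names `parse_feature_names` and `encode_row` are hypothetical, and it assumes that every name containing a dot is a `column.level` one-hot slot and that an unseen level leaves all slots of that column at zero. A real POJO contains several `fill` blocks, so only the relevant one should be passed in.

```python
import re

# Excerpt of the POJO shown above; in practice, read it from the .java file.
POJO_SNIPPET = '''
static final void fill(String[] sa) {
  sa[0] = "Age";
  sa[1] = "Fare";
  sa[2] = "Pclass.1";
  sa[3] = "Pclass.2";
  sa[4] = "Pclass.3";
  sa[5] = "Pclass.missing(NA)";
  sa[6] = "Sex.female";
  sa[7] = "Sex.male";
  sa[8] = "Sex.missing(NA)";
}
'''

def parse_feature_names(java_source):
    """Return the strings assigned as sa[i] = "...", ordered by index i."""
    pairs = re.findall(r'sa\[(\d+)\]\s*=\s*"([^"]*)";', java_source)
    names = [None] * len(pairs)
    for idx, name in pairs:
        names[int(idx)] = name
    return names

def encode_row(row, names):
    """Build the model input in POJO order (assumptions in the text above)."""
    vec = []
    for name in names:
        if "." not in name:
            # Assumed numeric column: passed through unchanged.
            vec.append(float(row[name]))
            continue
        column, level = name.split(".", 1)
        value = row.get(column)  # None means the value is missing
        if level == "missing(NA)":
            vec.append(1.0 if value is None else 0.0)
        else:
            vec.append(1.0 if value is not None and str(value) == level else 0.0)
    return vec

names = parse_feature_names(POJO_SNIPPET)
print(encode_row({"Age": 22, "Fare": 7.25, "Pclass": 3, "Sex": "male"}, names))
# [22.0, 7.25, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
```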

So, here is the workflow for a non-Java backend as I see it:

  1. Encode categorical features with OneHotExplicit.
  2. Train GBM model.
  3. Download MOJO and convert to ONNX.
  4. Download POJO and find feature alignment in the source code.
  5. Implement the inference in your backend.

Is this the most straightforward and correct way?

Maxim Blumental

2 Answers

Thank you for your question.

Can you access the stored categorical values here?

https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/algos/tree/SharedTreeMojoModel.java#L72

https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/algos/tree/SharedTreeMojoReader.java#L34

https://github.com/h2oai/h2o-3/blob/master/h2o-algos/src/main/java/hex/tree/SharedTreeMojoWriter.java#L61

The index of a level in that array is the integer that the categorical value is translated to.
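
As a rough Python illustration of that index lookup (not the H2O implementation): the function name `level_to_code` is hypothetical, the example domain simply reuses the levels from the question, and mapping missing or unknown values to NaN is an assumption of this sketch.

```python
import math

def level_to_code(domain, value):
    """Return the position of `value` in `domain` as a float.

    Missing (None) and unknown levels become NaN; that choice is an
    assumption of this sketch, not confirmed H2O behaviour.
    """
    if value is None:
        return math.nan
    try:
        return float(domain.index(str(value)))
    except ValueError:
        return math.nan

sex_domain = ["female", "male"]  # hypothetical domain for the "Sex" column
print(level_to_code(sex_domain, "male"))  # 1.0
```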

The EasyPredictModelWrapper does it this way:

https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/easy/RowToRawDataConverter.java#L44

  • I am not working with Java code. I program in Python. I cannot access the sources that you mentioned. My question was: what is the correct way to find the encoding of categorical values without Java? – Maxim Blumental Feb 16 '23 at 13:53

Can you access the model.ini file inside the zip? It has a [domains] section that lists files in the domains/ directory; each file holds the categorical encoding for one feature.

E.g.:

    [columns]
    AGE
    RACE
    DPROS
    DCAPS
    PSA
    VOL
    GLEASON
    CAPSULE

    [domains]
    7: 2 d000.txt

means that column 7 (CAPSULE, counting from 0) has 2 categorical levels, listed in d000.txt.
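
In Python, the layout described above could be read with the standard library alone. This is only a sketch under assumptions: the function name `read_mojo_domains` is hypothetical, the files are assumed to be UTF-8, and each domain file is assumed to hold one level per line with no escaping.

```python
import zipfile

def read_mojo_domains(mojo):
    """Map column name -> list of levels, using model.ini and domains/*.txt.

    `mojo` is a path to the MOJO zip or a file-like object.
    """
    with zipfile.ZipFile(mojo) as zf:
        columns, domain_files, section = [], {}, None
        for line in zf.read("model.ini").decode("utf-8").splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith("[") and line.endswith("]"):
                section = line[1:-1]  # e.g. "columns" or "domains"
            elif section == "columns":
                columns.append(line)
            elif section == "domains":
                # "7: 2 d000.txt" -> column index 7, 2 levels, file d000.txt
                col_idx, rest = line.split(":", 1)
                n_levels, file_name = rest.split()
                domain_files[int(col_idx)] = (int(n_levels), file_name)

        domains = {}
        for col_idx, (n_levels, file_name) in domain_files.items():
            # Assumption: one level per line, in index order.
            text = zf.read("domains/" + file_name).decode("utf-8")
            domains[columns[col_idx]] = text.splitlines()[:n_levels]
        return domains
```

Calling `read_mojo_domains("model.zip")` (the file name is a placeholder) on the example above would give a dictionary with a single key, `CAPSULE`, whose value is the list of levels in d000.txt; the position of a level in that list is its integer code.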

Alternatively, there is an experimental/modelDetails.json file that lists the categorical values under output.domains. The index in that list corresponds to the index of the feature in the output.names list.

E.g., output.domains[7] is the domain of the output.names[7] feature.
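
A short Python sketch of that lookup, assuming the JSON has the `output.names` and `output.domains` keys described above and that columns without a domain (numeric ones) are stored as null. Both function names are hypothetical.

```python
import json
import zipfile

def domains_from_model_details(details):
    """Pair output.names[i] with output.domains[i], skipping null domains."""
    output = details["output"]
    return {
        name: domain
        for name, domain in zip(output["names"], output["domains"])
        if domain is not None  # assumed: numeric columns have no domain
    }

def load_model_details(mojo_path):
    """Read experimental/modelDetails.json from the MOJO zip."""
    with zipfile.ZipFile(mojo_path) as zf:
        return json.loads(zf.read("experimental/modelDetails.json"))
```

With a MOJO on disk, `domains_from_model_details(load_model_details("model.zip"))` (placeholder file name) would return a mapping from each categorical feature to its ordered list of levels.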