DataType of InputField is double although in the PMMLPipeline it is string

Question

I am exporting a PMMLPipeline with a categorical string feature day_of_week as a PMML file. When I open the file in Java and list the InputFields I see that the data type of day_of_week field is double:

InputField{name=day_of_week, fieldName=day_of_week, displayName=null, dataType=double, opType=categorical}

Hence when I evaluate an input I get the error:

org.jpmml.evaluator.InvalidResultException: Field "day_of_week" cannot accept user input value "tuesday"

On the Python side the pipeline works with a string column:

data = pd.DataFrame(data=[{"age": 10, "day_of_week": "tuesday"}])
y = trained_model.predict(X=data)

Minimal example for creating the PMML file:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

if __name__ == '__main__':

    data_dict = {
        'age': [1, 2, 3],
        'day_of_week': ['monday', 'tuesday', 'wednesday'],
        'y': [5, 6, 7]
    }

    data = pd.DataFrame(data_dict, columns=data_dict)

    numeric_features = ['age']
    numeric_transformer = Pipeline(steps=[
        ('scaler', StandardScaler())])

    categorical_features = ['day_of_week']
    categorical_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore', categories='auto'))])

    preprocessor = ColumnTransformer(
        transformers=[
            ('numerical', numeric_transformer, numeric_features),
            ('categorical', categorical_transformer, categorical_features)])

    pipeline = PMMLPipeline(
        steps=[
            ('preprocessor', preprocessor),
            ('classifier', RandomForestRegressor(n_estimators=60))])

    X = data.drop(labels=['y'], axis=1)
    y = data['y']

    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=30)

    trained_model = pipeline.fit(X=X_train, y=y_train)
    sklearn2pmml(pipeline=pipeline, pmml='RandomForestRegressor2.pmml', with_repr=True)

EDIT: sklearn2pmml creates a PMML file with a DataDictionary whose DataField "day_of_week" has dataType="double". I think it should be "string". Do I have to set the dataType somewhere to correct this?

<DataDictionary>
    <DataField name="day_of_week" optype="categorical" dataType="double">
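
The declared types can also be checked from Python before the file ever reaches Java. The following is only a minimal sketch using the standard library: the embedded PMML fragment is a hypothetical stand-in for the exported file (a real file would be loaded with ET.parse), and the helper name data_field_types is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical PMML fragment for illustration only; a real export would be
# loaded with ET.parse("RandomForestRegressor2.pmml").getroot() instead.
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <DataDictionary>
    <DataField name="y" optype="continuous" dataType="double"/>
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="day_of_week" optype="categorical" dataType="double"/>
  </DataDictionary>
</PMML>"""


def data_field_types(root):
    """Map each DataField name to its (optype, dataType) pair."""
    fields = {}
    for element in root.iter():
        # Drop the "{namespace}" prefix so the check does not depend on the PMML version.
        if element.tag.rsplit("}", 1)[-1] == "DataField":
            fields[element.get("name")] = (element.get("optype"), element.get("dataType"))
    return fields


field_types = data_field_types(ET.fromstring(SAMPLE_PMML))

# Categorical fields declared as double are the ones that will reject string input.
suspicious = [name for name, (optype, dtype) in field_types.items()
              if optype == "categorical" and dtype == "double"]
print(suspicious)
```

Any field name printed here is declared as a categorical double, so the evaluator will refuse a string such as "tuesday" for it.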

score 3 · Accepted Answer · answered May 25 '20 at 11:27

You can assist SkLearn2PMML by providing "feature type hints" using the sklearn2pmml.decoration.CategoricalDomain and sklearn2pmml.decoration.ContinuousDomain decorators.

In the current case, you should prepend a CategoricalDomain step to the pipeline that deals with categorical features:

from sklearn2pmml.decoration import CategoricalDomain

categorical_transformer = Pipeline(steps=[
    ('domain', CategoricalDomain(dtype = str)),
    ('onehot', OneHotEncoder(handle_unknown='ignore', categories='auto'))
])

user1808924

Ok, it works. Is there any reason to prefer `CategoricalDomain` over `OrdinalEncoder`? I was testing this: `('ordinal', OrdinalEncoder(categories='auto'))` – Leevi L May 25 '20 at 14:16
"Ordinal" means "Ordered Categorical". While days of the week (Mon, Tue, Wed) can be regarded as ordinal, it's better to keep it simple and stick with categorical. Besides, the Scikit-Learn framework does not have the concept of ordinal features, so this extra detail goes to waste anyway. – user1808924 May 25 '20 at 15:18
Hello, I am also working on this. I faced the same problem when using `OrdinalEncoder` in the pipeline: the field had dataType="double", as this thread describes. If `CategoricalDomain` has to be added as an extra step to work around it, isn't that a bug in its own right? In Java it fails. – Aayush Shah Aug 19 '22 at 07:08

score 1 · Answer 2 · answered Aug 19 '22 at 07:34

Thanks for your reply, @user1808924. The given solution works. To add to that answer, I would like to note that a CategoricalDomain handles a single feature only.

Problem:

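So, when you use it in a pipeline like this:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn2pmml.decoration import CategoricalDomain

# pipeline creation: a single CategoricalDomain in front of the encoder
categorical_transformer = Pipeline(steps=[
    ('domain', CategoricalDomain(dtype = str)),
    ('ordinal', OrdinalEncoder())
])

# fit and transform of `df` with 3 features
categorical_transformer.fit_transform(df)

# ERROR: Expected 1d array, got 2d array of shape (1000, 3)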

This means you need one CategoricalDomain per feature.

NOTE: This transformer is often used inside a ColumnTransformer as well. Either way, you need to know beforehand how many categorical features there are.

What can we do?

We can simply use MultiDomain from the same library.

from sklearn2pmml.decoration import MultiDomain

categorical_transformer = Pipeline(steps=[
    ('domain', MultiDomain([CategoricalDomain(dtype = str) for _ in range(3)])),
    ('ordinal', OrdinalEncoder())
])

Note that 3 is the number of categorical columns, so the MultiDomain holds one CategoricalDomain per categorical column.

Then performing the transformation will work.
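
Instead of hard-coding the 3, the count can be derived from the data. This is only a sketch, assuming the categorical features are exactly the string-typed columns of the DataFrame; the frame below and the names categorical_columns and n_categorical are made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

# Hypothetical frame with two string columns and one numeric column.
df = pd.DataFrame({
    "age": [10, 25, 40],
    "day_of_week": ["monday", "tuesday", "wednesday"],
    "city": ["Oslo", "Turku", "Riga"],
})

# Assumption: every string-typed column is a categorical feature.
categorical_columns = [name for name in df.columns if is_string_dtype(df[name])]
n_categorical = len(categorical_columns)

print(categorical_columns, n_categorical)
```

n_categorical can then replace the literal in range(3), and categorical_columns can serve as the column list passed to the ColumnTransformer, so the two stay in sync.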
