
I am having some trouble understanding what mljar-supervised does with data.

I am trying to replicate the model in CatBoost standalone.

My doubts about y:

in framework.json I have:

    "preprocessing": [
        {
            "scale_y": {
                "scale": [
                    0.025919186407451625
                ],
                "mean": [
                    0.994522478250971
                ],
                "var": [
                    0.0006718042240242251
                ],
                "n_samples_seen": 75000,
                "n_features_in": 1,
                "columns": [
                    "target"
                ],
                "scale_method": "scale_normal"

My guess is that:

y is processed by a StandardScaler (saving mean, scale and var) and then sent to the regressor.

When predicting, the regressor returns y_predicted in the scaled space, which is then passed through StandardScaler's inverse_transform using the saved mean, scale and var?
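If my guess is right and scale_normal is just sklearn's StandardScaler, the round trip with the constants from my framework.json would look like this (the sample y values are made up):

```python
import numpy as np

# Constants copied from the scale_y block in framework.json above.
mean = np.array([0.994522478250971])
scale = np.array([0.025919186407451625])

# Made-up target values for illustration.
y = np.array([[1.01], [0.97], [0.99]])

# Forward transform, as StandardScaler would do it: (y - mean) / scale
y_scaled = (y - mean) / scale

# Inverse transform applied to predictions: y_pred * scale + mean
y_restored = y_scaled * scale + mean

assert np.allclose(y_restored, y)
```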

My doubts about X:

in data_info.json I have:

    "rows": 100000,
    "cols": 143,
    "target_is_numeric": true,
    "columns_info": {
        "a0": [
            "scale"
        ],
        "b0": [
            "scale"
        ],
        "c0": [
            "scale"
        ],
        "d0": [
            "scale"
        ],

What does "scale" mean here? Is it also a StandardScaler?

So X is processed by a StandardScaler too and then sent to the regressor?

So the entire process is like this?:

  1. X preprocessing, StandardScaler
  2. y preprocessing, StandardScaler, mean, scale and var saved
  3. regression training with X transformed and y transformed
  4. predict returns y in the transformed space, and the results are passed through inverse_transform to give the final predictions?
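The four steps above could be sketched like this, assuming "scale" and "scale_normal" both map to sklearn's StandardScaler (synthetic data and a LinearRegression stand-in for the regressor, just to keep the sketch self-contained):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression  # stand-in for the regressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 5.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

x_scaler = StandardScaler()  # step 1: X preprocessing
y_scaler = StandardScaler()  # step 2: y preprocessing (mean/scale/var kept on the scaler)

X_t = x_scaler.fit_transform(X)
y_t = y_scaler.fit_transform(y.reshape(-1, 1)).ravel()

# step 3: train on the transformed X and transformed y
model = LinearRegression().fit(X_t, y_t)

# step 4: predict in the scaled space, then invert back to the original units
y_pred = y_scaler.inverse_transform(model.predict(X_t).reshape(-1, 1)).ravel()
```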

My code so far is:

    from sklearn import preprocessing
    from catboost import CatBoostRegressor, Pool

    # Separate scalers so X and y each keep their own fitted statistics
    y_scaler = preprocessing.StandardScaler()
    x_scaler = preprocessing.StandardScaler()

    y_scaled = y_scaler.fit_transform(y.values.reshape(-1, 1))

    y_mean = y_scaler.mean_
    y_scale = y_scaler.scale_
    y_var = y_scaler.var_

    dfx[dfx.columns] = x_scaler.fit_transform(dfx[dfx.columns])

    # First 75% of rows for training
    X_train = dfx.iloc[0:int(dfx.shape[0] * 0.75), :]
    y_train = y_scaled[0:int(y.shape[0] * 0.75)]

    # Last 10% of rows for evaluation
    eval_pool = Pool(
        dfx.iloc[-int(dfx.shape[0] * 0.10):, :],
        y_scaled[-int(y.shape[0] * 0.10):]
    )

    model = CatBoostRegressor(iterations=10000, learning_rate=0.1, depth=6)

    model.fit(X_train, y_train, eval_set=eval_pool, plot=True, verbose=False,
              early_stopping_rounds=50)

But the results are very different from those returned by mljar-supervised.
