I am having some trouble understanding what mljar-supervised does with data.
I am trying to replicate the model in CatBoost standalone.
My doubts about y:
in framework.json I have:
"preprocessing": [
{
"scale_y": {
"scale": [
0.025919186407451625
],
"mean": [
0.994522478250971
],
"var": [
0.0006718042240242251
],
"n_samples_seen": 75000,
"n_features_in": 1,
"columns": [
"target"
],
"scale_method": "scale_normal"
My guess is that:
y is processed by StandardScaler (saving mean, scale and var) and then sent to the regressor.
At predict time the regressor returns y_predicted in scaled space,
which is then mapped back with StandardScaler.inverse_transform using the saved mean, scale and var?
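If that guess is right (my assumption; I have not confirmed it in the mljar source), a quick round-trip check with a toy target behaves consistently with what framework.json stores:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy target to check the round trip I am assuming mljar does
y = np.array([0.95, 0.97, 1.00, 1.02, 1.05]).reshape(-1, 1)

scaler = StandardScaler()
y_scaled = scaler.fit_transform(y)

# After fit, the scaler exposes exactly the fields stored in framework.json
print(scaler.mean_, scaler.scale_, scaler.var_)

# inverse_transform undoes the scaling: y = y_scaled * scale_ + mean_
y_back = scaler.inverse_transform(y_scaled)
print(np.allclose(y_back, y))  # True
```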
My doubts about X:
in data_info.json I have:
"rows": 100000,
"cols": 143,
"target_is_numeric": true,
"columns_info": {
"a0": [
"scale"
],
"b0": [
"scale"
],
"c0": [
"scale"
],
"d0": [
"scale"
],
What does "scale" do? Also StandardScaler?
So X is processed by StandardScaler and then sent to the regressor?
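Assuming "scale" means the same per-column StandardScaler (again my reading of the config, not something I verified), the X side would look like this:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame reusing the column names from data_info.json
dfx = pd.DataFrame({"a0": [1.0, 2.0, 3.0, 4.0],
                    "b0": [10.0, 20.0, 30.0, 40.0]})

x_scaler = StandardScaler()
X_scaled = x_scaler.fit_transform(dfx)

# Each column is centered to mean 0 with unit (population) std
print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```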
So, the entire process is something like this?:
- X preprocessing: StandardScaler
- y preprocessing: StandardScaler, with mean, scale and var saved
- regression training with X transformed and y transformed
- predict returns y in transformed space, and inverse_transform with the saved values gives the final results?
My code so far is:
import pandas as pd
from catboost import CatBoostRegressor, Pool
from sklearn.preprocessing import StandardScaler

# Separate scalers for y and X; reusing one instance overwrites its
# mean_/scale_/var_ when fit_transform is called a second time
y_scaler = StandardScaler()
y_scaled = y_scaler.fit_transform(y.values.reshape(-1, 1))
y_mean = y_scaler.mean_
y_scale = y_scaler.scale_
y_var = y_scaler.var_

x_scaler = StandardScaler()
dfx[dfx.columns] = x_scaler.fit_transform(dfx[dfx.columns])

# First 75% for training, last 10% for evaluation
X_train = dfx.iloc[0:int(dfx.shape[0] * 0.75), :]
y_train = y_scaled[0:int(y.shape[0] * 0.75)]
eval_pool = Pool(
    dfx.iloc[-int(dfx.shape[0] * 0.10):, :],
    y_scaled[-int(y.shape[0] * 0.10):]
)

model = CatBoostRegressor(iterations=10000, learning_rate=0.1, depth=6)
model.fit(X_train, y_train, eval_set=eval_pool, plot=True, verbose=False,
          early_stopping_rounds=50)
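If the pipeline above is right, one thing my code is missing is mapping the predictions back to the original target scale. A sketch, rebuilding the y scaler from the values saved in framework.json (assuming scale_normal really is StandardScaler; the prediction array is a stand-in for model.predict):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Rebuild the y scaler from framework.json instead of refitting
y_scaler = StandardScaler()
y_scaler.mean_ = np.array([0.994522478250971])
y_scaler.scale_ = np.array([0.025919186407451625])
y_scaler.var_ = np.array([0.0006718042240242251])
y_scaler.n_features_in_ = 1  # matches "n_features_in" in the JSON

# Stand-in for model.predict(X_test); real predictions would go here
y_pred_scaled = np.array([-1.0, 0.0, 1.0])

# Back to the original scale: y = y_scaled * scale_ + mean_
y_pred = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1)).ravel()
print(y_pred)
```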
But the results are very different from the results returned by mljar-supervised.