I'm trying to build a prediction model in auto-sklearn with 10-fold cross-validation. My dataset has about 40k rows and 80 features. Here is my code (where X holds my features and y is the continuous outcome variable):

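import autosklearn.regression

# 1-hour total budget, at most 10 minutes per candidate pipeline, 10-fold CV
automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=3600, per_run_time_limit=600,
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 10})
automl.fit(X, y, dataset_name='unused', feat_type=feature_types)
# With resampling_strategy='cv', refit retrains the selected models on all the data
automl.refit(X.copy(), y.copy())
automl.cv_results_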

The output from the last line is a little confusing to me:

{'mean_fit_time': array([6.00111840e+02, 1.76325102e+01, 1.68442428e+01, 1.68408656e+00,
    9.08970833e-01, 1.73636928e+01, 5.83850384e-01, 8.99704933e-01,
    1.77676334e+01, 8.56771708e-01, 1.58957437e+02, 6.00050516e+02,
    6.00073232e+02, 1.72906122e+01, 6.00116965e+02, 6.00113743e+02,
    3.24114606e+02]),
 'mean_test_score': array([0.       , 0.2108587, 0.       , 0.       , 0.       , 0.       ,
    0.       , 0.       , 0.       , 0.       , 0.2108587, 0.       ,
    0.       , 0.2108587, 0.       , 0.       , 0.       ]),

[results text is longer but I've deleted it due to character limits]

 'rank_test_scores': array([4, 1, 4, 4, 4, 4, 4, 4, 4, 4, 1, 4, 4, 1, 4, 4, 4]),
 'status': ['Timeout',   'Success',   'Memout',   'Crash',   'Memout',   'Memout',   'Crash',   'Crash',   'Crash',   'Memout',   'Success',   'Timeout',   'Crash',   'Success',   'Timeout',   'Timeout',   'Timeout']}

There is no mean_train_score, and it seems that many values in mean_test_score are missing (they show up as 0). Am I doing something wrong? I get the same issue when I allow my model to run for longer. I also get a worse R2 when I run 10-fold cross-validation than when I don't.

Any guidance would be appreciated. Yara.

Yara
  • First, I would advise you to use a pandas DataFrame for storing cv_results_. Do the following: `import pandas as pd`, `pd.DataFrame(automl.cv_results_)`. Now you can write this to CSV or Excel using `pd.DataFrame(automl.cv_results_).to_csv('cv_results.csv')` or the corresponding `to_excel` method. Is there still something missing? – pythonic833 Mar 30 '18 at 19:29
  • The output using your code is pretty awesome. Thank you!! Unfortunately, this doesn't solve the issue of why the results don't contain a mean_train_score or why my R2 is lower when I use 10-fold cross-validation. I wonder if maybe I should post this question directly on the auto-sklearn GitHub... – Yara Mar 30 '18 at 19:38
  • R2 is lower if you use cross-validation than if you don't? That is pretty normal: the classifier does not see the test data before it is tested on it, so it explains less variance than if you don't use CV. – pythonic833 Mar 30 '18 at 19:41
  • I did manually check the R2 on a test dataset (which the regressor, in this case, did not see), and it was not as low as 0.21. This also doesn't explain why the R2 values from the training datasets are not shown. I'm just not sure what is going on. – Yara Mar 31 '18 at 00:32
  • Look at the `'status'`. Only those where value is `'Success'` will have valid values in other arrays. – Vivek Kumar Apr 06 '18 at 11:34
  • I see. I wonder how to troubleshoot this. There's no clue as to why the cross-validations are failing or why there are 17 instances rather than the 10 I was expecting with 10 folds. – Yara Apr 06 '18 at 18:38
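
To make the point about `'status'` concrete, here is a minimal sketch that uses only the values pasted in the question (no auto-sklearn needed). It counts the run outcomes and keeps the scores of successful runs only. The helper name `summarize` is made up for this illustration. As far as I understand auto-sklearn, each entry in cv_results_ corresponds to one candidate pipeline configuration that was tried, not to one fold, which would explain 17 entries with 10 folds; treat that reading as an assumption to check against the auto-sklearn documentation.

```python
from collections import Counter

# Values copied from the cv_results_ output shown in the question.
status = ['Timeout', 'Success', 'Memout', 'Crash', 'Memout', 'Memout',
          'Crash', 'Crash', 'Crash', 'Memout', 'Success', 'Timeout',
          'Crash', 'Success', 'Timeout', 'Timeout', 'Timeout']
mean_test_score = [0.0, 0.2108587, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
                   0.2108587, 0.0, 0.0, 0.2108587, 0.0, 0.0, 0.0]


def summarize(status, scores):
    """Hypothetical helper: count outcomes, keep (index, score) of successful runs."""
    counts = Counter(status)
    successful = [(i, score)
                  for i, (state, score) in enumerate(zip(status, scores))
                  if state == 'Success']
    return counts, successful


counts, successful = summarize(status, mean_test_score)
print(dict(counts))   # how many runs ended in each state
print(successful)     # only these entries carry a meaningful score
```

In this output only 3 of the 17 runs succeeded, and every 0 in mean_test_score belongs to a run that timed out, ran out of memory, or crashed, so the zeros are placeholders for failed runs and not real scores. With pandas, the same filter on the DataFrame suggested above would be something like `df[df['status'] == 'Success']`.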

0 Answers