
I am trying to use GridSearchCV and RandomizedSearchCV to find the best parameters for two unsupervised novelty-detection algorithms from sklearn, namely OneClassSVM and LocalOutlierFactor.

Below is the function I wrote (adapted from this example):

import numpy as np
from time import time
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV


def gridsearch(clf, param_dist_rand, param_grid_exhaustive, X):

    def report(results, n_top=3):
        # Print the top-ranked parameter settings found in cv_results_.
        for i in range(1, n_top + 1):
            candidates = np.flatnonzero(results['rank_test_score'] == i)
            for candidate in candidates:
                print("Model with rank: {0}".format(i))
                print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                    results['mean_test_score'][candidate],
                    results['std_test_score'][candidate]))
                print("Parameters: {0}".format(results['params'][candidate]))
                print("")

    n_iter_search = 20
    random_search = RandomizedSearchCV(clf, param_distributions=param_dist_rand,
                                       n_iter=n_iter_search, cv=5,
                                       error_score=np.NaN, scoring='accuracy')

    start = time()
    random_search.fit(X)
    print("RandomizedSearchCV took %.2f seconds for %d candidate"
          " parameter settings." % (time() - start, n_iter_search))
    report(random_search.cv_results_)

    grid_search = GridSearchCV(clf, param_grid=param_grid_exhaustive,
                               cv=5, error_score=np.NaN, scoring='accuracy')
    start = time()
    grid_search.fit(X)
    print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
          % (time() - start, len(grid_search.cv_results_['params'])))
    report(grid_search.cv_results_)

To test the function above I have the following code:

import scipy.stats
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor

# W is the held-out 20% split; only X is passed to the searches.
X, W = train_test_split(all_data, test_size=0.2, random_state=42)

clf_lof = LocalOutlierFactor(novelty=True, contamination='auto')
lof_param_dist_rand = {'n_neighbors': np.arange(6, 101, 1),
                       'leaf_size': np.arange(30, 101, 10)}
lof_param_grid_exhaustive = {'n_neighbors': np.arange(6, 101, 1),
                             'leaf_size': np.arange(30, 101, 10)}
gridsearch(clf=clf_lof, param_dist_rand=lof_param_dist_rand,
           param_grid_exhaustive=lof_param_grid_exhaustive, X=X)

clf_svm = svm.OneClassSVM()
svm_param_dist_rand = {'nu': np.arange(0, 1.1, 0.01),
                       'kernel': ['rbf', 'linear', 'poly', 'sigmoid'],
                       'degree': np.arange(0, 7, 1),
                       'gamma': scipy.stats.expon(scale=.1)}
svm_param_grid_exhaustive = {'nu': np.arange(0, 1.1, 0.01),
                             'kernel': ['rbf', 'linear', 'poly', 'sigmoid'],
                             'degree': np.arange(0, 7, 1),
                             'gamma': [0.25]}
gridsearch(clf=clf_svm, param_dist_rand=svm_param_dist_rand,
           param_grid_exhaustive=svm_param_grid_exhaustive, X=X)

Initially, I did not set the scoring parameter for either search and I got this error:

TypeError: If no scoring is specified, the estimator passed should have a 'score' method.

I then added scoring='accuracy' since I want to use the test accuracy to judge the performance of the different model configurations. Now I am getting this error:

TypeError: __call__() missing 1 required positional argument: 'y_true'

I do not have labels, since I only have data from one class and none from counter-classes, so I do not know how to get past this error. Additionally, I looked at what was suggested in this question, but it did not help me. Any help would be highly appreciated.
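One direction I am exploring (a minimal sketch, not a confirmed fix): scoring can also be a callable with the signature scorer(estimator, X, y), so a label-free criterion such as the mean decision_function value on the validation fold could stand in for accuracy. The name one_class_scorer below is my own, not from sklearn:

import numpy as np

def one_class_scorer(estimator, X, y=None):
    # Label-free proxy score: a higher mean decision_function value means
    # the fitted model treats the held-out points as more "normal".
    return np.mean(estimator.decision_function(X))

# e.g. pass scoring=one_class_scorer instead of scoring='accuracy' in both searches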

Edit: As per @FChm's suggestion of providing sample data, please find the sample .csv data file here. A short description of the file: it consists of four columns of features (PCA-generated) that I feed into my models.

  • People may be more likely to help if you provide a link to some example data. – FChm Mar 04 '19 at 13:12
  • @FChm I have added a file with the sample data. Thanks for the suggestion. – TMK Mar 04 '19 at 13:51
  • I think the discussion following this question might be useful: https://stackoverflow.com/questions/34611038/grid-search-for-hyperparameter-evaluation-of-clustering-in-scikit-learn In short, it is not a very intuitive choice to use GridSearchCV without y_true, because you basically do not need any cross validation anymore. – MaximeKan Mar 05 '19 at 04:11
  • Thanks @MaximeKan. I have not found an option of using GridSearch without CV. Still searching though. – TMK Mar 05 '19 at 11:09
  • There is a way suggested here: https://stackoverflow.com/questions/44636370/scikit-learn-gridsearchcv-without-cross-validation-unsupervised-learning/44661188 but it's no longer applicable, I tried it. – TMK Mar 05 '19 at 11:23
  • @TMK a GridSearch without CV is basically just one or several for loops. You can find one example here : https://stackoverflow.com/questions/54936518/how-do-i-automate-the-number-of-clusters/54937191#54937191 – MaximeKan Mar 06 '19 at 03:40
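Following up on @MaximeKan's last comment: a search without CV would reduce to a plain loop over the parameter combinations. A minimal sketch of that idea (ParameterGrid and clone are sklearn utilities; the mean-decision_function criterion is my own assumption, not an established metric):

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import ParameterGrid

def manual_search(clf, param_grid, X_train, X_val):
    # Fit one model per parameter combination and keep the best one
    # according to a label-free criterion.
    best_score, best_params = -np.inf, None
    for params in ParameterGrid(param_grid):
        model = clone(clf).set_params(**params)
        model.fit(X_train)
        score = np.mean(model.decision_function(X_val))
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score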

0 Answers