I am trying to use GridSearchCV and RandomizedSearchCV to find the best parameters for two unsupervised learning algorithms (for novelty detection), namely OneClassSVM and LocalOutlierFactor, both from sklearn. Below is the function I wrote (adapted from this example):
import numpy as np
from time import time
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

def gridsearch(clf, param_dist_rand, param_grid_exhaustive, X):
    # Utility to report the top-ranked candidates of a finished search.
    def report(results, n_top=3):
        for i in range(1, n_top + 1):
            candidates = np.flatnonzero(results['rank_test_score'] == i)
            for candidate in candidates:
                print("Model with rank: {0}".format(i))
                print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                    results['mean_test_score'][candidate],
                    results['std_test_score'][candidate]))
                print("Parameters: {0}".format(results['params'][candidate]))
                print("")

    n_iter_search = 20
    random_search = RandomizedSearchCV(clf, param_distributions=param_dist_rand,
                                       n_iter=n_iter_search, cv=5,
                                       error_score=np.nan, scoring='accuracy')
    start = time()
    random_search.fit(X)
    print("RandomizedSearchCV took %.2f seconds for %d candidate"
          " parameter settings." % ((time() - start), n_iter_search))
    report(random_search.cv_results_)

    grid_search = GridSearchCV(clf, param_grid=param_grid_exhaustive,
                               cv=5, error_score=np.nan, scoring='accuracy')
    start = time()
    grid_search.fit(X)
    print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
          % (time() - start, len(grid_search.cv_results_['params'])))
    report(grid_search.cv_results_)
To test the function above, I have the following code:

import scipy.stats
from sklearn import svm
from sklearn.neighbors import LocalOutlierFactor
from sklearn.model_selection import train_test_split

X, W = train_test_split(all_data, test_size=0.2, random_state=42)

clf_lof = LocalOutlierFactor(novelty=True, contamination='auto')
lof_param_dist_rand = {'n_neighbors': np.arange(6, 101, 1),
                       'leaf_size': np.arange(30, 101, 10)}
lof_param_grid_exhaustive = {'n_neighbors': np.arange(6, 101, 1),
                             'leaf_size': np.arange(30, 101, 10)}
gridsearch(clf=clf_lof, param_dist_rand=lof_param_dist_rand,
           param_grid_exhaustive=lof_param_grid_exhaustive, X=X)

clf_svm = svm.OneClassSVM()
# nu must lie in (0, 1], so the range starts just above zero.
svm_param_dist_rand = {'nu': np.arange(0.01, 1.01, 0.01),
                       'kernel': ['rbf', 'linear', 'poly', 'sigmoid'],
                       'degree': np.arange(0, 7, 1),
                       'gamma': scipy.stats.expon(scale=.1)}
# GridSearchCV expects a list/array for every entry, hence [0.25] for gamma.
svm_param_grid_exhaustive = {'nu': np.arange(0.01, 1.01, 0.01),
                             'kernel': ['rbf', 'linear', 'poly', 'sigmoid'],
                             'degree': np.arange(0, 7, 1),
                             'gamma': [0.25]}
gridsearch(clf=clf_svm, param_dist_rand=svm_param_dist_rand,
           param_grid_exhaustive=svm_param_grid_exhaustive, X=X)
Initially, I did not set the scoring parameter for either search method, and I got this error:
TypeError: If no scoring is specified, the estimator passed should have a 'score' method.
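If I understand the error correctly, it occurs because neither estimator exposes a score method. A quick check (my own sanity check, not something from the docs) appears to confirm this:

from sklearn import svm
from sklearn.neighbors import LocalOutlierFactor

# Neither novelty detector defines a default score method.
print(hasattr(svm.OneClassSVM(), 'score'))                 # False
print(hasattr(LocalOutlierFactor(novelty=True), 'score'))  # False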
I then added scoring='accuracy', since I want to use the test accuracy to judge the performance of the different model configurations. Now I am getting this error:
TypeError: __call__() missing 1 required positional argument: 'y_true'
I do not have labels, since I only have data from one class and none from any counter-class, so I do not know how to get past this error; the one workaround I have tried is sketched below. I also looked at what was suggested in this question, but it did not help me. Any help would be highly appreciated.
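The workaround I experimented with is a custom, label-free scorer that ranks candidates by the mean score_samples value on the held-out fold (the name mean_score_samples and the assumption that higher is better are my own, not anything from the sklearn docs). It plugs into the gridsearch function above in place of scoring='accuracy':

import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical label-free scorer: average score_samples on the held-out
# fold; assumes a higher mean score means a better fit to the inliers.
def mean_score_samples(estimator, X, y=None):
    return estimator.score_samples(X).mean()

random_search = RandomizedSearchCV(clf_lof, param_distributions=lof_param_dist_rand,
                                   n_iter=20, cv=5, error_score=np.nan,
                                   scoring=mean_score_samples)
random_search.fit(X)

I believe this avoids the y_true error because the scorer never touches labels, but I am not sure whether it is a statistically sound way to score novelty detectors, so I would welcome a better approach.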
Edit:
As per @FChm's suggestion to provide sample data, please find the sample .csv data file here. A short description of the file: it consists of four columns of features (PCA-generated) that I feed into my models.
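For reference, I read the file in along these lines (the filename sample.csv is a placeholder for the linked file):

import pandas as pd

# Hypothetical loading step; 'sample.csv' stands in for the linked file.
all_data = pd.read_csv('sample.csv')
print(all_data.shape)  # rows x 4 PCA feature columns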