I am trying out ML classification models (Logistic Regression and SVM) with different C-parameter values on scikit-learn's breast cancer dataset:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# X_train, X_test, y_train, y_test are a train/test split of
# scikit-learn's breast cancer dataset (split code not shown here)

C_values = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]
models_list2 = []

# Loop to add a result row per C value for Logistic Regression
for c_value in C_values:
    logR_clf_c = LogisticRegression(C=c_value, random_state=42).fit(X_train, y_train)
    model = "Logistic Regression"
    C_value = f"C-value: {c_value}"
    train_acc = "Train accuracy: {:.3f}".format(logR_clf_c.score(X_train, y_train))
    test_acc = "Test accuracy: {:.3f}".format(logR_clf_c.score(X_test, y_test))
    models_list2.append([model, C_value, train_acc, test_acc])

# Loop to add a result row per C value for SVM
for c_value in C_values:
    svc_clf_c = LinearSVC(C=c_value, random_state=42).fit(X_train, y_train)
    model = "SVM"
    C_value = f"C-value: {c_value}"
    train_acc = "Train accuracy: {:.3f}".format(svc_clf_c.score(X_train, y_train))
    test_acc = "Test accuracy: {:.3f}".format(svc_clf_c.score(X_test, y_test))
    models_list2.append([model, C_value, train_acc, test_acc])

# sortOnTestAcc is a small helper that sorts the rows on test accuracy
models_list2 = sortOnTestAcc(models_list2)
print(*models_list2, sep='\n')
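For reference, sortOnTestAcc is just a small helper that orders the result rows by the test-accuracy column, highest first; roughly along these lines (a sketch, the exact implementation doesn't matter for the question):

def sortOnTestAcc(rows):
    # Each row looks like ['SVM', 'C-value: 1', 'Train accuracy: 0.908', 'Test accuracy: 0.909'];
    # parse the numeric test accuracy from the last entry and sort descending on it
    return sorted(rows, key=lambda row: float(row[3].split(": ")[1]), reverse=True)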
The output gives the following:
['Logistic Regression', 'C-value: 1000', 'Train accuracy: 0.981', 'Test accuracy: 0.972']
['Logistic Regression', 'C-value: 10', 'Train accuracy: 0.965', 'Test accuracy: 0.965']
['Logistic Regression', 'C-value: 100', 'Train accuracy: 0.972', 'Test accuracy: 0.965']
['Logistic Regression', 'C-value: 10000', 'Train accuracy: 0.977', 'Test accuracy: 0.965']
['Logistic Regression', 'C-value: 100000', 'Train accuracy: 0.972', 'Test accuracy: 0.965']
['Logistic Regression', 'C-value: 1', 'Train accuracy: 0.953', 'Test accuracy: 0.958']
['Logistic Regression', 'C-value: 0.1', 'Train accuracy: 0.944', 'Test accuracy: 0.944']
['Logistic Regression', 'C-value: 0.001', 'Train accuracy: 0.923', 'Test accuracy: 0.937']
['Logistic Regression', 'C-value: 0.01', 'Train accuracy: 0.934', 'Test accuracy: 0.930']
['SVM', 'C-value: 0.001', 'Train accuracy: 0.937', 'Test accuracy: 0.930']
['SVM', 'C-value: 0.01', 'Train accuracy: 0.934', 'Test accuracy: 0.930']
['Logistic Regression', 'C-value: 0.0001', 'Train accuracy: 0.920', 'Test accuracy: 0.923']
['SVM', 'C-value: 0.0001', 'Train accuracy: 0.927', 'Test accuracy: 0.923']
['SVM', 'C-value: 1', 'Train accuracy: 0.908', 'Test accuracy: 0.909']
['SVM', 'C-value: 10', 'Train accuracy: 0.908', 'Test accuracy: 0.909']
['SVM', 'C-value: 100', 'Train accuracy: 0.908', 'Test accuracy: 0.909']
['SVM', 'C-value: 1000', 'Train accuracy: 0.908', 'Test accuracy: 0.909']
['SVM', 'C-value: 10000', 'Train accuracy: 0.908', 'Test accuracy: 0.909']
['SVM', 'C-value: 100000', 'Train accuracy: 0.908', 'Test accuracy: 0.909']
['SVM', 'C-value: 0.1', 'Train accuracy: 0.836', 'Test accuracy: 0.811']
Now, I understand the concept of the C-parameter: it applies more or less regularization, and so determines the trade-off between generalization and training-set performance. However, when looking at the output above, I don't fully get the intuition.
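As I understand it, in both estimators C only scales the data-fit term of the objective, roughly:

$$\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^2 \;+\; C \sum_{i=1}^{n} \ell\big(y_i,\ w^\top x_i + b\big)$$

where the loss is the log loss for LogisticRegression and the (squared) hinge loss for LinearSVC, so a larger C puts more weight on fitting the training points and a smaller C puts more weight on keeping the weights small.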
For Logistic Regression, we observe that a relatively high C-value (so less regularization, at the risk of overfitting) gives the best results. Since the data has a relatively large number of features (30), this matches the intuition that a more complex model, with more emphasis on the training data, performs better here.
For the SVM, I don't completely get why the training score also improves with a relatively low C-value (0.001) compared to a higher one (e.g. 0.01). The test score makes sense: with more regularization (and hence softer margins), generalization improves. But how do we explain that even the training score improves, when we put less emphasis on it?