
I am trying out ML classification models (Logistic Regression and SVM) with different C-parameter values on scikit-learn's breast cancer dataset:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

C_values = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]
models_list2 = []

# Loop to add model per c_value for Logistic Regression
for c_value in C_values:
    logR_clf_c = LogisticRegression(C=c_value, random_state=42).fit(X_train, y_train)
    
    model = "Logistic Regression"
    C_value = f"C-value: {c_value}"
    train_acc = "Train accuracy: {:.3f}".format(logR_clf_c.score(X_train, y_train))
    test_acc = "Test accuracy: {:.3f}".format(logR_clf_c.score(X_test, y_test))
    
    models_list2.append([model, C_value, train_acc, test_acc])
    
# Loop to add model per c_value for SVM
for c_value in C_values:
    svc_clf_c = LinearSVC(C=c_value, random_state=42).fit(X_train, y_train)
    
    model = "SVM"
    C_value = f"C-value: {c_value}"
    train_acc = "Train accuracy: {:.3f}".format(svc_clf_c.score(X_train, y_train))
    test_acc = "Test accuracy: {:.3f}".format(svc_clf_c.score(X_test, y_test))
    
    models_list2.append([model, C_value, train_acc, test_acc])

models_list2 = sortOnTestAcc(models_list2)  # custom helper: sorts by test accuracy, descending
print(*models_list2, sep='\n')

The output gives the following:

['Logistic Regression', 'C-value: 1000', 'Train accuracy: 0.981', 'Test accuracy: 0.972']
['Logistic Regression', 'C-value: 10', 'Train accuracy: 0.965', 'Test accuracy: 0.965']
['Logistic Regression', 'C-value: 100', 'Train accuracy: 0.972', 'Test accuracy: 0.965']
['Logistic Regression', 'C-value: 10000', 'Train accuracy: 0.977', 'Test accuracy: 0.965']
['Logistic Regression', 'C-value: 100000', 'Train accuracy: 0.972', 'Test accuracy: 0.965']
['Logistic Regression', 'C-value: 1', 'Train accuracy: 0.953', 'Test accuracy: 0.958']
['Logistic Regression', 'C-value: 0.1', 'Train accuracy: 0.944', 'Test accuracy: 0.944']
['Logistic Regression', 'C-value: 0.001', 'Train accuracy: 0.923', 'Test accuracy: 0.937']
['Logistic Regression', 'C-value: 0.01', 'Train accuracy: 0.934', 'Test accuracy: 0.930']
['SVM', 'C-value: 0.001', 'Train accuracy: 0.937', 'Test accuracy: 0.930']
['SVM', 'C-value: 0.01', 'Train accuracy: 0.934', 'Test accuracy: 0.930']
['Logistic Regression', 'C-value: 0.0001', 'Train accuracy: 0.920', 'Test accuracy: 0.923']
['SVM', 'C-value: 0.0001', 'Train accuracy: 0.927', 'Test accuracy: 0.923']
['SVM', 'C-value: 1', 'Train accuracy: 0.908', 'Test accuracy: 0.909']
['SVM', 'C-value: 10', 'Train accuracy: 0.908', 'Test accuracy: 0.909']
['SVM', 'C-value: 100', 'Train accuracy: 0.908', 'Test accuracy: 0.909']
['SVM', 'C-value: 1000', 'Train accuracy: 0.908', 'Test accuracy: 0.909']
['SVM', 'C-value: 10000', 'Train accuracy: 0.908', 'Test accuracy: 0.909']
['SVM', 'C-value: 100000', 'Train accuracy: 0.908', 'Test accuracy: 0.909']
['SVM', 'C-value: 0.1', 'Train accuracy: 0.836', 'Test accuracy: 0.811']

Now, I understand the concept of the C-parameter: it applies more or less regularization, thereby determining the trade-off between generalization and training set performance. However, looking at the output above, I don't completely get the intuition.

In Logistic Regression, we observe that a relatively high C-value (i.e. less regularization, more risk of overfitting) gives the best results. Since the data contains a relatively large number of features (30), this confirms the intuition that relatively complex models, with more emphasis on the training data, perform better here.

In SVM, I don't completely get why a relatively low C-value (0.001) also improves the training score compared to a higher C-value (e.g. 0.01). The test score makes sense: with more regularization (and hence softer margins), generalization improves. But how do we explain that the training score improves as well, even though we put less emphasis on it?

Viol1997
  • Are the algorithms converging to the optimal solution? By reproducing your snippet, for me they are not; in such a case I would suggest having a look [here](https://stackoverflow.com/questions/52670012/convergencewarning-liblinear-failed-to-converge-increase-the-number-of-iterati). – amiola Feb 04 '21 at 13:22
  • 1
    @amiola I checked your link and even if I increased max_iter to 1000000, it kept on giving the error. However, setting dual=False fixed the problem (as suggested in the same link if n_samples > n_features), and now the C-values also make more sense. – Viol1997 Feb 04 '21 at 15:17
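The dual=False fix from the comment thread can be sketched as follows; this is a minimal reproduction assuming scikit-learn's breast cancer data and an arbitrary C, not the asker's exact split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# dual=False solves the primal problem; the linked answer recommends it
# when n_samples > n_features (here 426 samples vs. 30 features),
# and it avoids the liblinear convergence warning.
svc = LinearSVC(C=0.01, dual=False, random_state=42).fit(X_train, y_train)
print(f"Test accuracy: {svc.score(X_test, y_test):.3f}")
```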

1 Answer


You shouldn't necessarily expect a better or worse result based on C alone, without considering the data you are working with. It all depends on the data; otherwise, why would there be a parameter that can be tuned to the data at hand?

That said, you need to run your ML model on different train/test splits to make sure there is no luck involved in the results you are seeing. You can apply k-fold cross-validation to measure the mean and standard deviation of your scores.
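A minimal sketch of the k-fold suggestion above, assuming scikit-learn's breast cancer data; the 5 folds, C=1, and max_iter value are illustrative choices, not part of the answer:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each fold trains on 4/5 of the data and scores on the held-out 1/5,
# so mean and std reflect variability across splits rather than one lucky split.
clf = LogisticRegression(C=1, max_iter=10000, random_state=42)
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f}, std: {scores.std():.3f}")
```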

So, say at the end of your investigation you find that a very low C in SVM (sklearn) gives good results. By the definition of C in the sklearn package, this means the regularization should be stronger. That in turn suggests there is a good amount of data points with a high chance of misclassification, and you are letting your model tolerate them. By contrast, if I have a very, very red strawberry and a very, very green cucumber, there is probably no need to penalize misclassification: taking shape and color as the two features, the classes are completely distinct, and no such misclassification will occur in the train or test set.
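The strawberry/cucumber point can be illustrated on synthetic data: with cleanly separable classes, varying C barely changes the fit. The make_blobs centers and the C values below are assumptions chosen for illustration:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Two well-separated clusters standing in for "very red" vs. "very green".
X, y = make_blobs(n_samples=200, centers=[(-5, -5), (5, 5)],
                  cluster_std=1.0, random_state=0)

# Across six orders of magnitude of C, training accuracy stays essentially
# perfect: with no points near the boundary, the penalty term has nothing to do.
for c in [0.001, 1, 1000]:
    acc = LinearSVC(C=c, dual=False, random_state=0).fit(X, y).score(X, y)
    print(f"C={c}: train accuracy {acc:.3f}")
```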

Nima S
  • That makes sense, yes. So if we had a very clearly linearly separable dataset (e.g. classes: apples and bananas; features: color and shape), lowering the C-parameter from its default value should not change the results much either way. – Viol1997 Feb 04 '21 at 16:23
  • 1
    I think in that case there should be no penalty and thus, no regularization. However, I think changing C still affects (probably worsen accuracy, not sure though) because if there is a distinct feature, but you force your model to be more relaxed via regularization, you are mixing data close to each other while they can be easily separated via no regularization. Therefore, your model may not perform well on the test set because it cannot understand well the difference between features. – Nima S Feb 04 '21 at 16:28