
Recently, I have been working on some projects and obtained 30 positive samples and 30 negative samples. Each of them has 128 features (i.e., is 128-dimensional).

I used "LeaveOneOut" and "sklearn.linear_model.LogisticRegression" to classify these samples and obtained a satisfactory result (AUC 0.87). I told my friend the results, and he asked how I could compute the parameters with only 60 samples, given that the dimension of the feature vectors (128) is larger than the number of samples.

Now I have the same question. I checked the source code of the toolkit and still have no idea. Could someone help me with this? Thanks!
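To make the setup concrete, here is a minimal sketch of the kind of evaluation described above, using synthetic data in place of the real 60×128 samples (which are not shown here):

```python
# Sketch of the described setup: leave-one-out cross-validation with
# logistic regression, scoring by AUC. The data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 128))        # 60 samples, 128 features
y = np.array([1] * 30 + [0] * 30)     # 30 positive, 30 negative

scores = np.empty(60)
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = LogisticRegression(max_iter=1000)   # L2-regularized by default
    clf.fit(X[train_idx], y[train_idx])
    scores[test_idx] = clf.predict_proba(X[test_idx])[:, 1]

print("AUC:", roc_auc_score(y, scores))
```

With purely random features the AUC should hover near chance; the point is only to show the shape of the loop, not the real result.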

Ping Luo
  • I think it's because of your particular data. Your data happens to be convenient; other data could give you worse results. – sergzach Nov 07 '17 at 21:46

1 Answer


The situation you have laid out is a common one in machine learning applications: a limited number of training examples compared to the number of features (i.e. m < n). Since you are dealing with a classification problem, your algorithm outputs either a positive or negative hypothesis given your feature input. It would help to know the training set error compared to the cross-validation set error and test set error for your analysis. If you could post your code, that would help in explaining some further details.

Based on a quick Google search of sklearn.linear_model.LogisticRegression, it appears that it implements regularized logistic regression using L2 regularization by default. I would encourage you to look into the following pertaining to regularization:

https://en.wikipedia.org/wiki/Tikhonov_regularization
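As a small illustration (again with synthetic data) of why the L2 penalty keeps the fit well-posed even when there are more features than samples: as the regularization strength increases (smaller C in scikit-learn's convention), the norm of the learned weight vector shrinks.

```python
# Stronger L2 regularization (smaller C) shrinks the weight vector,
# which is what keeps the problem well-posed when n_features > n_samples.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 128))
y = np.array([1] * 30 + [0] * 30)

norms = []
for C in (100.0, 1.0, 0.01):          # C is the inverse penalty strength
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms.append(np.linalg.norm(clf.coef_))
    print(f"C={C}: ||theta|| = {norms[-1]:.4f}")
```

The printed norms decrease as C decreases, showing the penalty pulling the parameters toward zero.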

I would also recommend reading into the bias/variance discussion as it pertains to underfitting and overfitting your dataset:

https://datascience.stackexchange.com/questions/361/when-is-a-model-underfitted

rahlf23
  • Thank you for your answer. My code is similar to the example code at http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html. The only difference is the input data, which is 60*128 in my case. It seems that the regularization solves the problem. I'll try to learn more about that. – Ping Luo Nov 06 '17 at 16:41
  • Ok, then I would read the links above on regularization and bias/variance as they pertain to your problem. With a large number of features and a comparable number of training examples (in your case, ~50%), the algorithm can begin to identify features that play a role in determining a positive or negative hypothesis, as well as features that do not contribute heavily (a small weight in the parameter vector theta). – rahlf23 Nov 06 '17 at 16:44
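The feature weights mentioned in the comment above can be inspected directly after fitting. A short sketch (synthetic data again) of ranking features by the magnitude of their learned weights in `clf.coef_` (the parameter vector theta):

```python
# Rank features by the magnitude of their learned L2-regularized weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 128))
y = np.array([1] * 30 + [0] * 30)

clf = LogisticRegression(max_iter=1000).fit(X, y)
weights = np.abs(clf.coef_.ravel())      # one weight per feature
ranked = np.argsort(weights)[::-1]       # most influential first
print("top 5 features by |weight|:", ranked[:5])
```

Features at the bottom of this ranking are the ones the comment describes as contributing little to the decision.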