
Recently, I have been working on some projects and obtained 30 positive samples and 30 negative samples. Each of them has 128 features (i.e., is 128-dimensional).

I used "LeaveOneOut" and "sklearn.linear_model.LogisticRegression" to classify these samples and obtained a satisfactory result (AUC 0.87). I told my friend the results, and he asked how I could compute the parameters with only 60 samples, given that the dimension of the feature vectors (128) is larger than the number of samples.

Now I have the same question. I checked the source code of the toolkit and still have no idea. Could someone help me with this? Thanks!
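To make the setup concrete, here is a minimal sketch of the kind of evaluation described above, using synthetic data in place of the real 60×128 samples (which are not shown here):

```python
# Sketch of the described setup: leave-one-out cross-validation with
# logistic regression, scoring by AUC. The data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 128))        # 60 samples, 128 features
y = np.array([1] * 30 + [0] * 30)     # 30 positive, 30 negative

scores = np.empty(60)
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = LogisticRegression(max_iter=1000)   # L2-regularized by default
    clf.fit(X[train_idx], y[train_idx])
    scores[test_idx] = clf.predict_proba(X[test_idx])[:, 1]

print("AUC:", roc_auc_score(y, scores))
```

With purely random features the AUC should hover near chance; the point is only to show the shape of the loop, not the real result.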

Ping Luo
  • I think it's because of your particular data. Your data happens to be convenient; other data could give you worse results. – sergzach Nov 07 '17 at 21:46

1 Answer


The situation you have laid out is a common one in machine learning applications: a limited number of training examples compared to the number of features (i.e. m < n). Since you are dealing with a classification problem, your algorithm outputs either a positive or negative hypothesis given your feature input. It would help to know the training set error compared to the cross-validation set error and test set error for your analysis. If you could post your code, that would help in explaining some further details.

Based on a quick Google search of sklearn.linear_model.LogisticRegression, it appears that it implements regularized logistic regression using L2 regularization by default. I would encourage you to look into the following pertaining to regularization:

https://en.wikipedia.org/wiki/Tikhonov_regularization
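As a small illustration (again with synthetic data) of why the L2 penalty keeps the fit well-posed even when there are more features than samples: as the regularization strength increases (smaller C in scikit-learn's convention), the norm of the learned weight vector shrinks.

```python
# Stronger L2 regularization (smaller C) shrinks the weight vector,
# which is what keeps the problem well-posed when n_features > n_samples.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 128))
y = np.array([1] * 30 + [0] * 30)

norms = []
for C in (100.0, 1.0, 0.01):          # C is the inverse penalty strength
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms.append(np.linalg.norm(clf.coef_))
    print(f"C={C}: ||theta|| = {norms[-1]:.4f}")
```

The printed norms decrease as C decreases, showing the penalty pulling the parameters toward zero.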

I would also recommend reading into the bias/variance discussion as it pertains to underfitting and overfitting your dataset:

https://datascience.stackexchange.com/questions/361/when-is-a-model-underfitted

rahlf23
  • Thank you for your answer. My code is similar to the example code at http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html. The only difference is the input data, which is 60*128 in my case. It seems that the regularization solves the problem. I'll try to learn more about that. – Ping Luo Nov 06 '17 at 16:41
  • Ok, then I would read the links above on regularization and bias/variance as they pertain to your problem. With a large number of features and a comparable number of training examples (in your case, ~50%), the algorithm can begin to identify features that play a role in determining a positive or negative hypothesis, as well as features that do not contribute heavily (a small weight in the parameter vector theta). – rahlf23 Nov 06 '17 at 16:44
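The feature weights mentioned in the comment above can be inspected directly after fitting. A short sketch (synthetic data again) of ranking features by the magnitude of their learned weights in `clf.coef_` (the parameter vector theta):

```python
# Rank features by the magnitude of their learned L2-regularized weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 128))
y = np.array([1] * 30 + [0] * 30)

clf = LogisticRegression(max_iter=1000).fit(X, y)
weights = np.abs(clf.coef_.ravel())      # one weight per feature
ranked = np.argsort(weights)[::-1]       # most influential first
print("top 5 features by |weight|:", ranked[:5])
```

Features at the bottom of this ranking are the ones the comment describes as contributing little to the decision.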