
I need to build a yes/no classifier. The problem is that the data does not come from me, so I have to work with what I have been given. I have around 150 samples; each sample contains 3 features, which are continuous numeric variables. I know the dataset is quite small. I would like to ask you two questions:

A) What would be the best machine learning algorithm for this? An SVM? A neural network? Everything I have read seems to require a big dataset.

B) I could make the dataset a little bit bigger by adding some samples that do not contain all the features, only one or two. I have read that you can use sparse vectors in this case; is this possible with every machine learning algorithm? (I have seen them used with SVMs.)

Thanks a lot for your help!!!

Kailegh
  • Can you include plots visualising the data distribution, e.g., two-dimensional scatter plots colored by class membership? Any attempt to answer without them is just guessing – CAFEBABE May 26 '17 at 23:10
  • I will not receive the data until some point next week; I am currently preparing the algorithm, sorry. As soon as I have them I will post them – Kailegh May 27 '17 at 10:38

2 Answers


My recommendation is to use a simple and straightforward algorithm, such as a decision tree or logistic regression, although the ones you mention should work equally well.

The dataset size shouldn't be a problem, given that you have far more samples than variables. But having more data always helps.
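To make this concrete, here is a minimal sketch of the logistic-regression route, assuming scikit-learn. Since the real dataset hasn't been posted yet, the data below is a synthetic stand-in with the stated shape (~150 samples, 3 continuous features, binary label):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 150 samples, 3 continuous features.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # hypothetical yes/no label

# Hold out 25% for testing, as discussed in the comments.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```

Swapping `LogisticRegression()` for `DecisionTreeClassifier()` or `SVC()` is a one-line change, so trying all three is cheap.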

shirowww
  • ok, thanks a lot, I will try both of them. With so few samples, should I still reserve 25% for testing? – Kailegh May 26 '17 at 19:57
  • by the way, I am now reading about SVMs, and there are a lot of types: SVM, SVR, nu-SVM, nu-SVR... Is there a paper or something that explains when each of them should be used? – Kailegh May 26 '17 at 20:41
  • Of course, you should use a convenient partition for testing. Another option is cross-validation, e.g. 10-fold cross-validation. – shirowww May 26 '17 at 21:33
  • Don't lose your head with the multiple variations of every algorithm. Focus on standard ones and with time and experience, variations will come naturally to your workflow. It's like trying to apply all the different types of [insert your favorite algorithm here]. Let's say neural networks... overkill. – shirowww May 26 '17 at 21:37
  • ok, I will take a look at cross-validation and start trying different algorithms, thanks a lot for your answers. Regarding the part where I have incomplete samples, should I use them somehow? – Kailegh May 27 '17 at 10:37
  • You could use them, but they will be less valuable, due to incomplete information. The usual procedure is to fill in the missing values with the mean or median, but that introduces 'noise' because of the missing information. My advice is to stick to the full samples only, unless absolutely necessary. – shirowww May 27 '17 at 16:49
  • yeah, I have thought about the possibility of computing the median value. The thing is that if I only use full samples, my dataset might be reduced to 80 (I haven't received it yet). I think I will try the three possibilities and check which gets the best results; later I will comment my results here – Kailegh May 28 '17 at 15:55
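The cross-validation and median-imputation suggestions from the comments above can be sketched together, again assuming scikit-learn and synthetic stand-in data. Putting the imputer inside a pipeline matters: each CV fold then computes its own medians from its training portion only, avoiding leakage from the held-out fold.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in: 150 samples, 3 continuous features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = (X[:, 0] > 0).astype(int)

# Knock out ~10% of feature values to mimic the partial samples.
mask = rng.random(X.shape) < 0.1
X[mask] = np.nan

# Median imputation happens per fold, inside the pipeline.
model = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())
scores = cross_val_score(model, X, y, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```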

Naive Bayes is a good choice for a situation when there are few training examples. When compared to logistic regression, it was shown by Ng and Jordan that Naive Bayes converges towards its optimum performance faster with fewer training examples. (See section 4 of this book chapter.) Informally speaking, Naive Bayes models a joint probability distribution that performs better in this situation.

Do not use a decision tree in this situation. Decision trees have a tendency to overfit, a problem that is exacerbated when you have little training data.
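A minimal sketch of comparing Gaussian Naive Bayes against logistic regression under cross-validation, assuming scikit-learn; the data is a synthetic stand-in, since the real dataset isn't posted. `GaussianNB` is the Naive Bayes variant suited to continuous numeric features like these:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in: 150 samples, 3 continuous features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Compare both classifiers with 10-fold cross-validation.
for name, clf in [("GaussianNB", GaussianNB()),
                  ("LogisticRegression", LogisticRegression())]:
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```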

stackoverflowuser2010