
How does one select a sample size and sample set (for training and testing) for a binary classification problem to be solved by applying supervised learning?

The current implementation is based on 15 binary features, which we may expand to 20 or possibly 24 binary features in order to improve accuracy metrics. Classification is currently done via a lookup in a decision table, which we would like to replace with a machine learning classifier. Part of the goal is also to gauge our current accuracy metrics.

a) What is the minimal sample size to choose for supervised training so as to balance the desired accuracy against cost? b) How do we select the actual samples to use for the training/test sets?

Computational learning theory defines a minimal sample size given the hypothesis space, a desired error threshold, and the probability of keeping the error below that threshold. Please provide an explanation and possible examples applying the formulas.
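For concreteness, the bound I have in mind is the one for a consistent learner over a finite hypothesis space H: m ≥ (1/ε)(ln|H| + ln(1/δ)). A rough calculation for 15 binary features (the two hypothesis-space sizes below are illustrative assumptions, not necessarily our actual model class):

```python
import math

def pac_sample_size(ln_H, epsilon, delta):
    """PAC sample-complexity bound for a consistent learner over a
    finite hypothesis space H:  m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((ln_H + math.log(1.0 / delta)) / epsilon)

n = 15                 # number of binary features
eps, delta = 0.05, 0.05

# Unrestricted boolean functions over n features: |H| = 2^(2^n)
m_full = pac_sample_size((2 ** n) * math.log(2), eps, delta)

# Conjunctions of literals (each feature present, negated, or absent): |H| = 3^n
m_conj = pac_sample_size(n * math.log(3), eps, delta)

print(m_full)   # hundreds of thousands of samples -- impractical
print(m_conj)   # a few hundred samples -- feasible
```

The contrast illustrates why the choice of hypothesis space dominates the bound: an unrestricted decision table over 15 binary features is hopeless to sample for, while a restricted model class brings the bound down to a few hundred labeled examples.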

The labels for the training/test set will be collected via human decisions, so there is obviously a cost involved in building this sample set. Funding the project becomes harder when cost and benefit cannot easily be put down on paper.

desertnaut
Parag Ahire
  • Sorry, but IMHO this is a tricky question that very much depends on the circumstances. In any case, I don't think you'll find this is the right site for it. – Ami Tavory Jun 10 '15 at 23:13
  • The (a)-part needs some more clarification: do you already have a data set and want to know how large the training portion of that set should be, OR do you not have any data and want to know how much you need to collect? The answer to the (b)-part is simple: you should divide the whole data as randomly as you can. This gives you approximately the same distribution over classes in both training and test sets. – Jindra Helcl Jun 17 '15 at 12:10
  • (a) is about computational learning theory. Which of the various formulas do I apply for a binary classification problem with n binary features to determine the minimum sample size for the supervised training set, given a desired error rate epsilon and delta as the probability of failing to keep the error rate below epsilon? The distribution of class 1 and class 2 in the binary classification decision has to play some role in what samples get fed to the training set. How does one go about selecting the set itself as well as the samples themselves for n binary features? – Parag Ahire Jun 18 '15 at 14:04
  • Further, there is a cost involved with supervised decisions for learning. How does one balance the cost/benefit by selecting the optimal sample set (not just a minimal set) for achieving the desired error rate epsilon and probability delta of keeping the error below that rate? This without having to go through multiple iterations, so as to minimize project costs? – Parag Ahire Jun 18 '15 at 14:12
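The random, class-balanced split suggested in the comments can be sketched in plain Python (the helper name and toy data are my own):

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split indices into train/test so that each class appears in the
    test set in roughly the same proportion as in the full data set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                        # randomize within each class
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

# Toy example: 100 samples with an 80/20 class imbalance
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
# Both halves preserve the 80/20 class ratio.
```

Stratifying rather than splitting purely at random matters most when one class is rare; a purely random 20% test split could otherwise end up with very few (or zero) minority-class examples.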

1 Answer


There is no easy way to determine a minimal sample size, since there are no hard and fast rules regarding sample sizes in machine learning. Many classifiers can be applied to binary classification (e.g. SVM), and a number of sampling techniques can be applied, depending on the structure of the data, the underlying system, and the aims of the analysis.

Your reference to the selection of the set itself is somewhat confusing: are you asking how to determine the minimum amount of data required to build an accurate classifier? The answer depends on the classifier being used and its learning ability. Also, models trained on smaller data sets may not generalize as well as those trained on larger sets, even if you get adequate error rates, so if you are primarily interested in accurate classification of previously unseen records, you will want to keep this in mind.

As for selecting a training sample set, this depends on the structure of the data and the sampling method used. You may also prefer to use cross-validation techniques when training the model, to guard against over-fitting.
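The cross-validation suggestion at the end can be sketched with plain Python (the function name is my own; in practice a library routine would do the same job):

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation.
    Each sample lands in the validation fold exactly once across the k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

# Example: 5-fold CV over 100 samples; fit a classifier on `train`,
# score it on `val`, and average the k scores for a stabler error estimate.
for train, val in k_fold_indices(100, k=5):
    pass  # fit/score your classifier here
```

Averaging the k validation scores gives a less optimistic (and less variable) error estimate than a single train/test split, which is particularly useful when labeled data is expensive, as in your case.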