How does one select a sample size and sample set (for training and testing) for a binary classification problem to be solved by applying supervised learning?
The current implementation is based on 15 binary features, which we may expand to 20 or possibly 24 in order to improve accuracy. Classification is currently done by a lookup in a hand-built decision table, which we would like to replace with a machine learning classifier. Part of the goal is also to gauge the accuracy of the current approach.
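To make the setting concrete, here is a minimal sketch of the kind of replacement we have in mind. The library, function names, and the choice of a decision tree are only illustrative assumptions, not our actual code:

```python
from sklearn.tree import DecisionTreeClassifier

def classify_by_table(features, decision_table):
    """Current approach: the binary feature tuple indexes a hand-built decision table."""
    return decision_table[tuple(features)]

def classify_by_model(features, model):
    """Proposed approach: a classifier trained on human-labelled samples."""
    return model.predict([features])[0]

# Hypothetical usage (X_train holds rows of 0/1 features, y_train the human labels):
# model = DecisionTreeClassifier().fit(X_train, y_train)
# label = classify_by_model(feature_vector, model)
```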
a) What is the minimal sample size to choose for supervised training so as to balance the desired accuracy against cost? b) How do we select the actual samples to use for the training and test sets?
Computational learning theory defines a minimal sample size in terms of the size of the hypothesis space and the desired probability of keeping the error below a certain threshold. Please provide an explanation and, if possible, examples of applying the formulas.
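As far as I understand, the relevant bound for a consistent learner over a finite hypothesis space H is m ≥ (1/ε)(ln|H| + ln(1/δ)). Below is a small Python sketch of how I would apply it to our feature counts; the ε and δ values and the two hypothesis-space choices are assumptions for illustration, not fixed requirements:

```python
import math

def pac_sample_size(ln_H, epsilon, delta):
    """PAC bound for a consistent learner over a finite hypothesis space H:
    m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((ln_H + math.log(1.0 / delta)) / epsilon)

epsilon = 0.05  # tolerated generalization error (assumed target)
delta = 0.05    # allowed probability of exceeding that error (assumed)

# Worst case: the learner may output *any* Boolean function of n binary
# features (an arbitrary decision table), so |H| = 2^(2^n) and ln|H| = 2^n * ln 2.
for n in (15, 20, 24):
    ln_H = (2 ** n) * math.log(2)
    print(f"arbitrary table, n={n}: m >= {pac_sample_size(ln_H, epsilon, delta)}")

# A restricted class, e.g. conjunctions of literals over 15 features, has |H| = 3^15,
# so the same bound gives a far smaller sample size.
ln_H = 15 * math.log(3)
print(f"conjunctions, n=15: m >= {pac_sample_size(ln_H, epsilon, delta)}")
```

If I read the bound correctly, the arbitrary-table case requires hundreds of thousands of labelled examples at these ε and δ values, while a restricted hypothesis class needs only a few hundred, which is exactly the trade-off I would like explained.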
The labels for the training/test set will be assigned by a human, so there is an obvious cost involved in collecting this sample set, and funding the project becomes harder when the cost and benefit cannot easily be put down on paper.