I have a training set that looks something like this.

features: categorical/numerical

output: binary 1/0

[1] feature[1][1] feature[1][2] ... feature[1][j]
[2] feature[2][1] feature[2][2] ... feature[2][j]
...
[i] feature[i][1] feature[i][2] ... feature[i][j]

Suppose some samples (rows) have "good" value combinations that are likely to yield similar output, whereas others have "bad" value combinations and are thus difficult to predict.

My goal is to improve final accuracy by getting rid of those bad samples that lack regularity. Can someone tell me what the best algorithm or preprocessing step would be to automatically detect those samples, so that only the good samples are used for training? Thank you in advance!

ENV: MXNet, R

k-hir
  • The training data should represent the real world data that you want to predict on in the future. You can't just ignore some of the data, as it will affect your future performance. If you can provide more information on the problem you are trying to solve with the ML model, it might make more sense. – Guy Jun 24 '17 at 14:36
  • The reason I want to ignore those samples is that they are never predictable. For example, say I have a mix of human-population and ape-population samples and there is no way to tell which is which manually. By removing the ape population I want to train a model that is specifically designed for human-sample prediction. – k-hir Jun 27 '17 at 01:46
  • But you need to build a classifier to predict which one is human and which one is an ape, either for the training data or for the future real-life data. Anyway, if you have a way to remove the noise from both the training and the real-life data, do that before you train the model or call the prediction model. – Guy Jul 04 '17 at 14:34
  • I would also start with building the ape/human classifier. You can try a really quick solution: split the data into 2 clusters with K-Means (k=2), and this might give you an ape/human classifier out of the box (see the sketch after these comments). – Viacheslav V Kovalevskyi Aug 01 '17 at 20:47
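
A minimal sketch of that quick check in base R, assuming your features are in a numeric matrix X (one-hot encode the categorical columns first):

    # Split rows into 2 clusters and inspect whether the split lines up
    # with the "predictable" vs. "unpredictable" groups you have in mind.
    set.seed(1)                        # for reproducible cluster assignments
    km <- kmeans(scale(X), centers = 2, nstart = 25)
    table(km$cluster)                  # sizes of the two candidate groups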

1 Answer

With deep-learning models you typically have enough degrees of freedom for the model to learn structure in the feature space that is useful for prediction. If there are two groups with different characteristics (e.g. apes and humans), and knowing the group is useful in making a prediction, the model should be able to learn this distinction on its own.
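
For example, a small multi-layer perceptron in MXNet's R API; mx.mlp and its arguments follow the MXNet R tutorial, but train.x/train.y and all hyperparameter values below are placeholders to adapt to your data:

    library(mxnet)
    # train.x: numeric matrix of features (one-hot encode categoricals first)
    # train.y: binary labels (0/1)
    mx.set.seed(0)
    model <- mx.mlp(train.x, train.y,
                    hidden_node = 10,            # placeholder layer size
                    out_node = 2,                # two output classes
                    out_activation = "softmax",  # outputs class probabilities
                    num.round = 20, array.batch.size = 15,
                    learning.rate = 0.07, momentum = 0.9,
                    eval.metric = mx.metric.accuracy)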

Additionally, if your ultimate aim is to classify, it is common in deep-learning models to have a softmax layer as the output, which can be interpreted as a probability of a given class; the higher this probability is, the more confident you can be in the prediction. You should calibrate and evaluate this probability as suggested in this paper.
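
Continuing the sketch above, you can read those probabilities off the trained model and flag low-confidence samples; the 0.9 cutoff is an arbitrary assumption to tune on a validation set, and thresholding is no substitute for proper calibration:

    probs <- predict(model, test.x)    # 2 x n matrix of per-class softmax probabilities
    conf  <- apply(probs, 2, max)      # confidence = highest class probability per sample
    pred  <- max.col(t(probs)) - 1     # predicted label (0/1)

    # Flag uncertain samples for inspection instead of silently dropping them.
    uncertain <- which(conf < 0.9)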

On the other hand, if you're looking to apply simpler models (e.g. linear models), you may want to perform unsupervised learning beforehand and include the result as a categorical feature in your model. As suggested by Viacheslav, a clustering algorithm like K-Means could work for your dataset; otherwise you may want to look at Gaussian Mixture Models or DBSCAN (sketched below).
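
A sketch of that preprocessing step in R, where X is again an assumed numeric feature matrix, and mclust and dbscan are the usual CRAN packages for the two alternatives:

    set.seed(1)
    km <- kmeans(scale(X), centers = 2, nstart = 25)
    X_aug <- data.frame(X, cluster = factor(km$cluster))  # cluster id as a categorical feature

    # Alternatives:
    # library(mclust); gmm <- Mclust(scale(X), G = 2)     # Gaussian Mixture Model
    # library(dbscan); db  <- dbscan(scale(X), eps = 0.5) # DBSCAN; eps needs tuning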

Thom Lane