I am taking part in a research project on a deep learning application for classification. I have a huge dataset containing over 35,000 features; these are all good values, taken from the laboratory.

The idea is that I should create a classifier that, given a new input, tells whether the data looks good or not. I must use deep learning with Keras and TensorFlow.

The problem is that the data is not labeled. I plan to add a new column with 1 for good and 0 for bad. But how can I label an entry as bad, given that the whole training set is good?

I have thought about generating some garbage data, but I don't know if this is a good idea - I don't even know how I would generate it. Do you have any tips?

Vanhaeren Thomas

1 Answer

I would start with anomaly detection. You can first reduce the features with e.g. a (stacked) autoencoder and then use Local Outlier Factor (LOF) from scikit-learn: https://scikit-learn.org/stable/modules/outlier_detection.html
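
For illustration, here is a minimal Keras sketch of the feature-reduction step. The layer sizes, the bottleneck dimension `encoding_dim`, and the `X_good` array are assumptions you would adapt to your own data:

```python
# Minimal stacked-autoencoder sketch for compressing ~35,000 features.
# All sizes and hyperparameters are illustrative, not tuned.
from tensorflow import keras
from tensorflow.keras import layers

n_features = 35000        # dimensionality from the question
encoding_dim = 64         # assumed bottleneck size; tune for your data

inputs = keras.Input(shape=(n_features,))
x = layers.Dense(1024, activation="relu")(inputs)
x = layers.Dense(256, activation="relu")(x)
encoded = layers.Dense(encoding_dim, activation="relu")(x)
x = layers.Dense(256, activation="relu")(encoded)
x = layers.Dense(1024, activation="relu")(x)
decoded = layers.Dense(n_features, activation="linear")(x)

autoencoder = keras.Model(inputs, decoded)   # trained to reconstruct its input
encoder = keras.Model(inputs, encoded)       # used afterwards to compress samples

autoencoder.compile(optimizer="adam", loss="mse")
# X_good: your (n_samples, 35000) array of good lab data, scaled beforehand
# autoencoder.fit(X_good, X_good, epochs=50, batch_size=32, validation_split=0.1)
```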

The reason you need to reduce the features first is that LOF will be much more stable in a lower-dimensional space.
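
Once the encoder is trained, the LOF step might look like this. Note that `novelty=True` is what lets you call `predict` on new, unseen points; `n_neighbors=20` is just the library default and should be tuned:

```python
# Fit LOF on the compressed "good" samples only; no bad examples needed.
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
# Z_good = encoder.predict(X_good)     # compressed good data from the sketch above
# lof.fit(Z_good)

# For a new measurement x_new (shape (35000,)):
# z_new = encoder.predict(x_new.reshape(1, -1))
# lof.predict(z_new)   # +1 if it fits the good data, -1 if it looks like an outlier
```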

Brecht Coghe
  • Thanks for your answer, I'll look into it. But there are no anomalies in the dataset - no "bad" data I can use for training. It's an incomplete dataset, if you want to look at it that way. I want to "complete" it with bad data so that the classifier can distinguish good from bad, but I don't know if it's a good idea. – Vanhaeren Thomas Feb 06 '19 at 16:48
  • You don't need to specify bad examples for LOF. – Brecht Coghe Feb 06 '19 at 17:00
  • That's the point of the algorithm: it models the distribution of the normal data (think of a field of RBFs), and then you can check whether a new point fits that distribution or not. – Matthieu Brucher Feb 06 '19 at 17:00