
I have recently learned how supervised learning works: a model learns from a labeled dataset and then predicts labels for unlabeled data.

But I have a question: is it fine to retrain the model on its own predicted data, then predict the next unlabeled datum, and repeat the process?

For example, model M is trained on a labeled dataset D of 10 examples, and M then predicts a label for datum A. A, together with its predicted label, is added to D, and M is retrained. The process is repeated for the remaining unlabeled data.
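Here is a minimal sketch of the loop I mean, using scikit-learn and made-up data (both just for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(10, 2))           # the 10 labeled examples (dataset D)
y_labeled = (X_labeled[:, 0] > 0).astype(int)  # their labels
X_unlabeled = rng.normal(size=(5, 2))          # data still to be predicted

model = LogisticRegression()
for i in range(len(X_unlabeled)):
    model.fit(X_labeled, y_labeled)        # (re)create model M from D
    x = X_unlabeled[i:i + 1]               # next unlabeled datum A
    pred = model.predict(x)                # M predicts a label for A
    X_labeled = np.vstack([X_labeled, x])  # add A with its predicted label to D
    y_labeled = np.append(y_labeled, pred)
```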

hitechnet

2 Answers


What you are describing here is a well-known technique called (among other names) "self-training", a form of semi-supervised learning. See for example these slides: https://www.cs.utah.edu/~piyush/teaching/8-11-print.pdf. There are hundreds of modifications built around this idea. Unfortunately, in general it is hard to prove that it should help, so while it will help on some datasets it will hurt on others. The main criterion here is the quality of the very first model, since self-training is based on the assumption that your original model is really good, and thus can be trusted enough to label new examples. It might help with slow concept drift given a strong model, but it will fail miserably with weak models.
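Because of this, implementations usually add only the pseudo-labels the model is confident about. Here is a minimal sketch using scikit-learn's SelfTrainingClassifier; the dataset, base estimator, and threshold are illustrative choices, not prescriptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
y_partial = y.copy()
y_partial[50:] = -1  # scikit-learn marks unlabeled examples with -1

# probability=True lets the wrapper judge how confident each pseudo-label is;
# only predictions above `threshold` get added to the labeled set.
self_training = SelfTrainingClassifier(SVC(probability=True), threshold=0.9)
self_training.fit(X, y_partial)
print(accuracy_score(y, self_training.predict(X)))  # check against the true labels
```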

lejlot

What you describe is called online machine learning, incremental supervised learning, or updateable classification. There are a bunch of algorithms that implement this behavior; see for example the Weka toolbox's Updateable Classifiers. I suggest looking at the following ones (a Python sketch of the same idea follows the list).

  • HoeffdingTree
  • IBk
  • NaiveBayesUpdateable
  • SGD
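The Weka classes above are Java; as a rough Python analogue of the same incremental idea, any scikit-learn estimator that supports partial_fit (e.g. SGDClassifier) can be updated one labeled batch at a time. The simulated stream below is hypothetical:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier()
classes = np.array([0, 1])  # all classes must be declared on the first call

for step in range(100):  # each step, new examples arrive WITH their true labels
    X_batch = rng.normal(size=(10, 2))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)

print(model.predict(rng.normal(size=(3, 2))))  # predict on fresh data
```

Note that in this setting the true label for each new example arrives with it; that distinction is exactly what the comments below turn on.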
Atilla Ozgur
  • Sorry, but those topics are irrelevant. Online learning is for cases where, at each step, you must predict an outcome, and, following that, you get the correct label. This is unrelated to the question, where the OP is simply suggesting to re-feed the algorithm with a training set augmented by the labels obtained by the original model. There is no further stage where the true labels are revealed. – Ami Tavory Sep 02 '16 at 15:27
  • @AmiTavory I suggest re-reading what online learning is. That is what he is describing. – Atilla Ozgur Sep 02 '16 at 15:29
  • Just to make sure, with what exactly in my above comment do you disagree: 1. contrary to what I think, the OP is describing a situation where, iteratively, more labeled data is coming in. 2. contrary to what I think, the online algorithms you mentioned continue updating the prediction even if no more labels are coming in (only independent-variable instances are being added). – Ami Tavory Sep 02 '16 at 15:37
  • 1
    @Ami is right, this does not address OPs question, he tries to do (self)semi-supervised learning, not just incremental learning. – lejlot Sep 02 '16 at 16:54
  • @Ami is right, the question is about semi-supervised learning, not online learning. Online learning algorithms might be used to *implement* semi-supervised learning algorithms, though. – Niki Sep 02 '16 at 17:26