In the post "The 'Cross-Validation - Train/Predict' misunderstanding" by Patrick Schratz

https://mlr-org.com/docs/cv-vs-predict/

it is stated that:

(a) CV is done to get an estimate of a model’s performance.

(b) Train/predict is done to create the final predictions (which your boss might use to make some decisions on).

Does this mean that in mlr3, if we are in academia and need to publish papers, we should use CV, since we intend to compare the performance of different algorithms? And in industry, if our plan is to train a model once and then use it again and again on industry data to make predictions, should we use the train/predict methods provided by mlr3?

Or have I misunderstood this completely?

Thank you

khan1

1 Answer

You always need a CV if you want to make a statement about a model's performance.

If you want to use the model to make predictions on new, unseen data, fit it once on all of your labeled data and then predict.

So in practice, you need both: CV + "train+predict".
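
As a rough illustration of those two steps, here is a minimal sketch in plain Python. It is not mlr3 code: the toy dataset and the helper names `fit`, `predict`, `rmse` and `cross_validate` are all made up for this example, and the "model" is just a hand-written one-variable least-squares fit.

```python
import random


def fit(xs, ys):
    """Fit y = a + b*x by ordinary least squares and return (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, xs):
    """Apply a fitted (intercept, slope) model to new x values."""
    intercept, slope = model
    return [intercept + slope * x for x in xs]


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def cross_validate(xs, ys, k=5, seed=0):
    """Estimate performance with k-fold CV: every fold is held out exactly once."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        preds = predict(model, [xs[i] for i in test_idx])
        scores.append(rmse([ys[i] for i in test_idx], preds))
    return sum(scores) / k


# Hypothetical labeled data: y = 2x + 1 plus Gaussian noise.
rng = random.Random(42)
xs = [float(i) for i in range(40)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

# Step 1: CV gives the performance estimate you report.
cv_rmse = cross_validate(xs, ys, k=5)

# Step 2: a single fit on ALL labeled data gives the model you actually use.
final_model = fit(xs, ys)
new_predictions = predict(final_model, [40.0, 41.0, 42.0])

print("CV estimate of RMSE:", round(cv_rmse, 3))
print("Predictions for new data:", [round(p, 2) for p in new_predictions])
```

Note that the models fitted inside `cross_validate` are thrown away. Only the score is kept, and the model used for the real predictions comes from the separate fit on the full dataset.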

PS: Your post is not really a good fit for Stack Overflow since it is not related to a coding problem. For statistical questions, please see https://stats.stackexchange.com/.

PS2: If you talk about a post, please include the link. I am the author of the post in this case, but most other people will not know what you are talking about ;)

pat-s
  • Thanks pat-s, I edited my post and included the link. – khan1 Feb 07 '21 at 20:53
  • But isn't it a fact that in academia (where we have to compare the performance of an algorithm with a few others), CV is widely used, since we use multiple publicly available datasets and multiple classifiers? – khan1 Feb 07 '21 at 20:56
  • This really depends on what you want to achieve in "academia". If your goal is just to compare algorithms across some datasets, then using just CV (nested CV to avoid bias) is enough. However, if you want to build a new model and use it for actual predictions in order to make some decisions, then after you have performed all of the CVs to see which model is best, you fit that model on all of the labeled data and use it to predict unlabeled data. – missuse Feb 08 '21 at 10:48
  • @missuse, OK, I got your point. It means that if we have to compare algorithms and point out the best one to the community (as people usually do for publications in academia), we should use CV. If we have to use a model for predicting some values (probably industry data), we first use CV (to find the best model) and then select that model for predicting our data? – khan1 Feb 12 '21 at 15:01
  • One suggestion, if I may and if it is feasible: wouldn't it be a good idea to have a public forum or mailing list specifically for mlr? mlr3 is new and a lot of us are new to it, so a public forum would be helpful. Updates about mlr3 could also be discussed there. – khan1 Feb 12 '21 at 15:06
  • All your questions here are generic to modeling and not specific to mlr3. They have been answered hundreds of times in many questions and forum posts. I suggest gaining more experience from such resources to better understand the bigger picture. When it comes to implementing your ideas with code in mlr3 (or another framework), Stack Overflow is a great place to ask. Alternatively, open a GitHub issue. – pat-s Feb 12 '21 at 15:18
  • CV is usually used when the model has many parameters and we have few observations. In the case of simple models with many observations, CV is usually neither done nor needed. – skan Aug 14 '22 at 11:24
  • @skan This does not sound right to me. Can you add some references to back your claim? Why would CV be related to the number of parameters or observations? Every model requires a fair and unbiased performance estimation. Also, FWIW, a model is not "simple" only because it has few observations or parameters. – pat-s Aug 14 '22 at 22:15