I am looking at the cross-validation code example found at https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation
It says:
CrossValidator begins by splitting the dataset into a set of folds which are used as separate training and test datasets. E.g., with k=3 folds, CrossValidator will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing.
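For context, the cv and training values used in the snippet below are built earlier on that same docs page, roughly like this (abridged here; the page uses a larger labeled training set, and I have added the Row and Vector imports that the pattern match further down needs):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.Row

// Labeled training documents: (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Pipeline: tokenize -> hashed term frequencies -> logistic regression.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Hyperparameter grid searched by cross-validation.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

// k-fold cross-validation (k = numFolds) over the pipeline and the grid.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)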
So I don't understand why, in the code, the data is split into separate training and test sets:
// Run cross-validation, and choose the best set of parameters.
val cvModel = cv.fit(training)
// Prepare test documents, which are unlabeled (id, text) tuples.
val test = spark.createDataFrame(Seq(
(4L, "spark i j k"),
(5L, "l m n"),
(6L, "mapreduce spark"),
(7L, "apache hadoop")
)).toDF("id", "text")
// Make predictions on test documents. cvModel uses the best model found (lrModel).
cvModel.transform(test)
.select("id", "text", "probability", "prediction")
.collect()
.foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
println(s"($id, $text) --> prob=$prob, prediction=$prediction")
}
Would it be possible to apply cross-validation and get predictions without separating the data? Something like the sketch below is what I have in mind.
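For example, fitting the CrossValidator on the whole labeled DataFrame and then scoring that same DataFrame, with no held-out test set (data here is a hypothetical DataFrame with id, text and label columns, not something from the docs):

// Hypothetical: fit the CrossValidator on the entire labeled dataset...
val cvModel = cv.fit(data)
// ...and get predictions back for that same data, without a separate test split.
cvModel.transform(data)
  .select("id", "text", "probability", "prediction")
  .show(false)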