Switching a bunch of for-loop code to use a parallel stream is apparently causing a certain part of the code to be ignored.
I'm using MOA and Weka with Java 11 to run a simple recommendation engine example, taking cues from the source code of moa.tasks.EvaluateOnlineRecomender, which uses MOA's internal task setup to test the accuracy of the Biased Regularized Incremental Simultaneous Matrix Factorization (BRISMF) implementation provided by MOA. Instead of using MOA's prepared MovielensDataset class, I switched over to Weka's Instances so that I can apply Weka's ML tools later on.
Processing about a million instances (I'm using the Movielens 1M dataset) took about 13-14 minutes. In a bid to see improvements, I wanted to run it on a parallel stream, and became suspicious when the task finished in about 40 seconds. I found that BRISMFPredictor.predictRating was always producing 0 within the parallel stream's body. Here's the code for either case:
Code for initialisation:
import com.github.javacliparser.FileOption;
import com.github.javacliparser.IntOption;
import moa.options.ClassOption;
import moa.recommender.predictor.BRISMFPredictor;
import moa.recommender.predictor.RatingPredictor;
import moa.recommender.rc.data.RecommenderData;
import weka.core.converters.CSVLoader;
...
private static ClassOption datasetOption;
private static ClassOption ratingPredictorOption;
private static IntOption sampleFrequencyOption;
private static FileOption defaultFileOption;
static {
    ratingPredictorOption = new ClassOption("ratingPredictor",
            's', "Rating Predictor to evaluate on.", RatingPredictor.class,
            "moa.recommender.predictor.BRISMFPredictor");
    sampleFrequencyOption = new IntOption("sampleFrequency",
            'f', "How many instances between samples of the learning performance.", 100, 0, Integer.MAX_VALUE);
    defaultFileOption = new FileOption("file",
            'f', "File to load.",
            "C:\\Users\\shiva\\Documents\\Java-ML\\mlapp\\data\\ml-1m\\ratings.dat", "dat", false);
}
... and inside main() (a quirk with Weka's CSVLoader required that I replace the default :: delimiter with +):
var csvLoader = new CSVLoader();
csvLoader.setSource(defaultFileOption.getFile());
csvLoader.setFieldSeparator("+");
var dataset = csvLoader.getDataSet();
System.out.println(dataset.toSummaryString());
var predictor = new BRISMFPredictor();
predictor.prepareForUse();
RecommenderData data = predictor.getData();
data.clear();
data.disableUpdates(false);
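For anyone reproducing this, the delimiter replacement mentioned above can be done roughly as follows. This is only a sketch of one way to do it, not necessarily how I did it: the class name DelimiterRewrite and both file paths are hypothetical, and the sample line in the comment is just an illustration of the UserID::MovieID::Rating::Timestamp layout.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class DelimiterRewrite {

    // Replaces every "::" in a line with "+",
    // e.g. "1::1193::5::978300760" becomes "1+1193+5+978300760".
    static String rewriteLine(String line) {
        return line.replace("::", "+");
    }

    // Reads the source file line by line and writes a copy with the new delimiter.
    static void rewriteFile(Path source, Path target) throws IOException {
        List<String> rewritten;
        try (var lines = Files.lines(source)) {
            rewritten = lines.map(DelimiterRewrite::rewriteLine)
                    .collect(Collectors.toList());
        }
        Files.write(target, rewritten);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java DelimiterRewrite <source file> <target file>
        if (args.length == 2) {
            rewriteFile(Path.of(args[0]), Path.of(args[1]));
        }
    }
}
```

The rewritten file is then what gets passed to CSVLoader with setFieldSeparator("+").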
Now, alternating between the following snippets:
for (var instance : dataset) {
    var user = (int) instance.value(0);
    var item = (int) instance.value(1);
    var rating = instance.value(2);
    double predictedRating = predictor.predictRating(user, item);
    System.out.printf("User %d | Movie %d | Actual Rating %d | Predicted Rating %f%n",
            user, item, Math.round(rating), predictedRating);
}
And the parallel version (bear in mind that I'm a noob in everything concurrent):
dataset.parallelStream().forEach(instance -> {
    var user = (int) instance.value(0);
    var item = (int) instance.value(1);
    var rating = instance.value(2);
    double predictedRating = predictor.predictRating(user, item);
    System.out.printf("User %d | Movie %d | Actual Rating %d | Predicted Rating %f%n",
            user, item, Math.round(rating), predictedRating);
});
Now I decide that heck, maybe this operation can't be done in parallel, and I switch it to use stream(). Even then, the segment seems to be completely ignored, since the output is again 0.0 each time:
dataset.stream().forEach(instance -> {
    var user = (int) instance.value(0);
    var item = (int) instance.value(1);
    var rating = instance.value(2);
    double predictedRating = predictor.predictRating(user, item);
    System.out.printf("User %d | Movie %d | Actual Rating %d | Predicted Rating %f%n",
            user, item, Math.round(rating), predictedRating);
});
I have tried removing the print statement from the run, but to no avail.
Obviously, I get the expected output lines consisting of actual and predicted rating within about 13 minutes in the first case, but find that the predicted rating is 0.0 in the second case with suspiciously low execution time. Is there something I'm missing out on?
EDIT: using dataset.forEach() does the same thing. Perhaps a quirk of lambdas?
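To test the lambda theory in isolation, here is a minimal sketch that runs the same body through a for loop, forEach(), stream() and parallelStream() and counts the non-zero results. Nothing in it touches MOA or Weka: fakePredict is a hypothetical stand-in for predictor.predictRating, and the (user, item) pairs are made up. It only checks that each iteration style invokes the body once per element, so it says nothing about what BRISMFPredictor does internally.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class IterationCheck {

    // Hypothetical stand-in for predictor.predictRating(user, item); not MOA code.
    // Strictly positive for positive ids, so a zero would mean the body was skipped.
    static double fakePredict(int user, int item) {
        return user * 0.5 + item * 0.25;
    }

    // Builds n made-up (user, item) pairs with ids starting at 1.
    static List<int[]> samplePairs(int n) {
        return IntStream.rangeClosed(1, n)
                .mapToObj(i -> new int[] {i, i + 1})
                .collect(Collectors.toList());
    }

    // Plain enhanced for loop.
    static int viaForLoop(List<int[]> pairs) {
        int nonZero = 0;
        for (int[] p : pairs) {
            if (fakePredict(p[0], p[1]) != 0.0) {
                nonZero++;
            }
        }
        return nonZero;
    }

    // Iterable.forEach with a lambda.
    static int viaForEach(List<int[]> pairs) {
        AtomicInteger nonZero = new AtomicInteger();
        pairs.forEach(p -> {
            if (fakePredict(p[0], p[1]) != 0.0) {
                nonZero.incrementAndGet();
            }
        });
        return nonZero.get();
    }

    // Sequential or parallel stream; AtomicInteger keeps the counter thread-safe.
    static int viaStream(List<int[]> pairs, boolean parallel) {
        AtomicInteger nonZero = new AtomicInteger();
        Stream<int[]> stream = parallel ? pairs.parallelStream() : pairs.stream();
        stream.forEach(p -> {
            if (fakePredict(p[0], p[1]) != 0.0) {
                nonZero.incrementAndGet();
            }
        });
        return nonZero.get();
    }

    public static void main(String[] args) {
        List<int[]> pairs = samplePairs(1000);
        System.out.println("for loop:       " + viaForLoop(pairs));
        System.out.println("forEach:        " + viaForEach(pairs));
        System.out.println("stream:         " + viaStream(pairs, false));
        System.out.println("parallelStream: " + viaStream(pairs, true));
    }
}
```

All four counts should match the number of pairs, which suggests the lambda body is not being skipped as such and that the difference lies elsewhere in my actual runs.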