
After I switched some for-loop code to a parallel stream, part of the loop body appears to be ignored.

I'm using MOA and Weka with Java 11 to run a simple recommendation engine example, taking cues from the source code of moa.tasks.EvaluateOnlineRecommender, which uses MOA's internal task setup to test the accuracy of the Biased Regularized Incremental Simultaneous Matrix Factorization (BRISMF) implementation provided by MOA. Instead of using MOA's prepared MovielensDataset class, I switched over to Weka's Instances so that I could later apply Weka's ML tools.

Processing about a million instances (I'm using the Movielens 1M dataset) took about 13-14 minutes. Hoping to speed this up, I ran it on a parallel stream, and became suspicious when the task finished in about 40 seconds. I found that BRISMFPredictor.predictRating was always returning 0 inside the parallel stream's body. Here's the code for both cases:

Code for initialisation:

import com.github.javacliparser.FileOption;
import com.github.javacliparser.IntOption;

import moa.options.ClassOption;
import moa.recommender.predictor.BRISMFPredictor;
import moa.recommender.predictor.RatingPredictor;
import moa.recommender.rc.data.RecommenderData;
import weka.core.converters.CSVLoader;

...

private static ClassOption datasetOption;
private static ClassOption ratingPredictorOption;
private static IntOption sampleFrequencyOption;
private static FileOption defaultFileOption;

static {
    ratingPredictorOption = new ClassOption("ratingPredictor",
            's', "Rating Predictor to evaluate on.", RatingPredictor.class,
            "moa.recommender.predictor.BRISMFPredictor");
    sampleFrequencyOption = new IntOption("sampleFrequency",
            'f', "How many instances between samples of the learning performance.", 100, 0, 2147483647);
    defaultFileOption = new FileOption("file",
            'f', "File to load.",
            "C:\\Users\\shiva\\Documents\\Java-ML\\mlapp\\data\\ml-1m\\ratings.dat", "dat", false);
}

... and inside main() (a quirk with Weka's CSVLoader required that I replace the default :: delimiter with +)

    var csvLoader = new CSVLoader();
    csvLoader.setSource(defaultFileOption.getFile());
    csvLoader.setFieldSeparator("+");
    var dataset = csvLoader.getDataSet();
    System.out.println(dataset.toSummaryString());

    var predictor = new BRISMFPredictor();
    predictor.prepareForUse();

    RecommenderData data = predictor.getData();
    data.clear();
    data.disableUpdates(false);

Now, alternating between the following snippets:

for (var instance : dataset) {
    var user = (int) instance.value(0);
    var item = (int) instance.value(1);
    var rating = instance.value(2);

    double predictedRating = predictor.predictRating(user, item);

    System.out.printf("User %d | Movie %d | Actual Rating %d | Predicted Rating %f%n",
                    user, item, Math.round(rating), predictedRating);
}

and this one (I'm new to everything concurrent):

dataset.parallelStream().forEach(instance -> {
    var user = (int) instance.value(0);
    var item = (int) instance.value(1);
    var rating = instance.value(2);

    double predictedRating = predictor.predictRating(user, item);

    System.out.printf("User %d | Movie %d | Actual Rating %d | Predicted Rating %f%n",
                    user, item, Math.round(rating), predictedRating);
});

I then figured that maybe this operation can't be done in parallel, so I switched it to a sequential stream(). Even then, the call seems to be completely ignored, since the output is again 0.0 every time:

dataset.stream().forEach(instance -> {
    var user = (int) instance.value(0);
    var item = (int) instance.value(1);
    var rating = instance.value(2);

    double predictedRating = predictor.predictRating(user, item);

    System.out.printf("User %d | Movie %d | Actual Rating %d | Predicted Rating %f%n",
                    user, item, Math.round(rating), predictedRating);
});

I have tried removing the print statement from the run, but to no avail.

As expected, the first case (the for loop) prints lines with the actual and predicted ratings and takes about 13 minutes, whereas the stream versions report a predicted rating of 0.0 every time with a suspiciously short execution time. Is there something I'm missing?

EDIT: using dataset.forEach() does the same thing. Perhaps a quirk of lambdas?
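
One way to test the "quirk of lambdas" hunch is to count how often the loop body actually runs under each iteration style. The sketch below is only a diagnostic under stated assumptions: `IterationCheck`, `ROWS`, and the `count*` methods are hypothetical names, and a plain `List<double[]>` with made-up rows stands in for `weka.core.Instances`. If all four counts agree, the lambda body is not being skipped, and the difference must come from what the body calls.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class IterationCheck {

    // Hypothetical stand-in rows in the form {user, item, rating}.
    static final List<double[]> ROWS = List.of(
            new double[] {1, 10, 5},
            new double[] {1, 20, 3},
            new double[] {2, 10, 4});

    // Classic enhanced for loop.
    static int countForLoop(List<double[]> rows) {
        AtomicInteger calls = new AtomicInteger();
        for (double[] row : rows) {
            calls.incrementAndGet();
        }
        return calls.get();
    }

    // Sequential stream with a lambda body.
    static int countStream(List<double[]> rows) {
        AtomicInteger calls = new AtomicInteger();
        rows.stream().forEach(row -> calls.incrementAndGet());
        return calls.get();
    }

    // Parallel stream; AtomicInteger keeps the count correct across threads.
    static int countParallelStream(List<double[]> rows) {
        AtomicInteger calls = new AtomicInteger();
        rows.parallelStream().forEach(row -> calls.incrementAndGet());
        return calls.get();
    }

    // Iterable.forEach with a lambda body.
    static int countForEach(List<double[]> rows) {
        AtomicInteger calls = new AtomicInteger();
        rows.forEach(row -> calls.incrementAndGet());
        return calls.get();
    }

    public static void main(String[] args) {
        System.out.println("for loop:        " + countForLoop(ROWS));
        System.out.println("stream:          " + countStream(ROWS));
        System.out.println("parallel stream: " + countParallelStream(ROWS));
        System.out.println("forEach:         " + countForEach(ROWS));
    }
}
```

The same counter could be dropped into the real loop bodies next to the `predictRating` call to confirm that each variant visits every instance.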

Vivraan
  • What is the type of `dataset`? – ernest_k Dec 23 '18 at 07:11
  • The above question is the key + what are you getting if you do `dataset.forEach`? – Grzegorz Piwowarek Dec 23 '18 at 07:18
  • and what is `predictor.predictRating`'s definition? – Naman Dec 23 '18 at 07:18
  • @ernest_k `weka.core.Instances` (http://weka.sourceforge.net/doc.dev/weka/core/Instances.html) – Vivraan Dec 23 '18 at 08:08
  • @nullpointer http://jwijffels.github.io/RMOA/MOA_2014_04/doc/apidocs/moa/recommender/rc/predictor/impl/BRISMFPredictor.html#predictRating(int,%20int) – Vivraan Dec 23 '18 at 08:10
  • @mushi what do you get when you run `dataset.stream().count()`? It's possible that the stream returned by that class has dodgy behavior... – ernest_k Dec 23 '18 at 08:11
  • [A Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve), please? It would greatly help. – Ole V.V. Dec 23 '18 at 08:20
  • @OleV.V. I updated towards _Complete_ but probably not _Minimal_, did the edit help? – Vivraan Dec 23 '18 at 08:29
  • @ernest_k the output is 1000209, as I would expect. But the results are all _0.0_. – Vivraan Dec 23 '18 at 08:30
  • @mushi what exactly are you referring to when you say "results are all 0.0"? `predictedRating`? – ernest_k Dec 23 '18 at 08:50
  • @ernest_k yes, that's the one. – Vivraan Dec 23 '18 at 10:24
  • You don't show how you train your `BRISMFPredictor`, so isn't the difference just in the way you train it? Note that it seems you are using [`moa.recommender.predictor.BRISMFPredictor`](http://jwijffels.github.io/RMOA/MOA_2014_04/doc/apidocs/moa/recommender/predictor/BRISMFPredictor.html) and not the [`moa.recommender.rc.predictor.impl.BRISMFPredictor`](http://jwijffels.github.io/RMOA/MOA_2014_04/doc/apidocs/moa/recommender/rc/predictor/impl/BRISMFPredictor.html) that you linked above in the comments (it seems there are two classes with the same name). – Didier L Dec 23 '18 at 19:16
  • @DidierL Indeed, I tried switching to the internal implementation using reflection once to check for differences but there wasn't much of a change. I also did skip the part where you must set the rating on the `RecommenderData` object since the earlier hunch was that it was not playing nice in parallel. My program is pretty much a rip-off of `moa.tasks.EvaluateOnlineRecommender` without much of the struts used: https://github.com/Waikato/moa/blob/master/moa/src/main/java/moa/tasks/EvaluateOnlineRecommender.java – Vivraan Dec 24 '18 at 06:34
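
The training point raised in the last two comments can be illustrated with a toy model. The sketch below is not MOA's BRISMF: `ToyPredictor`, `PredictThenTrain`, and the sample rows are hypothetical, and the predictor simply returns the running mean of the ratings it has been given for an item, or 0.0 if it has seen none. It only shows how a predict-only loop and a predict-then-train loop differ; whether the real `BRISMFPredictor` behaves the same way when `setRating` is skipped is an assumption to verify against the MOA sources.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class PredictThenTrain {

    // Toy stand-in for a rating predictor (NOT MOA's BRISMF).
    static class ToyPredictor {
        // item id -> {sum of ratings, number of ratings}
        private final Map<Integer, double[]> stats = new HashMap<>();

        // Returns the mean rating seen so far for the item, or 0.0 if none.
        double predictRating(int user, int item) {
            double[] s = stats.get(item);
            return s == null ? 0.0 : s[0] / s[1];
        }

        // "Training" step: record one observed rating.
        void setRating(int user, int item, double rating) {
            double[] s = stats.computeIfAbsent(item, k -> new double[2]);
            s[0] += rating;
            s[1] += 1;
        }
    }

    // Made-up rows in the form {user, item, rating}.
    static final double[][] ROWS = { {1, 10, 4}, {2, 10, 2}, {3, 10, 3} };

    // Predicts every row; feeds the true rating back only when train is true.
    static double[] run(boolean train) {
        ToyPredictor predictor = new ToyPredictor();
        double[] predictions = new double[ROWS.length];
        for (int i = 0; i < ROWS.length; i++) {
            int user = (int) ROWS[i][0];
            int item = (int) ROWS[i][1];
            double rating = ROWS[i][2];
            predictions[i] = predictor.predictRating(user, item);
            if (train) {
                predictor.setRating(user, item, rating);
            }
        }
        return predictions;
    }

    public static void main(String[] args) {
        System.out.println("predict only:       " + Arrays.toString(run(false)));
        System.out.println("predict then train: " + Arrays.toString(run(true)));
    }
}
```

In this toy, the predict-only run never leaves 0.0 and does almost no work, while the predict-then-train run starts producing non-zero predictions from the second row on.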

0 Answers