I run an ALS recommender model (558K users, about 300 products). I get the top-10 recommendations for each user via the recommendProductsForUsers call. To compute RankingMetrics, I need to reshape this into (userID, [array of 10 product IDs]) and then join it with the users' actual ratings, which are in the shape (userID, [list of products used]).
However, when I run recommendProductsForUsers(10) and try to use the result, I always run into a pyspark error, 'ImportError: No module named numpy', which cannot be the real cause, since I use numpy in pyspark elsewhere without problems. It feels more like an out-of-resources error (I am using 8g for the workers, 8g for the daemon, 8g for the driver).
Is there a better/more efficient way of reshaping the output of recommendProductsForUsers(10) into the RDD that RankingMetrics requires? What am I doing wrong? The official documentation only gives a Scala example for ranking metrics; the pyspark example there shows the wrong metrics (regression metrics). See: https://spark.apache.org/docs/2.1.0/mllib-evaluation-metrics.html#ranking-systems
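For reference, this is my understanding of the input shape RankingMetrics expects, shown as a minimal sketch with made-up product IDs (based on the Scala example in the docs): an RDD of (predicted item list, ground-truth item list) pairs, one per user.

from pyspark.mllib.evaluation import RankingMetrics

# Minimal sketch with made-up IDs: one (predicted, actual) pair per user
toy_predictionAndLabels = sc.parallelize([
    ([1, 6, 2, 7, 8], [1, 2, 3, 4, 5]),   # user A: top-5 predicted vs actually used
    ([4, 1, 5, 6, 2], [1, 2, 3]),         # user B
])
toy_metrics = RankingMetrics(toy_predictionAndLabels)
print "toy p5 %.8f" % toy_metrics.precisionAt(5)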
The code:
from time import time
from pyspark.sql import functions as F
from pyspark.mllib.evaluation import RankingMetrics

# RankingMetrics on VALIDATION + TEST set
# Get recommendations per user (top-n list)
# Need to collect() immediately or I hit the resource/numpy error!
# Then parallelize again into an RDD for the remap/join
t0_adv = time()
userRecommended = bestModel.recommendProductsForUsers(n).collect()
userRecommended_RDD = sc.parallelize(userRecommended)\
    .map(lambda (k, v): (k, [x[1] for x in v]))  # keep only the product IDs, in rank order
# Get the actual usage per user (validation_RDD is a DataFrame here, despite the name)
userMovies = validation_RDD.groupBy("new_uid")\
    .agg(F.collect_set("new_pgm").alias("actualVideos"))\
    .rdd
# join into a prediction vs actual RDD
predictionsAndLabels = userRecommended_RDD.join(userMovies).cache()
# Get the metrics
metricsRank = RankingMetrics(predictionsAndLabels.map(lambda r: r[1]))
tt_adv = time() - t0_adv
print "Time required : %.0f" % tt_adv
print "p5 %.8f" % metricsRank.precisionAt(5)
print "MAP %.8f" % metricsRank.meanAveragePrecision
print "nDCG %.8f" % metricsRank.ndcgAt(5)