I run an ALS recommender model (558K users, about 300 products). I get the top-10 recommendations for each user via the recommendProductsForUsers call. To compute RankingMetrics, I need to reshape this into (userID, [array of 10 product IDs]) and then join it with the users' actual ratings, which are in the shape (userID, [list of products used]).
However, when I run recommendProductsForUsers(10) and try to use the result, I always run into a pyspark error, 'ImportError: No module named numpy', which cannot be the real cause, since I use numpy in pyspark elsewhere without problems. It feels more like an out-of-resources error (I am using 8g for the workers, 8g for the daemon, 8g for the driver).
Is there a better/more efficient way of reshaping the output of recommendProductsForUsers(10) into the RDD that RankingMetrics requires? What am I doing wrong? The official documentation only gives a Scala example for ranking metrics; the pyspark example there shows the wrong metrics (regression metrics). See: https://spark.apache.org/docs/2.1.0/mllib-evaluation-metrics.html#ranking-systems
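For reference, this is my understanding of the input shape RankingMetrics expects, shown as a minimal sketch with made-up product IDs (based on the Scala example in the docs): an RDD of (predicted item list, ground-truth item list) pairs, one per user.

from pyspark.mllib.evaluation import RankingMetrics

# Minimal sketch with made-up IDs: one (predicted, actual) pair per user
toy_predictionAndLabels = sc.parallelize([
    ([1, 6, 2, 7, 8], [1, 2, 3, 4, 5]),   # user A: top-5 predicted vs actually used
    ([4, 1, 5, 6, 2], [1, 2, 3]),         # user B
])
toy_metrics = RankingMetrics(toy_predictionAndLabels)
print "toy p5 %.8f" % toy_metrics.precisionAt(5)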
The code:
from time import time
from pyspark.sql import functions as F
from pyspark.mllib.evaluation import RankingMetrics

# RankingMetrics on VALIDATION + TEST set
# Get recommendations per user (top-n list)
# Need to collect() immediately or I hit the resource/numpy error!
# Then parallelize again into an RDD for the remap/join
t0_adv = time()
userRecommended = bestModel.recommendProductsForUsers(n).collect()
userRecommended_RDD = sc.parallelize(userRecommended)\
    .map(lambda (k, v): (k, [x[1] for x in v]))  # keep only the product IDs, in rank order
# Get the actual usage per user (validation_RDD is a DataFrame here, despite the name)
userMovies = validation_RDD.groupBy("new_uid")\
    .agg(F.collect_set("new_pgm").alias("actualVideos"))\
    .rdd
# join into a prediction vs actual RDD
predictionsAndLabels = userRecommended_RDD.join(userMovies).cache()
# Get the metrics
metricsRank = RankingMetrics(predictionsAndLabels.map(lambda r: r[1]))
tt_adv = time() - t0_adv
print "Time required : %.0f" % tt_adv
print "p5 %.8f" % metricsRank.precisionAt(5)
print "MAP %.8f" % metricsRank.meanAveragePrecision
print "nDCG %.8f" % metricsRank.ndcgAt(5)