I want to run spark.ml.recommendation ALS on Spark 2.1.0 with PySpark on web-page-visit data. I have Wikipedia data containing user-id, page-id and counts. The data consists of 100000 rows. Here are the summary statistics of my data:

+-------+------------------+  
|summary|           user-id|
+-------+------------------+  
|  count|            100000|  
|   mean|       24542.75736|  
| stddev|21848.264794583836|  
|    min|                 0|  
|    max|             68488|
+-------+------------------+

+-------+------------------+
|summary|           page-id|
+-------+------------------+
|  count|            100000|
|   mean|         257.55426|
| stddev|265.56649346534084|
|    min|                 0|
|    max|              1317|
+-------+------------------+

+-------+------------------+
|summary|               cnt|
+-------+------------------+
|  count|            100000|
|   mean|          412.4471|
| stddev|4269.7557065972205|
|    min|              11.0|
|    max|          309268.0|
+-------+------------------+

I split my data 80/20 into training and test sets and ran ALS on it, but the result was NaN. I then found a workaround and got it working. After that, I calculated RMSE on my data; the result is around 3000-4000 for some combinations of parameters.

I've read some books and articles and watched some video tutorials on this, but most of the material is based on the MovieLens data set, which as far as I can see is rating-based and does not offer much for my problem. I've learned that my case is called implicit feedback, and the only example I've come across is the last.fm example in a book. However, I couldn't get much help from it.

So my questions are:

1) How should I handle ALS recommendation on a data set whose rating column has a much wider range than the MovieLens ratings, which lie between 1 and 5?

Mine ranges from 11 to 309268.

2) Is RMSE an important metric for implicit feedback when deciding whether the model is OK or not?

3) Any other recommendations for handling this kind of data when running spark.ml ALS on it?

dattomatto

2 Answers

Is RMSE an important metric in implicit feedback for deciding whether the model is OK or not?

It is not. Implicit-feedback model scores are on a different scale from the raw counts. As explained by Danilo Ascione, a recommended approach is described at https://stackoverflow.com/a/41162688.
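To see why the scale matters, here is a rough back-of-the-envelope check. It assumes the model was trained with implicitPrefs=True, in which case the predicted scores are small preference values rather than counts, so the squared error against the raw cnt column is approximately cnt squared. Under that assumption the RMSE mostly measures the size of the counts themselves. The numbers are the summary statistics from the question; the variable names are only for illustration.

```python
import math

# Summary statistics of the cnt column, copied from the question.
n = 100000
mean_cnt = 412.4471
stddev_cnt = 4269.7557065972205  # sample standard deviation, as reported in the summary

# Assumption: predictions are close to 0 compared with the counts,
# so each squared error is approximately cnt ** 2.
# E[cnt^2] = population variance + mean^2
population_var = stddev_cnt ** 2 * (n - 1) / n
rms_cnt = math.sqrt(population_var + mean_cnt ** 2)

print(round(rms_cnt))
```

The result is in the low thousands, the same order of magnitude as the 3000-4000 reported in the question, which suggests the RMSE there says more about the spread of the counts than about the quality of the recommendations.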

Community

Regarding your NaN problem, have you looked at coldStartStrategy, which was added to Spark not long ago (cf. https://github.com/apache/spark/pull/17102)?
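The NaN typically comes from users or pages that appear only in the test split: ALS has no factors for them, so their prediction is NaN, and a single NaN is enough to turn the whole RMSE into NaN. Where coldStartStrategy is not available, a common workaround is to drop those rows before evaluating. The plain-Python sketch below only illustrates that filtering idea; the function names are hypothetical and not part of Spark. On a DataFrame of predictions, the analogous step would presumably be something like predictions.na.drop(subset=["prediction"]).

```python
import math

def rmse(pairs):
    """Plain RMSE over (actual, predicted) pairs; one NaN poisons the result."""
    pairs = list(pairs)
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

def rmse_ignoring_nan(pairs):
    """RMSE over only the pairs whose prediction is not NaN."""
    kept = [(a, p) for a, p in pairs if not math.isnan(p)]
    if not kept:
        raise ValueError("no finite predictions to evaluate")
    return rmse(kept)

# The third row stands for a user or page unseen during training.
rows = [(3.0, 2.0), (5.0, 5.0), (4.0, float("nan"))]

naive = rmse(rows)                 # NaN, because of the third row
filtered = rmse_ignoring_nan(rows) # computed from the first two rows only
```

Keep in mind that dropping rows changes what is being measured: the score then covers only users and pages the model has already seen.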

Regarding your evaluation problem, RMSE is not a good metric when using ALS with implicit feedback, as you've found out.

In your case, ranking metrics are more appropriate. The two most common are:

Unfortunately, those are not part of Spark, as they don't really fit the Evaluator API, so you'll have to implement them yourself.
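As a minimal sketch of what implementing a ranking metric yourself can look like, here is precision at k in plain Python. Precision at k is my choice for illustration and not necessarily one of the metrics meant above. The function names and the dictionary layout (user-id mapped to a ranked list of recommended page-ids, and user-id mapped to the set of page-ids the user actually visited in the test split) are assumptions.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended pages that the user actually visited."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for page in recommended[:k] if page in relevant)
    return hits / k

def mean_precision_at_k(recs_by_user, relevant_by_user, k):
    """Average precision at k over users with at least one held-out page."""
    users = [u for u, pages in relevant_by_user.items() if pages]
    if not users:
        raise ValueError("no users with held-out pages")
    total = sum(
        precision_at_k(recs_by_user.get(u, []), relevant_by_user[u], k)
        for u in users
    )
    return total / len(users)

# Toy data: user 1 gets 2 of 3 recommendations right, user 2 gets none.
recs = {1: [10, 20, 30], 2: [40, 50, 60]}
held_out = {1: {10, 30}, 2: {70}}
score = mean_precision_at_k(recs, held_out, k=3)
```

With real data, the ranked lists would come from the model's top recommendations per user, and the held-out sets from the test split.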

BenFradet
  • Thanks for the help. Yes, I've heard about coldStartStrategy, but I couldn't get it to work in Spark 2.1.0. For the evaluation problem, I will consider implementing ranking metrics. – dattomatto Mar 27 '17 at 14:01