I want to run spark.ml.recommendation ALS on Spark 2.1.0 with PySpark, using web-page-visit data. I have Wikipedia data containing user-id, page-id, and count columns; the data consists of 100,000 rows. Here are the summary statistics of my data:
+-------+------------------+------------------+------------------+
|summary|           user-id|           page-id|               cnt|
+-------+------------------+------------------+------------------+
|  count|            100000|            100000|            100000|
|   mean|       24542.75736|         257.55426|          412.4471|
| stddev|21848.264794583836|265.56649346534084|4269.7557065972205|
|    min|                 0|                 0|              11.0|
|    max|             68488|              1317|          309268.0|
+-------+------------------+------------------+------------------+
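(The summary above is the output of describe() on the three columns, consolidated into one table; df stands for my visit DataFrame.)

df.describe("user-id", "page-id", "cnt").show()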
I split my data 80/20 into training and test sets and tried to run ALS on it, but the evaluation came out as NaN. Then I found a workaround and got it working. After that, I calculated the RMSE on my data; the result is around 3000-4000 with some combinations of parameters.
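For reference, here is a minimal sketch of roughly what I'm running (assuming the visit DataFrame is called df and keeping the column names from above; the NaN filter is the workaround I mentioned, since Spark 2.1 has no coldStartStrategy option):

from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import col, isnan

# 80/20 split, as described above
train, test = df.randomSplit([0.8, 0.2], seed=42)

# implicitPrefs=True treats cnt as confidence in an implicit preference,
# not as an explicit rating; rank/regParam/alpha are just example values
als = ALS(userCol="user-id", itemCol="page-id", ratingCol="cnt",
          implicitPrefs=True, rank=10, regParam=0.1, alpha=40.0, maxIter=10)
model = als.fit(train)

# users/pages unseen in training get NaN predictions in Spark 2.1,
# so drop them before evaluating; this is what fixes the NaN result
predictions = model.transform(test).filter(~isnan(col("prediction")))

evaluator = RegressionEvaluator(metricName="rmse", labelCol="cnt",
                                predictionCol="prediction")
print(evaluator.evaluate(predictions))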
I've read some books and articles and watched some video tutorials on this, but most of the material is built around the MovieLens data set, which is rating-based as far as I can see, and doesn't offer much for my problem. I've learned that my case is called implicit feedback, and the only implicit-feedback example I've come across is a Last.fm example in a book. However, I couldn't get much help from it.
So my questions are:
1) How do I handle ALS recommendation on a data set whose rating column has a much wider range than MovieLens's 1-5? Mine ranges from 11 to 309268 (one rescaling idea I've considered is sketched below, after the questions).
2) Is RMSE an important metric for deciding whether an implicit-feedback model is OK or not?
3) Any other recommendations for handling this kind of data when running spark-ml ALS on it?
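Regarding question 1, one transformation I've considered (not sure if it's the right approach) is compressing the cnt range with a log transform before feeding it to ALS:

from pyspark.sql.functions import log1p, col

# hypothetical rescaling: log1p maps the 11..309268 range to roughly
# 2.5..12.6, much closer to a MovieLens-style scale
df_scaled = df.withColumn("cnt_log", log1p(col("cnt")))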