I am trying to build a recommendation system with Spark ML ALS. The data look like this:
"User-ID";"ISBN";"Book-Rating"
276725;034545104;0
276726;0155061224;5
276727;0446520802;0
276729;052165615;3
276729;0521795028;6
I am using Spark 2.1.0 and MongoDB to load the data. Here is the piece of code that defines the DataFrame and its schema after casting:
import org.apache.spark.ml.recommendation.ALS

/*
 * Load the rating data
 */
val dfrating = spark.loadFromMongoDB(readConfig)
val bookRatings = dfrating.selectExpr("cast(User_ID as Long) User_ID", "cast(ISBN as Long) ISBN", "Book_Rating")
bookRatings.printSchema()
bookRatings.show()
// Train/test split (the exact split is not shown in the original post; 80/20 assumed here)
val Array(training, test) = bookRatings.randomSplit(Array(0.8, 0.2))
val als = new ALS().setMaxIter(10).setRegParam(0.01).setUserCol("User_ID").setItemCol("ISBN").setRatingCol("Book_Rating")
val model = als.fit(training)
When I run it, I get:
root
|-- User_ID: long (nullable = true)
|-- ISBN: long (nullable = true)
|-- Book_Rating: integer (nullable = true)
+-------+----------+-----------+
|User_ID| ISBN|Book_Rating|
+-------+----------+-----------+
| 215| 61030147| 6|
| 5750|1853260045| 0|
| 11676| 743244249| 0|
| 11676|1551665700| 0|
Caused by: java.lang.IllegalArgumentException: ALS only supports values in Integer range for columns User_ID and ISBN. Value 8.477024456E9 was out of Integer range.
at org.apache.spark.ml.recommendation.ALSModelParams$$anonfun$1.apply$mcID$sp(ALS.scala:87)
Is there another way to get this running? I found these suggestions for the same problem (How to use mllib.recommendation if the user ids are string instead of contiguous integers?, How to use long user ID in PySpark ALS, and Non-integer ids in Spark MLlib ALS), but I did not know how to begin.
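What those answers boil down to is re-indexing the non-integer IDs into a contiguous Integer range (for example with StringIndexer) before feeding them to ALS. A minimal sketch of that idea, assuming the bookRatings / training DataFrames above, and noting that both User_ID and ISBN are indexed since the error reports both as out of Integer range (the *_idx column names and userIndexer/isbnIndexer names are mine, for illustration):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.recommendation.ALS

// Index both ID columns; the indexed values start at 0.0 and fit in an Int.
val userIndexer = new StringIndexer()
  .setInputCol("User_ID").setOutputCol("User_ID_idx").setHandleInvalid("skip")
val isbnIndexer = new StringIndexer()
  .setInputCol("ISBN").setOutputCol("ISBN_idx").setHandleInvalid("skip")

// ALS now reads the indexed columns instead of the raw Long IDs.
val alsOnIndexed = new ALS()
  .setMaxIter(10).setRegParam(0.01)
  .setUserCol("User_ID_idx")
  .setItemCol("ISBN_idx")
  .setRatingCol("Book_Rating")

val indexedPipeline = new Pipeline().setStages(Array(userIndexer, isbnIndexer, alsOnIndexed))
val indexedModel = indexedPipeline.fit(training)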
Here is what I tried:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
import org.apache.spark.ml.recommendation.ALS
import spark.implicits._

val isbn_als = new StringIndexer()
  .setHandleInvalid("skip")
  .setInputCol("ISBN")
  .setOutputCol("ISBN_als")
  .fit(uRatings)
val isbn_als_reverse = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
val als = new ALS().setMaxIter(10).setRegParam(0.01).setUserCol("User_ID").setItemCol("ISBN_als").setRatingCol("Book_Rating")
/*
 * Define the order of the operations to run
 */
println("Building the pipeline")
val alsPipeline = new Pipeline().setStages(Array(isbn_als, als, isbn_als_reverse))
/*
 * Build the recommendation model from the training data
 */
println("Building the model")
val alsModel = alsPipeline.fit(training)
/*
 * Run the model on the test data, then show a sample of predictions
 */
println("Running the model on the test data")
val alsPredictions = alsModel.transform(test).na.drop()
println("Showing the predictions")
alsPredictions.select($"User_ID", $"ISBN", $"Book_Rating", $"prediction").show(20)
But I get this exception when I use IndexToString() in the pipeline:
Building the pipeline
Building the model
Running the model on the test data
Exception in thread "main" java.lang.ClassCastException: org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to org.apache.spark.ml.attribute.NominalAttribute
at org.apache.spark.ml.feature.IndexToString.transform(StringIndexer.scala:313)
at org.apache.spark.ml.PipelineModel$$anonfun$transform$1.apply(Pipeline.scala:305)
at org.apache.spark.ml.PipelineModel$$anonfun$transform$1.apply(Pipeline.scala:305)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
When I do not use IndexToString(), I get negative predictions:
+-------+---------+-----------+-------------+
|User_ID| ISBN|Book_Rating| prediction|
+-------+---------+-----------+-------------+
| 140340|786881852| 10| 6.9798374|
| 127327|786881852| 0|-1.2718141E-4|
| 103336|786881852| 0| 1.2374072|
| 138578|786881852| 9| 8.200257|
| 172742|786881852| 0| -1.3278971|
| 31909|786881852| 6| 5.997123|
| 69554|786881852| 5| 2.819587|
| 173650|786881852| 0| 0.42850634|
I suppose the negative predictions are due to IndexToString() not being used. If so, how do I use IndexToString() in the pipeline?
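For reference, here is a minimal sketch of how IndexToString can be given its labels explicitly (taken from the fitted StringIndexerModel), so it does not need NominalAttribute metadata on its input column, which is what the UnresolvedAttribute ClassCastException above points at. In this sketch it is applied to the indexed ISBN column rather than to prediction, since the ALS prediction column holds the predicted rating values shown in the table above. The ISBN_original column name and the isbnBack variable are illustrative; this is only an API sketch, not a verified fix for this pipeline:

import org.apache.spark.ml.feature.IndexToString

// Map the indexed ISBN values back to the original ISBNs, supplying the
// labels explicitly from the fitted StringIndexerModel isbn_als defined above.
// "ISBN_original" is an illustrative output column name.
val isbnBack = new IndexToString()
  .setInputCol("ISBN_als")
  .setOutputCol("ISBN_original")
  .setLabels(isbn_als.labels)

// alsPredictions here is the output of the pipeline run without the
// IndexToString stage, which still contains the ISBN_als column.
val predictionsWithIsbn = isbnBack.transform(alsPredictions)
predictionsWithIsbn.select($"User_ID", $"ISBN_original", $"Book_Rating", $"prediction").show(20)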