
I have this code:

import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.recommendation.ALS

val userIndexer: StringIndexer = new StringIndexer()
  .setInputCol("userKey")
  .setOutputCol("user")
val userIndexerModel = userIndexer.fit(ratings)
val alsRatings = userIndexerModel.transform(ratings)
// alsRatings.rdd still gets mapped to mllib Rating(user, product, rating) before training (omitted here)
val matrixFactorizationModel = ALS.trainImplicit(alsRatings.rdd, rank = 10, iterations = 10)
val rec = matrixFactorizationModel.recommendProductsForUsers(20)

This gives me back recommendations with user IDs, but I want my user key strings back. What is the most efficient way to do it? Thanks.
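For reference, the kind of int-to-string conversion I mean looks roughly like this (just a sketch; `userIndexerModel` is the fitted indexer from above, and `rec` is the RDD returned by `recommendProductsForUsers`):

// Rough sketch: map the integer user ids in the recommendations back to the
// original string keys via the labels stored on the fitted StringIndexerModel.
val labels: Array[String] = userIndexerModel.labels   // index -> original userKey
val recWithKeys = rec.map { case (userId, recs) =>
  (labels(userId), recs)                              // int id back to string key
}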

PS: I honestly cannot understand why the ALS library developers don't accept string labels. It's extremely painful and expensive to deal with the conversions (string to int and then int back to string) from the outside. I hope there is an issue for this somewhere in their backlog.

italktothewind
  • For example with [`IndexToString`](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.IndexToString). The same API in Python: http://stackoverflow.com/q/33636944/1560062 – zero323 Mar 07 '17 at 20:38
  • `IndexToString` does not work when you have another DataFrame; it uses metadata in the same DataFrame where the `StringIndexer` was applied. – italktothewind Mar 07 '17 at 22:01
  • It works just fine if you use it correctly :) Check for example `setLabels`; a sketch of that follows these comments. – zero323 Mar 08 '17 at 00:55
  • Yes, but `setLabels` implies collecting the labels on a single node, because it works with arrays, not with RDDs or Datasets. That may not scale if the labels array is really big :/ – italktothewind Mar 08 '17 at 14:31
  • You know that `StringIndexer` already stores all the labels in driver memory, right? – zero323 Mar 08 '17 at 16:02
  • Hmm, so `StringIndexer` doesn't scale when the labels don't fit in driver memory? Handling this mapping is so cumbersome... – italktothewind Mar 08 '17 at 18:20
  • You know, you can easily store a maximum-size (`Integer.MAX_VALUE`) array in the memory of a single machine (yeah, it will be largish), so the problem is purely hypothetical. – zero323 Mar 08 '17 at 18:42
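
A minimal sketch of the approach zero323 describes, reusing the labels from the fitted model (`someOtherDf` and the column names are placeholders; the input column has to be numeric):

import org.apache.spark.ml.feature.IndexToString

// Reuse the labels collected by the fitted StringIndexerModel to turn an index
// column on any DataFrame back into the original strings.
val converter = new IndexToString()
  .setInputCol("user")                    // numeric index column
  .setOutputCol("userKey")                // restored string column
  .setLabels(userIndexerModel.labels)     // labels from the StringIndexer fit
val restored = converter.transform(someOtherDf)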

1 Answer


I generally run the StringIndexer, collect the labels on the driver, and parallelize the labels together with their index. Then, instead of calling transform on the StringIndexer, I join the DataFrames to get the same result a StringIndexer would give.

import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.recommendation.ALS
import spark.implicits._ // for toDF

// Fit the StringIndexer to learn the id -> index mapping.
val swidConverter = new StringIndexer()
  .setInputCol("id")
  .setOutputCol("idIndex")
  .fit(df)

// Parallelize the collected labels together with their index.
val idDf = spark.sparkContext.parallelize(
    swidConverter.labels.zipWithIndex
  ).toDF("id", "idIndex").repartition(PARTITION_SIZE) // set the partition size depending on your data size

// Join idDf with the actual data to get the same result a StringIndexer transform would give.
val indexedDF = df.join(idDf, idDf.col("id") === df.col("id"))
  .select("idIndex", "product_id", "rating")

val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("idIndex")
  .setItemCol("product_id")
  .setRatingCol("rating")

val model = als.fit(indexedDF)
val resultRaw = model.recommendForAllUsers(4)

// Join the result with idDf to get the original id back from the indexed id.
val resultDf = resultRaw.join(idDf, resultRaw.col("idIndex") === idDf.col("idIndex"))
  .select("id", "recommendations")
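
Not from the original answer, but if you then want one row per recommendation, you could flatten the result like this (assuming the default `recommendForAllUsers` output, where `recommendations` is an array of structs of the item column and a rating):

import org.apache.spark.sql.functions.{col, explode}

// Flatten the recommendations array into one (id, product_id, rating) row each.
val flatResult = resultDf
  .select(col("id"), explode(col("recommendations")).as("rec"))
  .select(col("id"), col("rec.product_id"), col("rec.rating"))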
Vishnu667