-1

I have a Spark SQL dataset whose schema is defined as follows:

User_id <String> | Item_id <String> | Bought_Status <Boolean>

I would like to convert this to a sparse matrix to apply recommender-system algorithms. This is a very large RDD dataset, so I read that CoordinateMatrix is the right way to create a sparse matrix out of it.

However I got stuck at a point where the API doc says that RDD[MatrixEntry] is mandatory to create a CoordinateMatrix. Also MatrixEntry needs a format of int,int, long.

I am not able to convert my data schema to this format. Can you please help me on how to convert this data to a sparse matrix in Spark? I am currently programming in Scala.

Ramesh Maharjan
  • 41,071
  • 6
  • 69
  • 97
Rengasami Ramanujam
  • 1,858
  • 4
  • 19
  • 29

1 Answer

1

Please note that MatrixEntry is of type (Long, Long, Double), not (Int, Int, Long).

Reference: https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry

Also, since the user/item columns are strings, they need to be indexed before processing. Here is how you can create a CoordinateMatrix in Scala:

//Imports needed
scala> import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix

scala> import org.apache.spark.mllib.linalg.distributed.MatrixEntry
import org.apache.spark.mllib.linalg.distributed.MatrixEntry

scala> import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.StringIndexer

//Let's create a dummy dataframe
scala> val df = spark.sparkContext.parallelize(List(
     | ("u1","i1" ,true),
     | ("u1","i2" ,true),
     | ("u2","i3" ,false),
     | ("u2","i4" ,false),
     | ("u3","i1" ,true),
     | ("u3","i3" ,true),
     | ("u4","i3" ,false),
     | ("u4","i4" ,false))).toDF("user","item","bought")
df: org.apache.spark.sql.DataFrame = [user: string, item: string ... 1 more field]

scala> df.show
+----+----+------+
|user|item|bought|
+----+----+------+
|  u1|  i1|  true|
|  u1|  i2|  true|
|  u2|  i3| false|
|  u2|  i4| false|
|  u3|  i1|  true|
|  u3|  i3|  true|
|  u4|  i3| false|
|  u4|  i4| false|
+----+----+------+

//Index user/ item columns
scala> val indexer1 = new StringIndexer().setInputCol("user").setOutputCol("userIndex")
indexer1: org.apache.spark.ml.feature.StringIndexer = strIdx_2de8d35b8301

scala> val indexed1 = indexer1.fit(df).transform(df)
indexed1: org.apache.spark.sql.DataFrame = [user: string, item: string ... 2 more fields]

scala> val indexer2 = new StringIndexer().setInputCol("item").setOutputCol("itemIndex")
indexer2: org.apache.spark.ml.feature.StringIndexer = strIdx_493ce45dbec3

scala> val indexed2 = indexer2.fit(indexed1).transform(indexed1)
indexed2: org.apache.spark.sql.DataFrame = [user: string, item: string ... 3 more fields]

scala> val tempDF = indexed2.withColumn("userIndex",indexed2("userIndex").cast("long")).withColumn("itemIndex",indexed2("itemIndex").cast("long")).withColumn("bought",indexed2("bought").cast("double")).select("userIndex","itemIndex","bought")
tempDF: org.apache.spark.sql.DataFrame = [userIndex: bigint, itemIndex: bigint ... 1 more field]

scala> tempDF.show
+---------+---------+------+
|userIndex|itemIndex|bought|
+---------+---------+------+
|        0|        1|   1.0|
|        0|        3|   1.0|
|        1|        0|   0.0|
|        1|        2|   0.0|
|        2|        1|   1.0|
|        2|        0|   1.0|
|        3|        0|   0.0|
|        3|        2|   0.0|
+---------+---------+------+

//Create coordinate matrix of size 4*4
scala> val corMat = new CoordinateMatrix(tempDF.rdd.map(m => MatrixEntry(m.getLong(0),m.getLong(1),m.getDouble(2))), 4, 4)
corMat: org.apache.spark.mllib.linalg.distributed.CoordinateMatrix = org.apache.spark.mllib.linalg.distributed.CoordinateMatrix@16be6b36

//Check the content of coordinate matrix
scala> corMat.entries.collect
res2: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] = Array(MatrixEntry(0,1,1.0), MatrixEntry(0,3,1.0), MatrixEntry(1,0,0.0), MatrixEntry(1,2,0.0), MatrixEntry(2,1,1.0), MatrixEntry(2,0,1.0), MatrixEntry(3,0,0.0), MatrixEntry(3,2,0.0))

Hope, this helps!

m-bhole
  • 1,189
  • 10
  • 21