
Hi, I have two RDDs that I want to combine into one. The first RDD has the format

//((UserID,MovID),Rating)
val predictions =
model.predict(user_mov).map { case Rating(user, mov, rate) =>
  ((user, mov), rate)
}

I have another RDD

//((UserID,MovID),"N/A")
val user_mov_rat=user_mov.map(x=>(x,"N/A"))

So the second RDD has more keys, which overlap with those in RDD1. I need to combine the RDDs so that only those keys of the second RDD that are not already present in RDD1 are appended to RDD1.
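To make the desired semantics concrete: this is a "keep-first" union on keys. A minimal plain-Python sketch, with dicts standing in for the keyed RDDs and hypothetical toy values:

```python
# Dicts stand in for the keyed RDDs; the values here are made up.
predictions = {("U1", "M1"): 3.5, ("U2", "M2"): 4.0}            # ((UserID, MovID), Rating)
user_mov_rat = {("U1", "M1"): "N/A", ("U2", "M2"): "N/A",
                ("U3", "M1"): "N/A"}                            # ((UserID, MovID), "N/A")

# Keep every prediction; append only those keys of the second RDD
# that are not already present in the first.
combined = dict(predictions)
for key, value in user_mov_rat.items():
    if key not in combined:
        combined[key] = value

print(combined)  # {('U1', 'M1'): 3.5, ('U2', 'M2'): 4.0, ('U3', 'M1'): 'N/A'}
```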

2 Answers


You can do something like this:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
// In spark-shell, sc and the toDF implicits are already in scope;
// in a standalone app you also need the session's implicits,
// e.g. import spark.implicits._

// Setting up the RDDs as described in the question
case class UserRating(user: String, mov: String, rate: Int = -1)

val list1 = List(UserRating("U1", "M1", 1),UserRating("U2", "M2", 3),UserRating("U3", "M1", 3),UserRating("U3", "M2", 1),UserRating("U4", "M2", 2))

val list2 = List(UserRating("U1", "M1"),UserRating("U5", "M4", 3),UserRating("U6", "M6"),UserRating("U3", "M2"), UserRating("U4", "M2"), UserRating("U4", "M3", 5))

val rdd1 = sc.parallelize(list1)
val rdd2 = sc.parallelize(list2)

// Convert to Dataframe so it is easier to handle    
val df1 = rdd1.toDF
val df2 = rdd2.toDF

// What we got:
df1.show
+----+---+----+
|user|mov|rate|
+----+---+----+
|  U1| M1|   1|
|  U2| M2|   3|
|  U3| M1|   3|
|  U3| M2|   1|
|  U4| M2|   2|
+----+---+----+

df2.show
+----+---+----+
|user|mov|rate|
+----+---+----+
|  U1| M1|  -1|
|  U5| M4|   3|
|  U6| M6|  -1|
|  U3| M2|  -1|
|  U4| M2|  -1|
|  U4| M3|   5|
+----+---+----+

// Figure out the extra reviews in second dataframe that do not match (user, mov) in first    
val xtraReviews = df2.join(df1.withColumnRenamed("rate", "rate1"), Seq("user", "mov"), "left_outer").where("rate1 is null")
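The left-outer join plus the "rate1 is null" filter is effectively a left anti-join: only rows of df2 whose (user, mov) key does not occur in df1 survive. A plain-Python sketch of that filter, using the same toy rows as above:

```python
# Rows mirror the toy DataFrames above: (user, mov, rate).
df1 = [("U1", "M1", 1), ("U2", "M2", 3), ("U3", "M1", 3),
       ("U3", "M2", 1), ("U4", "M2", 2)]
df2 = [("U1", "M1", -1), ("U5", "M4", 3), ("U6", "M6", -1),
       ("U3", "M2", -1), ("U4", "M2", -1), ("U4", "M3", 5)]

# Anti-join: keep the df2 rows whose (user, mov) key is absent from df1.
keys_in_df1 = {(user, mov) for user, mov, _ in df1}
xtra_reviews = [row for row in df2 if (row[0], row[1]) not in keys_in_df1]

print(xtra_reviews)  # [('U5', 'M4', 3), ('U6', 'M6', -1), ('U4', 'M3', 5)]
```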

// Union them. Be careful because of this: http://stackoverflow.com/questions/32705056/what-is-going-wrong-with-unionall-of-spark-dataframe

def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
    val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq
    a.select(columns: _*).union(b.select(columns: _*))
}
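The point of unionByName is to align the two frames on shared column names before unioning, because a plain positional union silently mismatches data when column orders differ. A plain-Python sketch of that alignment, with dicts as rows:

```python
# Dicts stand in for DataFrame rows; b lists the same fields in a
# different order, which is exactly when a positional union goes wrong.
a = [{"user": "U1", "mov": "M1", "rate": 1}]
b = [{"rate": 5, "user": "U4", "mov": "M3"}]

# Align both tables on the shared column names, then concatenate.
columns = sorted(set(a[0]) & set(b[0]))
union = [tuple(row[c] for c in columns) for row in a + b]

print(columns)  # ['mov', 'rate', 'user']
print(union)    # [('M1', 1, 'U1'), ('M3', 5, 'U4')]
```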

// Final result of combining only unique values in df2    
unionByName(df1, xtraReviews).show

+----+---+----+
|user|mov|rate|
+----+---+----+
|  U1| M1|   1|
|  U2| M2|   3|
|  U3| M1|   3|
|  U3| M2|   1|
|  U4| M2|   2|
|  U5| M4|   3|
|  U4| M3|   5|
|  U6| M6|  -1|
+----+---+----+
Sachin Tyagi

It might also be possible to do it in this way:

  1. RDD operations are comparatively slow, so read your data as DataFrames, or convert your RDDs to DataFrames.
  2. Use Spark's dropDuplicates() on both DataFrames, e.g. df.dropDuplicates(Seq("Key1", "Key2")), to get rows that are distinct on the keys in each DataFrame, and then
  3. simply union them with df1.union(df2).

The benefit is that you are doing it the Spark way, so you keep all of Spark's parallelism and speed.
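One way the steps above could play out, sketched in plain Python on hypothetical toy rows. Note one assumption beyond the answer's text: after the union, the result is deduplicated on the keys once more, keeping the first occurrence so that df1's rows win over df2's placeholders:

```python
# Toy rows: (user, mov, rate). Values are made up for illustration.
df1 = [("U1", "M1", 1), ("U3", "M2", 1)]
df2 = [("U1", "M1", -1), ("U4", "M3", 5)]

def drop_duplicates(rows):
    """Keep the first row seen for each (user, mov) key."""
    seen, out = set(), []
    for row in rows:
        key = (row[0], row[1])
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

# Dedup each side, union, then dedup the union on the keys so rows
# from df1 win (this last dedup is the assumption noted above).
combined = drop_duplicates(drop_duplicates(df1) + drop_duplicates(df2))
print(combined)  # [('U1', 'M1', 1), ('U3', 'M2', 1), ('U4', 'M3', 5)]
```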

Uday Shankar Singh