3

I have an RDD of type ((String, String), Timestamp). I have a large number of records and I want to select, for each key, the record with the latest Timestamp value. I have tried the following code and am still struggling to do this. Can anybody help me with this?

The code below is what I tried; it is wrong and does not work.

val context = sparkSession.read.format("jdbc")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("url", url)
  .option("dbtable", "student_risk")
  .option("user", "user")
  .option("password", "password")
  .load()
context.cache();

val studentRDD = context.rdd.map(r =>
  ((r.getString(r.fieldIndex("course_id")), r.getString(r.fieldIndex("student_id"))),
   r.getTimestamp(r.fieldIndex("risk_date_time"))))
val filteredRDD = studentRDD.collect().map(z => (z._1, z._2)).reduce((x, y) => (x._2.compareTo(y._2)))
  • z => (z._1, z._2)) isn't correct. This piece pulls out course_id and student_id instead. You have to reach for the third variable. I'm not good at Scala, so I couldn't provide an exact piece of code. – srikanth Feb 16 '17 at 08:53
  • BTW - the title states you're looking for "earliest", while the text says "latest" - I answered based on title, obviously this can easily be changed. – Tzach Zohar Feb 16 '17 at 09:16
  • A little problem in conveying my requirement in the language. What I need is to display the record with the earliest timestamp value for each key. – Kepler Feb 16 '17 at 09:29

2 Answers

7

It's easy to do directly on the DataFrame (oddly named context here):

val result = context
  .groupBy("course_id", "student_id")
  .agg(min("risk_date_time") as "risk_date_time")

Then you can convert it into an RDD (if needed) as you did before - the result has the same schema.
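
For example, a minimal sketch of mapping the aggregated result back to the question's ((String, String), Timestamp) shape (assuming the same column names as above):

// Sketch only: the same mapping as in the question, applied to the aggregated result
val resultRDD = result.rdd.map(r =>
  ((r.getString(r.fieldIndex("course_id")), r.getString(r.fieldIndex("student_id"))),
   r.getTimestamp(r.fieldIndex("risk_date_time"))))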

If you DO want to perform this over the RDD, use reduceByKey:

studentRDD.reduceByKey((t1, t2) => if (t1.before(t2)) t1 else t2)
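
A quick usage sketch (assuming studentRDD is the RDD built in the question); the result keeps the same ((course_id, student_id), Timestamp) shape:

// Sketch only: t1 and t2 are the Timestamp values of two records that share a key
val earliestPerKey = studentRDD.reduceByKey((t1, t2) => if (t1.before(t2)) t1 else t2)
earliestPerKey.collect().foreach(println) // one ((course_id, student_id), timestamp) pair per key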
Tzach Zohar
  • It gives a compilation error when I try it with the DataFrame as you mentioned in the first option. – Kepler Feb 16 '17 at 09:49
  • What fails to compile? You may need to add `import org.apache.spark.sql.functions._` to get the `min` function in scope – Tzach Zohar Feb 16 '17 at 09:50
  • I need to make one thing clear: do t1 and t2 refer to two records of the RDD, and does t1 represent the composite student_id and course_id? – Kepler Feb 16 '17 at 09:57
  • since I'm using `reduceByKey`, it treats each record as a (key, value) tuple, and the reduce function operates on the _values_ - so, t1 and t2 are the _values_ (Timestamps!) of any two records with an identical key. So for each key, all matching records will go through this function until the resulting record (value, of type Timestamp) remains – Tzach Zohar Feb 16 '17 at 10:02
3

First, your code provides incorrect results because the reduce is incorrect: the reduce function returns an Int (from compareTo) instead of one of the pairs x, y, and an Int has no ._2 member. To correct this, try:

  studentRDD.collect().map(z => (z._1, z._2)).reduce((x ,y) => if (x._2.compareTo(y._2) < 0) x else y)._1

Basically, this new function returns the record with the smaller time at each step, and then from the overall result (the smallest) you take the key.

Note that you are doing all of this on the driver because of the collect. There is no reason to collect; map and reduce work on RDDs, so you can get the same result (and still be scalable) by doing this:

  studentRDD.map(z => (z._1, z._2)).reduce((x, y) => if (x._2.compareTo(y._2) < 0) x else y)._1

You can do this directly from your context dataframe though:

val targetRow = context.agg(min(struct('risk_date_time, 'course_id, 'student_id)) as "rec").select($"rec.*").collect()(0)
val key = (targetRow.getString(1), targetRow.getString(2))
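
If the goal is the earliest full row per key rather than the single overall earliest, the same struct trick can be combined with groupBy - a sketch, assuming the same column names, with org.apache.spark.sql.functions._ and the session implicits (for the ' and $ syntax) in scope:

// Sketch only: min over a struct orders by its first field (risk_date_time)
val earliestRows = context
  .groupBy("course_id", "student_id")
  .agg(min(struct('risk_date_time, 'course_id, 'student_id)) as "rec")
  .select($"rec.*")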
Assaf Mendelson