How do I iterate over a Dataset in Spark 2.0 with Scala? My problem is that I need to compare two rows: I need to compare DateN with DateN-1 and calculate the difference.

 Row1 - Date1 Num1
 Row2 - Date2 Num2
 ..
 RowN - DateN NumN
coder AJ
  • Does your df contain only two rows? If not, what exactly do you want to answer given the data? Please elaborate on the problem, as there are plenty of methods available. – elcomendante Feb 12 '17 at 15:12
  • No. That's just an example. My DS has many rows. As I mentioned above, I need to compare two dates from two rows in an iteration in Scala and find their difference. – coder AJ Feb 12 '17 at 15:54
  • You want "window functions". See, for example, https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html – The Archetypal Paul Feb 12 '17 at 16:28
  • Thank you...will take a look – coder AJ Feb 12 '17 at 17:24

1 Answer

I'm not sure whether you resolved the issue with a window function, since you only want to compare rows n and n-1 and I don't see an attribute on which you want to group the data. For the requirement you describe, you can proceed as follows:

  1. Add an index to the RDD using zipWithIndex.
  2. Create an RDD for the odd-indexed rows.
  3. Create an RDD for the even-indexed rows.
  4. Now you can apply your logic to the two RDDs.

The following is a working example:

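    import org.apache.spark.sql.SparkSession

    val spark = SparkSession
      .builder
      .appName("Example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample (name, date, amount) records.
    val customers = spark.sparkContext.parallelize(List(
      ("Alice", "2016-05-01", 50.00),
      ("Alice", "2016-05-03", 45.00),
      ("Alice", "2016-05-04", 55.00),
      ("Bob", "2016-05-01", 25.00),
      ("Bob", "2016-05-04", 29.00),
      ("Bob", "2016-05-06", 27.00)))

    // Pair each record with its index; collect() brings the result to the
    // driver as a local Array, which is fine for a small example like this.
    val custIndexed = customers.zipWithIndex().collect()
    // Split into odd- and even-indexed records.
    val custOdd = custIndexed.filter(record => record._2 % 2 != 0)
    val custEven = custIndexed.filter(record => record._2 % 2 == 0)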
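
For comparison, here is a minimal sketch of the window-function route suggested in the comments, which compares every row with the one before it rather than splitting by odd and even index. It reuses the sample data above; the column names (name, date, amount) and the choice to partition by name are my assumptions, since the question does not say how the data is keyed. It is meant to be pasted into spark-shell or a Scala script with Spark on the classpath, not taken as the definitive way to do this.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, datediff, lag, to_date}

val spark = SparkSession
  .builder
  .appName("DateDiffSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Same sample records as above; the column names are assumed for illustration.
val customers = Seq(
  ("Alice", "2016-05-01", 50.00),
  ("Alice", "2016-05-03", 45.00),
  ("Alice", "2016-05-04", 55.00),
  ("Bob", "2016-05-01", 25.00),
  ("Bob", "2016-05-04", 29.00),
  ("Bob", "2016-05-06", 27.00)
).toDF("name", "date", "amount")

// Assumption: rows are compared within each customer, ordered by date.
// ISO-formatted date strings sort correctly as plain strings.
val byCustomer = Window.partitionBy("name").orderBy("date")

val withDiff = customers
  // lag(..., 1) fetches the date from the previous row in the window.
  .withColumn("prevDate", lag(col("date"), 1).over(byCustomer))
  // datediff(end, start) gives the number of days from start to end.
  .withColumn("daysSincePrev",
    datediff(to_date(col("date")), to_date(col("prevDate"))))

withDiff.show()
```

The first row of each customer has no predecessor, so prevDate and daysSincePrev are null there. If there really is no grouping attribute, Window.orderBy("date") without partitionBy also works, but Spark then moves all rows into a single partition, so it only suits small data.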
Nikhil Bhide