
I have an RDD (or DataFrame) of measurement data ordered by timestamp, and I need to perform a pairwise operation on two consecutive records with the same key (e.g., a trapezium integration of accelerometer data to get velocities).

Is there a function in Spark that "remembers" the last record for each key and has it available when the next record for the same key arrives?

The approach I have thought of so far is:

  1. Get all the keys of the RDD
  2. Use a custom Partitioner to partition the RDD by the found keys so I know there is one partition for each key
  3. Use mapPartitions to do the calculation

However, this approach has a flaw:

Getting the keys can be a very lengthy task, because the input data can be several GiB or even TiB in size. I could write a custom InputFormat that extracts just the keys, which would be significantly faster (as I use Hadoop's API and sc.newAPIHadoopFile to read the data in the first place), but that would be an additional thing to consider and an additional source of bugs.

So my question is: Is there anything like reduceByKey that doesn't aggregate the data but just gives me the current record and the last one for that key and lets me output one or more records based on that information?

rabejens

1 Answer


Here is what you can do with a DataFrame:

import java.sql.Timestamp
import org.apache.spark.sql.types.{TimestampType, IntegerType}
import org.apache.spark.sql.functions._

**Create a window for the lag function**
val w = org.apache.spark.sql.expressions.Window.partitionBy("key").orderBy("timestamp")

val df = spark.sparkContext.parallelize(List((1, 23, Timestamp.valueOf("2017-12-02 03:04:00")),
(1, 24, Timestamp.valueOf("2017-12-02 01:45:20")),
(1, 26, Timestamp.valueOf("2017-12-02 01:45:20")),
(1, 27, Timestamp.valueOf("2017-12-02 01:45:20")),
(2, 30, Timestamp.valueOf("2017-12-02 01:45:20")),
(2, 33, Timestamp.valueOf("2017-12-02 01:45:20")),
(2, 39, Timestamp.valueOf("2017-12-02 01:45:20")))).toDF("key","value","timestamp")

scala> df.printSchema
root
 |-- key: integer (nullable = false)
 |-- value: integer (nullable = false)
 |-- timestamp: timestamp (nullable = true)


scala> val lagDF = df.withColumn("lag_value",lag("value", 1, 0).over(w))
lagDF: org.apache.spark.sql.DataFrame = [key: int, value: int ... 2 more fields]

**The previous record and the current record are now in the same row**
scala> lagDF.show(10, false)
+---+-----+-------------------+---------+
|key|value|timestamp          |lag_value|
+---+-----+-------------------+---------+
|1  |24   |2017-12-02 01:45:20|0        |
|1  |26   |2017-12-02 01:45:20|24       |
|1  |27   |2017-12-02 01:45:20|26       |
|1  |23   |2017-12-02 03:04:00|27       |
|2  |30   |2017-12-02 01:45:20|0        |
|2  |33   |2017-12-02 01:45:20|30       |
|2  |39   |2017-12-02 01:45:20|33       |
+---+-----+-------------------+---------+

**Put your distance calculation logic here. I'm using a dummy function for the demo**
scala> val result = lagDF.withColumn("dummy_operation_for_dist_calc", lagDF("value") - lagDF("lag_value"))
result: org.apache.spark.sql.DataFrame = [key: int, value: int ... 3 more fields]

scala> result.show(10, false)
+---+-----+-------------------+---------+-----------------------------+
|key|value|timestamp          |lag_value|dummy_operation_for_dist_calc|
+---+-----+-------------------+---------+-----------------------------+
|1  |24   |2017-12-02 01:45:20|0        |24                           |
|1  |26   |2017-12-02 01:45:20|24       |2                            |
|1  |27   |2017-12-02 01:45:20|26       |1                            |
|1  |23   |2017-12-02 03:04:00|27       |-4                           |
|2  |30   |2017-12-02 01:45:20|0        |30                           |
|2  |33   |2017-12-02 01:45:20|30       |3                            |
|2  |39   |2017-12-02 01:45:20|33       |6                            |
+---+-----+-------------------+---------+-----------------------------+
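
**Sketch of the actual trapezium integration**
This is only a rough sketch along the lines of the question, reusing `df` and the window `w` from above and treating `value` as the acceleration; the `prev_value`, `prev_ts`, `dt`, `increment` and `velocity` column names are made up for this demo:

import org.apache.spark.sql.expressions.Window

// cumulative window: everything from the first row of the key up to the current row
val cumW = w.rowsBetween(Window.unboundedPreceding, Window.currentRow)

val velocity = df
  .withColumn("prev_value", lag("value", 1).over(w))
  .withColumn("prev_ts", lag("timestamp", 1).over(w))
  // time difference in seconds; null for the first record of each key
  .withColumn("dt", col("timestamp").cast("double") - col("prev_ts").cast("double"))
  // trapezium increment between the previous and the current record
  .withColumn("increment", (col("value") + col("prev_value")) / 2 * col("dt"))
  // running sum of the increments gives the velocity per key (0.0 for the first record)
  .withColumn("velocity", coalesce(sum("increment").over(cumW), lit(0.0)))

velocity.show(false)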
m-bhole
  • Combining this with the cumulative sum as per the answer in https://stackoverflow.com/questions/47874067/cumulative-sum-in-spark?rq=1 I can calculate velocity from accelerations. Nice one! – rabejens Jun 21 '18 at 16:34
  • Glad it helped! – m-bhole Jun 21 '18 at 16:45
  • I tried this with a larger amount of data, but now I get a ton of `18/06/22 10:10:30 INFO UnsafeExternalSorter: Spilling data because number of spilledRecords crossed the threshold 4096` and eventually an OOM. What can I do about this? – rabejens Jun 22 '18 at 08:10
  • Please set `spark.sql.windowExec.buffer.spill.threshold` to a higher value. The default is 4096. What version of Spark are you on? As per https://issues.apache.org/jira/browse/SPARK-21595 this issue was fixed in Spark 2.2.1 and Spark 2.3.0. – m-bhole Jun 22 '18 at 08:18
  • Oh, I am using 2.2.0, so I'll bump the version and play with these settings. – rabejens Jun 22 '18 at 08:26
  • In the meantime I tried another approach that uses RDDs instead of DataFrames: I first sort the data by time, then use a custom partitioner to partition by channel name, and use `mapPartitions` to calculate the velocity for each partition, with an old-school `var` to hold the state. Since I know which channels I have, I know how to programmatically partition the data, so this works too (a rough sketch of this variant follows below). – rabejens Jun 22 '18 at 08:39
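
Here is a rough sketch of the RDD variant described in the last comment. All names here (`Sample`, `ChannelPartitioner`, the channel list) are invented for illustration, and it swaps the separate sort for `repartitionAndSortWithinPartitions`, which keeps each channel's records in timestamp order after the shuffle:

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

case class Sample(channel: String, ts: Long, accel: Double)

// One partition per known channel, so each partition holds exactly one key.
class ChannelPartitioner(channels: Seq[String]) extends Partitioner {
  private val index = channels.zipWithIndex.toMap
  override def numPartitions: Int = channels.size
  override def getPartition(key: Any): Int =
    index(key.asInstanceOf[(String, Long)]._1)
}

// Keys by (channel, timestamp) and sorts within each partition during the
// shuffle, so each channel's records arrive at mapPartitions in time order.
def velocities(samples: RDD[Sample], channels: Seq[String]): RDD[(String, Long, Double)] =
  samples
    .map(s => ((s.channel, s.ts), s))
    .repartitionAndSortWithinPartitions(new ChannelPartitioner(channels))
    .mapPartitions { it =>
      var prev: Option[Sample] = None        // last record seen for this channel
      var v = 0.0                            // running velocity
      it.map { case (_, cur) =>
        prev.foreach { p =>
          val dt = (cur.ts - p.ts) / 1000.0  // assuming millisecond timestamps
          v += (cur.accel + p.accel) / 2.0 * dt
        }
        prev = Some(cur)
        (cur.channel, cur.ts, v)
      }
    }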