I have an RDD (or DataFrame) of measurement data that is ordered by timestamp, and I need to do a pairwise operation on two subsequent records for the same key (e.g., a trapezium integration of accelerometer data to get velocities).

Is there a function in Spark that "remembers" the last record for each key and has it available when the next record for the same key arrives?
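For concreteness, this is the kind of pairwise operation I mean, in plain Python outside Spark (the sample values are made up):

```python
# Trapezium rule applied to two subsequent samples of one key:
#   v_next = v + (a_prev + a_curr) / 2 * (t_curr - t_prev)
def integrate_pair(v_prev, prev, curr):
    """prev and curr are (timestamp, acceleration) tuples."""
    t0, a0 = prev
    t1, a1 = curr
    return v_prev + (a0 + a1) / 2.0 * (t1 - t0)

samples = [(0.0, 0.0), (1.0, 2.0), (2.0, 2.0)]  # (t, a) for ONE key
v = 0.0
velocities = []
for prev, curr in zip(samples, samples[1:]):
    v = integrate_pair(v, prev, curr)
    velocities.append(v)
# velocities == [1.0, 3.0]
```

Each output record needs exactly the current input record plus the previous one for the same key, nothing more.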
I currently thought of this approach:

- Get all the keys of the RDD.
- Use a custom Partitioner to partition the RDD by the found keys, so that there is one partition per key.
- Use mapPartitions to do the calculation.
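The mapPartitions step would then run a function like the following over each single-key partition (a plain-Python sketch of what I have in mind; the record layout `(key, (timestamp, acceleration))` is my assumption):

```python
def pairwise_velocity(records):
    """records: iterator of (key, (timestamp, acceleration)) tuples for
    one key, ordered by timestamp.  Yields (key, (timestamp, velocity)),
    integrating with the trapezium rule and remembering only the
    previous record, so the partition is never materialised in memory."""
    it = iter(records)
    try:
        key, (t_prev, a_prev) = next(it)
    except StopIteration:
        return  # empty partition
    v = 0.0
    for key, (t, a) in it:
        v += (a_prev + a) / 2.0 * (t - t_prev)
        yield (key, (t, v))
        t_prev, a_prev = t, a

# In Spark this would be passed as rdd.mapPartitions(pairwise_velocity)
```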
However, this approach has a flaw: getting the keys can be a very lengthy task, because the input data can be several GiB or even TiB in size. I could write a custom InputFormat that just extracts the keys, which would be significantly faster (since I use Hadoop's API and sc.newAPIHadoopFile to get the data in the first place), but that would be an additional thing to consider and an additional source of bugs.
So my question is: is there anything like reduceByKey that doesn't aggregate the data, but instead gives me the current record together with the last one for that key, and lets me output one or more records based on that information?
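To pin down the contract I am after, here is the desired behaviour sketched in plain Python over an already-collected list (records are ordered by timestamp; keys and values are made up):

```python
def pairs_by_key(records):
    """records: (key, value) tuples, globally ordered by timestamp.
    Yields (key, (previous_value, current_value)) for every pair of
    subsequent records sharing a key, by remembering only the last
    value seen per key."""
    last_seen = {}
    out = []
    for key, value in records:
        if key in last_seen:
            out.append((key, (last_seen[key], value)))
        last_seen[key] = value  # "remember" the last record for this key
    return out

records = [("a", 1), ("b", 10), ("a", 2), ("a", 3), ("b", 20)]
# pairs_by_key(records) == [("a", (1, 2)), ("a", (2, 3)), ("b", (10, 20))]
```

The question is whether Spark offers an operator with these semantics that avoids collecting or fully grouping the data per key.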