I have a pretty large DataFrame (think 10e7 rows) from which I filter elements based on some property:
val res = data.filter(data(FieldNames.myValue) === 2).select(pk.name, FieldNames.myValue)
My DataFrame has n partitions (data.rdd.getNumPartitions).
Now I want to know which partition my rows originated from. I am aware that I could iterate through all partitions with something like this:
val temp = res.first() // or foreach; this is just an example
data.foreachPartition(f => {
  // compare PKs
  f.exists(row => row.get(0) == temp.get(0))
  // my code here
})
or with data.rdd.mapPartitionsWithIndex((idx, f) => ...).
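To make the second variant concrete, a minimal sketch of what I mean is below. targetKey is a hypothetical stand-in for the primary-key value taken from res, and I assume the key sits in column 0:

```scala
// Sketch only: `data` is the original DataFrame, `targetKey` a hypothetical
// primary-key value from the filtered result `res`.
val located = data.rdd
  .mapPartitionsWithIndex((idx, rows) =>
    rows.filter(row => row.get(0) == targetKey) // keep the matching rows
        .map(row => (idx, row)))                // pair each with its partition index
// located: RDD[(Int, Row)] — (origin partition index, row)
```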
However, this seems excessive and not very performant if my results and my DataFrame become large.
Is there a Spark way to do this after I've performed the filter() operation?
Or alternatively, is there a way to rewrite the filter() statement, or an alternative to it, so that it returns the origin partition of each row?
I could also save the partition location in my DataFrame and update it on repartitioning, but I'd rather do it in a Spark way.
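The manual bookkeeping I have in mind would look roughly like this, using Spark's built-in spark_partition_id function to snapshot each row's current partition index into a column (which of course goes stale after any repartition and would have to be recomputed):

```scala
import org.apache.spark.sql.functions.spark_partition_id

// Record each row's current partition index in an extra column.
// Note: this column is only valid until the next repartition/shuffle.
val withOrigin = data.withColumn("origin_partition", spark_partition_id())
val resWithOrigin = withOrigin
  .filter(withOrigin(FieldNames.myValue) === 2)
  .select(pk.name, FieldNames.myValue, "origin_partition")
```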
(The only similar question I found was here, and neither the question nor the comment is very helpful. I also found this, which might be similar but is not the same.)
Thanks in advance for any help/pointers, and I apologize if I missed a similar question that has already been answered.