I have the following DataFrame: df
At some point I need to filter out records based on timestamps (milliseconds). However, it is important to me to record how many records were filtered out (if there are too many, I want to fail the job). Naively I can do:
// ====== lots of calculations on df ======
val df_filtered = df.filter($"ts" >= startDay && $"ts" <= endDay)  // keep rows in [startDay, endDay]
val filtered_count = df.count - df_filtered.count                  // two separate actions
However, this feels like complete overkill, since Spark will execute the whole lineage three times (the filter plus the two counts). This task is really easy in Hadoop MapReduce, since I can maintain a counter for every filtered row. Is there a more efficient way? The only thing I could find is accumulators, but I can't see how to connect them to filter.
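For reference, here is a minimal sketch of what I mean by wiring an accumulator into the filter, assuming a LongAccumulator, a SparkSession named spark, and that ts is a Long column (the predicate moves into a typed filter so the accumulator can be bumped per rejected row):

import org.apache.spark.sql.Row

val rejectedAcc = spark.sparkContext.longAccumulator("rejected_rows")

// Typed filter: count each rejected row as a side effect of the predicate
val df_filtered = df.filter { row: Row =>
  val ts = row.getAs[Long]("ts")
  val keep = ts >= startDay && ts <= endDay
  if (!keep) rejectedAcc.add(1)
  keep
}

The catch is that rejectedAcc.value is only populated after an action has actually run on df_filtered, and since the update happens inside a transformation, it can over-count if tasks are retried.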
A suggested approach was to cache df before the filter; however, I would prefer to keep that option as a last resort due to the size of the DataFrame.
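For completeness, the cache-based variant I'd like to avoid would look roughly like this (a sketch assuming df fits within the available memory/disk budget, which is exactly my concern):

import org.apache.spark.storage.StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk rather than recompute
val df_filtered = df.filter($"ts" >= startDay && $"ts" <= endDay)
val filtered_count = df.count - df_filtered.count  // both counts reuse the cached df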