I am looping over a number of CSV data files using R/Spark. About 1% of each file must be retained (filtered based on certain criteria) and merged with the next data file (I have used union/rbind). However, as the loop runs, the lineage of the data grows longer and longer, because Spark remembers all of the previous datasets and filter() calls.
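To illustrate, here is a minimal sketch of the loop I am running. The file paths, column name (score), and filter threshold are placeholders for my actual criteria:

```r
library(SparkR)
sparkR.session()

# Hypothetical directory of CSV files, for illustration only
files <- list.files("/data/csv", pattern = "\\.csv$", full.names = TRUE)

merged <- NULL
for (f in files) {
  df <- read.df(f, source = "csv", header = "true", inferSchema = "true")
  kept <- filter(df, df$score > 0.99)   # roughly 1% of rows survive
  merged <- if (is.null(merged)) kept else union(merged, kept)
  # Each iteration adds another read + filter + union to the logical plan
  # of `merged`, so the lineage keeps growing and later iterations slow down.
}
```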
Is there a way to do checkpointing in the Spark R API? I have learned that Spark 2.1 has checkpointing for DataFrames, but this does not seem to be available from R.