I am looping over a number of CSV data files using R/Spark. About 1% of each file must be retained (filtered based on certain criteria) and merged with the next data file (I have used union/rbind). However, as the loop runs, the lineage of the data grows longer and longer, because Spark remembers all of the previous datasets and filter() calls.
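To illustrate, here is a minimal sketch of the loop I am running. The file paths, column name (score), and filter threshold are placeholders for my actual criteria:

```r
library(SparkR)
sparkR.session()

# Hypothetical directory of CSV files, for illustration only
files <- list.files("/data/csv", pattern = "\\.csv$", full.names = TRUE)

merged <- NULL
for (f in files) {
  df <- read.df(f, source = "csv", header = "true", inferSchema = "true")
  kept <- filter(df, df$score > 0.99)   # roughly 1% of rows survive
  merged <- if (is.null(merged)) kept else union(merged, kept)
  # Each iteration adds another read + filter + union to the logical plan
  # of `merged`, so the lineage keeps growing and later iterations slow down.
}
```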
Is there a way to do checkpointing in the Spark R API? I have learned that Spark 2.1 has checkpointing for DataFrames, but this does not seem to be available from R.