I have created an RDD as follows:

scala> val x = List(1,2,3,4)
x: List[Int] = List(1, 2, 3, 4)

scala> val y = sc.parallelize(x,2)
y: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:29

scala> val z = y.map( c => c*2)
z: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[11] at map at <console>:31

scala> sc.setCheckpointDir("/tmp/chk")

scala> z.checkpoint

scala> z.count
res32: Long = 4

My question is: how can I read the data back from the checkpoint directory?

shakedzy
sraj
  • Why do you want to do that? – Yuval Itzchakov Aug 28 '16 at 11:53
  • I have read in many places that checkpointed data can be read after the application has completed, so I wrote the code above and checkpointed z. After z.count it created the directory "chk" and, inside it, a second directory with a long name. Inside that long-named directory it created a directory rdd-2, which contained a part-00000 file. I then closed the Scala console and reopened it, but I was not able to read the RDD data in part-00000. So I want to know how to read an RDD from the part-00000 file. I am just researching. – sraj Aug 29 '16 at 10:50

1 Answer

As @Yuval Itzchakov points out, we don't really need to work with checkpoint files directly. Spark uses checkpoints to achieve fault tolerance. They are used extensively in streaming jobs to checkpoint state: when an executor fails, a new one can be spawned and the data can be loaded from the checkpoint.
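To see what checkpointing does inside a single running application, here is a minimal sketch based on the code in the question. The object name CheckpointSketch, the app name, and the local[2] master are my own choices for illustration, not anything from the question. It only asks the RDD whether it was checkpointed and where the files went; it does not try to read those files from a new session.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, named for illustration only.
object CheckpointSketch {

  // Returns (isCheckpointed, checkpoint location if any, element count).
  def run(checkpointDir: String): (Boolean, Option[String], Long) = {
    // local[2] is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      sc.setCheckpointDir(checkpointDir)
      val z = sc.parallelize(List(1, 2, 3, 4), 2).map(_ * 2)
      z.checkpoint()      // only marks the RDD; nothing is written yet
      val n = z.count()   // the first action triggers the checkpoint write
      (z.isCheckpointed, z.getCheckpointFile, n)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (done, location, n) = run("/tmp/chk")
    println(s"checkpointed=$done location=$location count=$n")
  }
}
```

The location reported by getCheckpointFile should correspond to the rdd-N directory you found under the long-named directory.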

Checkpoints also cause problems when you change your code and want to continue from where your last job run left off, because the checkpoint stores the code along with the state.

Are you actually looking for persist or cache on an RDD instead?
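If that is the case, a rough sketch could look like the following. Note that this is a different technique from reading checkpoint files: persist keeps the RDD around only for the lifetime of the application, and saveAsObjectFile / objectFile is what I would reach for if the data has to be read again from a new session. The object name PersistSketch, the method roundTrip, and the output path are hypothetical, and the output path must not already exist.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical helper object, named for illustration only.
object PersistSketch {

  // Persists the RDD for reuse, writes it out, and reads it back.
  def roundTrip(outputPath: String): List[Int] = {
    // local[2] is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val z = sc.parallelize(List(1, 2, 3, 4), 2).map(_ * 2)

      // Keep z in memory (spilling to disk) for later actions in this application.
      z.persist(StorageLevel.MEMORY_AND_DISK)
      z.count()

      // Write z as serialized objects; this fails if outputPath already exists.
      z.saveAsObjectFile(outputPath)

      // The same call works from a fresh application or console session.
      sc.objectFile[Int](outputPath).collect().sorted.toList
    } finally {
      sc.stop()
    }
  }
}
```

From a reopened console, sc.objectFile[Int] with the same path gives the RDD back without depending on the checkpoint directory layout.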

  • I have read in many places that checkpointed data can be read after the application has completed, so I wrote the code above and checkpointed z. After z.count it created the directory "chk" and, inside it, a second directory with a long name. Inside that long-named directory it created a directory rdd-2, which contained a part-00000 file. I then closed the Scala console and reopened it, but I was not able to read the RDD data in part-00000. So I want to know how to read an RDD from the part-00000 file. I am just researching, as it is mentioned everywhere that we can read the RDD after the application has completed. – sraj Aug 29 '16 at 10:52
  • Hi ramkumar, so is it feasible to read data from the checkpoint location? – sraj Aug 30 '16 at 03:07