
In PySpark, what would be the easiest way to save an RDD to a file, so that I can later read the file back as an RDD?

I saw a lot of save methods, such as saveAsPickleFile, saveAsSequenceFile, saveAsTextFile, etc., but not much mention of how to read these files back into an RDD. Please advise.

Thanks!

Edamame
  • Do you want to save to hdfs or to a local file? – mattinbits Oct 17 '16 at 20:12
  • @Edamame use approach used here: http://stackoverflow.com/questions/40069264/how-can-i-save-an-rdd-into-hdfs-and-later-read-it-back/40069468#40069468 Just put part-* as file expression, nothing special is needed :) Works both with sequenceFiles and textFiles – T. Gawęda Oct 17 '16 at 20:17
  • @Edamame And, as mentioned there, consider using DataFrames if you have some schema - DataFrames are a lot faster – T. Gawęda Oct 17 '16 at 20:18
  • @mattinbits It doesn't matter, change `file://` to `hdfs://` and will work ;) – T. Gawęda Oct 17 '16 at 20:25
  • 1
    @T.Gawęda that is not exactly true. If you are working on a distributed cluster, `file://` will cause the worker to save it's part of the RDD to the local file system of the worker. A "local file" in this context means a file local to the driver. That is, does the asker want to collect the data and save it to a file local to the driver program or does he want to save the RDD in a distributed fashion. – mattinbits Oct 18 '16 at 08:31
  • @mattinbits You're right, my thinking went the opposite way (if it works on local storage, it will work distributed). However, saving to local storage is not recommended – T. Gawęda Oct 18 '16 at 08:42

0 Answers