
In PySpark, what would be the easiest way to save an RDD to a file, so that I can later read the file back as an RDD?

I saw a lot of save methods, such as saveAsPickleFile, saveAsSequenceFile, saveAsTextFile, etc., but not much mention of how to read these files back into an RDD. Please advise.

Thanks!

Edamame
  • Do you want to save to hdfs or to a local file? – mattinbits Oct 17 '16 at 20:12
  • @Edamame use approach used here: http://stackoverflow.com/questions/40069264/how-can-i-save-an-rdd-into-hdfs-and-later-read-it-back/40069468#40069468 Just put part-* as file expression, nothing special is needed :) Works both with sequenceFiles and textFiles – T. Gawęda Oct 17 '16 at 20:17
  • @Edamame And, as mentioned there, consider using DataFrames if you have some schema - DataFrames are a lot faster – T. Gawęda Oct 17 '16 at 20:18
  • @mattinbits It doesn't matter, change `file://` to `hdfs://` and will work ;) – T. Gawęda Oct 17 '16 at 20:25
  • 1
    @T.Gawęda that is not exactly true. If you are working on a distributed cluster, `file://` will cause the worker to save it's part of the RDD to the local file system of the worker. A "local file" in this context means a file local to the driver. That is, does the asker want to collect the data and save it to a file local to the driver program or does he want to save the RDD in a distributed fashion. – mattinbits Oct 18 '16 at 08:31
  • @mattinbits You're right, my thinking went the opposite way (if it works on local storage, it will work distributed). However, saving to local storage is not recommended – T. Gawęda Oct 18 '16 at 08:42

0 Answers