
I would like a way to return rows from my RDD one at a time (or in small batches) so that I can collect the rows locally as I need them. My RDD is large enough that it cannot fit into memory on the driver, so calling collect() would fail with an out-of-memory error.

Is there a way to recreate the collect() operation, but with a generator, so that rows from the RDD are passed into a buffer? Another option would be to take() 100000 rows at a time from a cached RDD, but take() doesn't appear to let you specify a start position.

mgoldwasser
  • Is there something that makes you want to avoid saveAsTextFile? You could flush everything to a file and then read it back through a buffer. – Paul K. Nov 19 '15 at 19:18
  • @paul-k I currently use saveAsTextFile, but this has a couple of problems: 1) reading is slow, because these are text files, and 2) I lose information about datatypes, so '1' becomes the same as 1 – mgoldwasser Nov 19 '15 at 19:44
  • That is true, 2) is still an issue, but you could write the type information as well, even though this is not very economical in terms of file size. You could also call saveAsPickleFile to serialize objects. As for 1), I don't think this would be slower than the actual implementation of collect(), since it reads from a temp file according to the docs: https://spark.apache.org/docs/0.7.2/api/pyspark/pyspark.rdd-pysrc.html#RDD.collect – Paul K. Nov 19 '15 at 19:55
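
For reference, a minimal sketch of the pickle round trip suggested in the comments (saveAsPickleFile and SparkContext.pickleFile are standard PySpark methods; the path here is hypothetical):

rdd = sc.parallelize([1, '1', (2, 3)])
rdd.saveAsPickleFile('/tmp/my_rdd')  # serializes objects, preserving Python types

restored = sc.pickleFile('/tmp/my_rdd')
restored.collect()

## [1, '1', (2, 3)]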

1 Answer


The best available option is to use RDD.toLocalIterator, which collects only a single partition at a time. It creates a standard Python generator:

rdd = sc.parallelize(range(100000))
iterator = rdd.toLocalIterator()
type(iterator)

## generator

even = (x for x in iterator if not x % 2)

You can adjust the amount of data collected in a single batch by using a specific partitioner and adjusting the number of partitions.
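
For example, a rough sketch (the partition count of 100 is arbitrary; each batch then holds roughly 1,000 elements):

rdd = sc.parallelize(range(100000), 100)  # ~1000 elements per partition

total = 0
for x in rdd.toLocalIterator():
    total += x  # only one partition's worth of data is local at any point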

Unfortunately, this comes at a price: collecting small batches requires launching multiple Spark jobs, which is quite expensive. So, generally speaking, collecting one element at a time is not a practical option.
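
If you need take()-like access with a start position, as the question asks, one workaround (a sketch, not part of the original answer; each call still launches a full Spark job) is to index the RDD with zipWithIndex and filter by range:

indexed = rdd.zipWithIndex().cache()  # assign a stable index to every element

def take_range(start, count):
    # Collect rows with index in [start, start + count)
    return (indexed
            .filter(lambda pair: start <= pair[1] < start + count)
            .map(lambda pair: pair[0])
            .collect())

batch_one = take_range(0, 100000)
batch_two = take_range(100000, 100000)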

zero323
  • Just wanted to add a small note: this works great with `glom()` if you want an iterator that returns one list per partition. – numeral Oct 18 '16 at 21:23
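
A minimal sketch of that combination (the process function is hypothetical):

for batch in rdd.glom().toLocalIterator():  # glom() turns each partition into one list
    process(batch)  # hypothetical per-batch handler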