I would like a way to return rows from my RDD one at a time (or in small batches) so that I can collect the rows locally as I need them. My RDD is too large to fit into memory on the driver node, so running collect() would cause an error.
Is there a way to recreate the collect() operation as a generator, so that rows from the RDD are passed into a buffer as they are needed? Another option would be to take() 100000 rows at a time from a cached RDD, but as far as I can tell take() does not let you specify a start position.
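
To make the idea concrete, here is a rough sketch of the kind of interface I am after. The helper name `iter_batches` is hypothetical, and the commented usage assumes that PySpark's `RDD.toLocalIterator()` could supply the rows lazily; as I understand it, that call brings one partition at a time to the driver, so each partition would still need to fit in memory there.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of up to batch_size items from any iterable of rows.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(rows)
    while True:
        # Pull at most batch_size rows without consuming the rest.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Assumed usage with an existing RDD named `rdd` (not run here):
# for batch in iter_batches(rdd.toLocalIterator(), 100000):
#     process(batch)  # `process` is a placeholder for local handling
```

The batching itself is plain Python, so the open part of the question is only what should produce the row iterator on the Spark side.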