
I am using PySpark to process my data, and at the very end I need to collect the data from the RDD using rdd.collect(). However, my Spark job crashes because it runs out of memory. I have tried a number of approaches, but no luck. I am currently running the following code, which processes a small chunk of data for each partition:

def make_part_filter(index):
    # Return a filter that keeps only the rows belonging to partition `index`.
    def part_filter(split_index, iterator):
        if split_index == index:
            for el in iterator:
                yield el
    return part_filter


for part_id in range(rdd.getNumPartitions()):
    part_rdd = rdd.mapPartitionsWithIndex(make_part_filter(part_id), True)
    myCollection = part_rdd.collect()
    for row in myCollection:
        pass  # do something with each row

The new code does not crash, but it seems to run forever.

Is there a better way to collect data from a large RDD?

JamesLi
  • Instead of running a for loop on the list format of the RDD, why don't you run a map function instead? – Saif Charaniya May 21 '16 at 22:43
  • Actually, I need to collect all the data in the RDD, store it in a large array, and then feed it to a machine learning module. – JamesLi May 21 '16 at 23:21
  • Perchance does the machine learning module accept an iterator, or does it really want an array? With an iterator you could avoid having to load all the data into memory at once (see the sketch after these comments). Even then, I'd be worried about performance since I'm assuming the machine learning module is going to "eat" the data with a single thread. – E.F.Walker May 21 '16 at 23:39
  • What machine learning algorithm are we talking about? The idea of spark is that you run it in a distributed fashion. – z-star May 22 '16 at 03:45
  • I'm really surprised this isn't built in or at least asked around more. I have a huge RDD that I want to append to a file on disk, but my master node can't fit the whole thing in memory! – sudo Apr 18 '17 at 00:11
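
One way to act on the iterator suggestion above is RDD.toLocalIterator(), which streams the RDD back to the driver one partition at a time, so only a single partition ever needs to fit in driver memory. A minimal sketch, assuming the machine learning module can consume rows incrementally (feed_row is a hypothetical stand-in):

def feed_row(row):
    pass  # hypothetical: hand each row to the ML module incrementally

# toLocalIterator() pulls one partition at a time to the driver,
# so only the largest partition has to fit in driver memory.
for row in rdd.toLocalIterator():
    feed_row(row)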

2 Answers


I don't know if this is the best approach, but it's the best one I've tried. Not sure if it's better or worse than yours. It's the same idea of splitting the data into chunks, but you can be more flexible with the chunk size.

def rdd_iterate(rdd, chunk_size=1000000):
    # Pair each row with its index and cache the result, so the indexing is
    # not recomputed for the count() below and for every chunk's filter().
    indexed_rows = rdd.zipWithIndex().cache()
    count = indexed_rows.count()
    print("Will iterate through RDD of count {}".format(count))
    start = 0
    end = start + chunk_size
    while start < count:
        print("Grabbing new chunk: start = {}, end = {}".format(start, end))
        # Only chunk_size rows are pulled to the driver at a time.
        chunk = indexed_rows.filter(lambda r: start <= r[1] < end).collect()
        for row in chunk:
            yield row[0]
        start = end
        end = start + chunk_size

Example usage where I want to append a huge RDD to a CSV file on disk without ever populating a Python list with the entire RDD:

def rdd_to_csv(fname, rdd):
    import csv, os
    # Append mode; expand "~" since open() does not do that itself.
    with open(os.path.expanduser(fname), "a") as f:
        writer = csv.writer(f)
        for row in rdd_iterate(rdd):  # iterates through the entire RDD in chunks
            writer.writerow(row)

rdd_to_csv("~/test.csv", my_really_big_rdd)
sudo

Trying to "collect" a huge RDD is problematic. "Collect" returns a list, which implies the entire RDD content has to be stored in the driver's memory. This is a "showstopper" problem. Typically one wants a Spark application to be able to process data sets whose size is well beyond what would fit in a single node's memory.

Let's suppose the RDD barely fits into memory, and "collect" works. Then we have another "showstopper" --- low performance. In your code, the collected RDD is processed in a loop: "for row in myCollection". This loop is executed by exactly one core. So instead of processing the data via an RDD, whose computations get distributed amongst all the cores of the cluster, of which there are probably 100's if not 1000's --- instead all the work on the entire dataset is placed on the back of a single core.
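
To make the contrast concrete, here is a minimal sketch of keeping the per-row work on the cluster instead of in a driver-side loop; process_row is a hypothetical stand-in for whatever "do something with each row" means:

# Hypothetical per-row function standing in for "do something with each row".
def process_row(row):
    return row

# The map runs on the executors, spread across every partition and core,
# instead of in a single-threaded loop on the driver.
processed = rdd.map(process_row)

# Bring back only a small result (a count, an aggregate, a sample), or write
# the full output from the executors, rather than collecting everything.
print(processed.count())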

E.F.Walker