
I've come across the glom() method on RDD. As per the documentation

Return an RDD created by coalescing all elements within each partition into an array

Does glom shuffle the data across the partitions or does it only return the partition data as an array? In the latter case, I believe that the same can be achieved using mapPartitions.

I would also like to know if there are any use cases that benefit from glom.

malana
nagendra
  • @zero323 explained it nicely ... wanted to add an important tip, i.e. `glom` is useful when you want to implement RDD operations using matrix libraries that are optimized to operate on arrays – Ram Ghadiyaram Sep 06 '17 at 20:10

4 Answers


Does glom shuffle the data across partitions

No, it doesn't.

If this is the second case I believe that the same can be achieved using mapPartitions

It can:

rdd.mapPartitions(iter => Iterator(iter.toArray))

but the same thing applies to any non-shuffling transformation like `map`, `flatMap` or `filter`.

if there are any use cases which benefit from glom.

Any situation where you need to access partition data in a form that is traversable more than once.
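The glom/mapPartitions equivalence can be illustrated without a cluster. Below is a minimal pure-Python sketch (no Spark required) that models an RDD's data as a list of partitions; the names `partitions`, `glom` and `map_partitions_to_array` are illustrative stand-ins, not Spark API:

```python
# Model an RDD's data as a list of partitions, each partition a list.
partitions = [[1, 2], [3, 4, 5]]

# glom: each partition becomes a single array-like element.
def glom(parts):
    return [list(p) for p in parts]

# mapPartitions with iter => Iterator(iter.toArray) does the same thing:
# one output element (the materialized partition) per input partition.
def map_partitions_to_array(parts):
    return [list(iter(p)) for p in parts]

assert glom(partitions) == map_partitions_to_array(partitions)
print(glom(partitions))  # [[1, 2], [3, 4, 5]]
```

Either way, each partition is materialized into a collection that can be traversed more than once, which a bare Iterator cannot be.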

zero323
  • Can't we traverse the output of mapPartitions or map or filter more than once? – nagendra Mar 02 '16 at 05:20
  • Not exactly what I mean. Let's say you have a function `(vs: T) => for { x <- vs; y <- vs } yield (x, y)` and you want to apply it to complete partitions. You can simply call `rdd.glom.map(f)` instead of converting the Iterator inside mapPartitions. But in general it is not a crucial function. – zero323 Mar 02 '16 at 05:27
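The comment's example needs each partition as a re-traversable collection, because the self-join traverses the same data twice. A small pure-Python sketch of the idea (plain lists stand in for glommed partitions; nothing here is Spark API):

```python
# f traverses its input twice: a cartesian product of the
# partition with itself, like the comment's for-comprehension.
def f(vs):
    return [(x, y) for x in vs for y in vs]

# In Spark this would be rdd.glom().map(f); here each partition
# is already a list, so f applies directly.
glommed = [[1, 2], [3]]  # stand-in for the result of glom()
result = [f(p) for p in glommed]
print(result)  # [[(1, 1), (1, 2), (2, 1), (2, 2)], [(3, 3)]]
```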

glom() transforms each partition into a single collection of its elements (an array in Scala, a list in PySpark). It creates an RDD with one such collection per partition.

kriti arora

In general, Spark does not allow a worker to refer to specific elements of an RDD. This keeps the language clean, but can be a major limitation. glom() transforms each partition into a list of its elements, creating an RDD with one such list per partition. Workers can then refer to elements of the partition by index, but you cannot assign values to the elements; the RDD is still immutable. This explains how to count the number of elements in each partition: use glom() to turn each partition into a list, apply len to each list to get the size of that partition, and collect the results to print them out.
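The per-partition count described above corresponds to `rdd.glom().map(len).collect()` in PySpark. A pure-Python sketch of the same steps, with partitions simulated as plain lists (no Spark required):

```python
# Simulated partitioned data: two partitions of different sizes.
partitions = [[10, 20, 30], [40, 50]]

# Step 1 (glom): one list per partition.
glommed = [list(p) for p in partitions]

# Step 2 (map(len)): the length of each list is the partition size.
sizes = [len(p) for p in glommed]

# Step 3 (collect): bring the results back and print them.
print(sizes)  # [3, 2]
```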


The glom() function returns an RDD created by grouping all the elements within each partition into a list. For example:

rdd = sc.parallelize([1, 2, 3, 4], 2)
sorted(rdd.glom().collect())
# [[1, 2], [3, 4]]

Brad