
I have an RDD[String] and I want to shuffle all of its rows. How can I achieve this?

For example:

Given an RDD named rdd, running rdd.collect.foreach(t => println(t)) prints:

1

2

3

4

I want to shuffle the rows of rdd so that running rdd.collect.foreach(t => println(t)) after the shuffle prints something like:

3

4

1

2

user3494047

1 Answer


You aren't really shuffling the RDD. It doesn't make much conceptual sense to shuffle an RDD directly since the data is partitioned and there are no guarantees about order in that case. You can look into a custom partitioner if that's the route you'd like to take.
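If the shuffled order does need to live in the RDD itself, one common workaround is to attach a random key to every row and sort by that key. This is a different technique from the custom partitioner mentioned above, and it is only a minimal sketch: the object name `ShuffleRddSketch`, the helper `shuffleRows`, and the local `SparkSession` setup are all hypothetical names introduced for illustration, and the code assumes Spark 2.x or later is on the classpath.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.util.Random

object ShuffleRddSketch {
  // Hypothetical helper: pairs each row with a random Double, sorts by it,
  // then drops the key again. The result has the same rows in a random order.
  def shuffleRows(rdd: RDD[String]): RDD[String] = {
    // cache() is a best-effort attempt to keep the random keys stable,
    // because sortByKey evaluates its input more than once (sampling, then sorting).
    val keyed = rdd.map(row => (Random.nextDouble(), row)).cache()
    keyed.sortByKey().values
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("shuffle-rdd-sketch")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(Seq("1", "2", "3", "4"))
    shuffleRows(rdd).collect().foreach(println)

    spark.stop()
  }
}
```

The sort triggers a full Spark shuffle across the cluster, so this is considerably more expensive than shuffling a collected array, and it is only worth doing when the data is too large to collect on the driver.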

Once you call collect(), however, the data is no longer an RDD: it is a local Scala Array on the driver. You can shuffle that with the standard collection libraries:

scala.util.Random.shuffle(rdd.collect().toList).foreach(println)
micker