3

Identify a partition:

mapPartitionsWithIndex(index, iter)

This method applies a function to each partition. I understand that we can track the partition using the `index` parameter.

Numerous examples use this method to remove the header of a data set with an `index == 0` condition. But how can we be sure that the partition with `index` equal to 0 is indeed the one that contains the header? Isn't the index random, or based on the partitioner if one is used?
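The header-removal pattern referred to above typically looks like the following sketch. The names `DropHeaderDemo` and `dropHeader` and the temporary CSV file are made up for illustration, and the sketch assumes a single input file read with `sc.textFile` on a local master.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DropHeaderDemo {
  // Skips the first line of partition 0 and passes all other partitions through.
  // Assumption: the header is the first line of the partition with index 0.
  def dropHeader(lines: RDD[String]): RDD[String] =
    lines.mapPartitionsWithIndex { (index, iterator) =>
      if (index == 0) iterator.drop(1) else iterator
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("drop-header-demo")
    val sc = new SparkContext(conf)

    // Hypothetical sample file, created only for this demonstration.
    val path = Files.createTempFile("drop-header-demo", ".csv")
    Files.write(path, "id,name\n1,a\n2,b\n3,c\n".getBytes(StandardCharsets.UTF_8))

    val rows = dropHeader(sc.textFile(path.toUri.toString, 2)).collect()
    rows.foreach(println) // prints the three data rows without "id,name"

    sc.stop()
  }
}
```

If the input is a directory or glob of several files that each carry their own header, this sketch would presumably drop only the header that lands in partition 0.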

user4157124
Kanav Sharma

1 Answer

7

Isn't it random or based upon the partitioner, if used?

It is not random; the index is the partition number. The simple example below illustrates this:

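val base = sc.parallelize(1 to 100, 4)  // 100 numbers spread over 4 partitions
// Tag every element with the index of the partition that holds it
base.mapPartitionsWithIndex((index, iterator) => {
  iterator.map { x => (index, x) }
}).foreach { x => println(x) }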

Result : (0,1) (1,26) (2,51) (1,27) (0,2) (0,3) (0,4) (1,28) (2,52) (1,29) (0,5) (1,30) (1,31) (2,53) (1,32) (0,6) ... ...
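Because `foreach` prints from several partitions at once, the interleaved output above makes the grouping hard to see. As a rough sketch (the object name `PartitionRangesDemo` and the helper `summarize` are invented here, and a local master is assumed), collecting the first and last element of each partition shows the ranges directly:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionRangesDemo {
  // Returns one (partitionIndex, firstElement, lastElement) triple
  // for every non-empty partition, in partition order.
  def summarize(sc: SparkContext): Array[(Int, Int, Int)] = {
    val base = sc.parallelize(1 to 100, 4)
    base.mapPartitionsWithIndex { (index, iterator) =>
      val xs = iterator.toVector
      if (xs.isEmpty) Iterator.empty
      else Iterator((index, xs.head, xs.last))
    }.collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("partition-ranges-demo")
    val sc = new SparkContext(conf)

    summarize(sc).foreach(println)
    // (0,1,25)
    // (1,26,50)
    // (2,51,75)
    // (3,76,100)

    sc.stop()
  }
}
```

For a range passed to `parallelize`, the elements are sliced by position, so partition 0 holds the first 25 numbers here. This says nothing about RDDs that have been shuffled or repartitioned afterwards.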

Balaji Reddy
  • As I understand it, the numbers from 1 to 25 are in one partition, with index equal to 0. My question is: is it guaranteed that the first 25 numbers are grouped together and that they go into partition 0? @bdr – Kanav Sharma Jun 13 '17 at 13:49
  • It depends on how your data is partitioned. In my example it's just numbers, so it's 100/4. But in the case of strings, a hash partitioner is used. The bottom line is that it depends on your partitioner. For paired RDDs, the default partitioner is the hash partitioner. – Balaji Reddy Jun 13 '17 at 14:02
  • So, unless implemented/stated otherwise, it is safe to assume that `index=0` will give the first line. @BDR – Kanav Sharma Jun 13 '17 at 14:11