
Suppose we have a file on HDFS that occupies 3 blocks (64 MB each). When we create an RDD from that same file with 3 partitions, will each node in the cluster (assume the cluster has 3 data nodes) end up holding duplicate file contents, i.e. one HDFS block plus one RDD partition?
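For concreteness, I mean something like this in spark-shell, where sc is the SparkContext (path and sizes are just examples):

    // example: a ~192 MB file -> 3 HDFS blocks of 64 MB each
    val rdd = sc.textFile("hdfs:///data/sample-192mb.txt", 3) // second argument is the minimum number of partitions
    println(rdd.getNumPartitions)                             // expect 3, roughly one per block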

Abhinav Kumar

1 Answer


In HDFS, blocks are distributed more or less randomly (by default, and assuming the client that writes the file is not itself part of the cluster), so you cannot be sure that every node holds exactly one block unless you use a replication factor of 3. In that case, every block is placed on three nodes.
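If you want to check where the blocks of a given file actually landed, one option is Hadoop's FileSystem API. A minimal sketch (the path is hypothetical, and the Configuration is assumed to pick up your core-site.xml/hdfs-site.xml):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object BlockLocations {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()            // reads cluster config from the classpath
        val fs   = FileSystem.get(conf)
        val path = new Path("/data/sample-192mb.txt") // hypothetical file from the question

        val status    = fs.getFileStatus(path)
        val locations = fs.getFileBlockLocations(status, 0, status.getLen)

        // one line per block: offset, length and the datanodes holding its replicas
        locations.foreach { block =>
          println(s"offset=${block.getOffset} length=${block.getLength} hosts=${block.getHosts.mkString(",")}")
        }
      }
    }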

Regarding Spark: by default it tries to read data into an RDD on the nodes that are close to that data (data locality), and it tries to spread the RDD partitions across the cluster.
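You can see this locality preference from the RDD itself. A sketch, again assuming spark-shell with an existing SparkContext sc and a placeholder path:

    val rdd = sc.textFile("hdfs:///data/sample-192mb.txt", 3)

    // each partition maps to an input split; preferredLocations reports
    // the datanodes that hold the replicas of that split's block
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
    }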

So your assumption does not always hold; you have to consider how HDFS distributed the blocks, the replica placement strategy, where the Spark executors run, and so on. It would be true, however, if HDFS uses a replication factor of 3 and your Spark cluster has 3 workers, one on every node in the cluster.

gasparms