
Suppose we have an uncompressed file stored as 320 HDFS blocks on a 16-data-node cluster, 20 blocks per node, and we use Spark to read this file into an RDD without explicitly passing numPartitions when creating it:

textFile = sc.textFile("hdfs://input/war-and-peace.txt")

If we have 16 executors, one on each node, how many partitions will the RDD have per executor? Will it create one partition per HDFS block, i.e. 20 partitions per executor?

zoe

1 Answer


If the file is stored in 320 HDFS blocks, then the following code will create an RDD with 320 partitions:

val textFile = sc.textFile("hdfs://input/war-and-peace.txt")

The textFile() method results in an RDD with the same number of partitions as the number of HDFS blocks the file is stored in.
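A minimal way to confirm this (a sketch, assuming the same SparkContext sc and HDFS path as above) is to ask the RDD for its partition count:

// Spark creates one partition per HDFS block by default.
val textFile = sc.textFile("hdfs://input/war-and-peace.txt")

// For a file stored in 320 blocks, this prints 320.
println(textFile.partitions.length)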

You may look at this related question, which may resolve your remaining questions about partitioning.

bob
  • What does 320 RDD partitions mean? My understanding was that, in the scenario I mentioned, it would create a total of 16 distributed RDDs, one on each slave node, and each RDD on each slave node would contain that node's 20 HDFS blocks. Is that wrong? – zoe Dec 08 '16 at 22:52
  • It will not create 16 RDDs; it will create just one RDD with 320 partitions, and the partitions will be distributed across the slave nodes (see the sketch after these comments). – bob Dec 09 '16 at 05:13
  • Your answer is probably correct, but I still don't understand. You mentioned that "the partitions will be distributed across the slave nodes". If it were a Scala or Java collection I would get it, but my question is: why does Spark distribute the file further when HDFS files are already distributed as blocks on the slave nodes? Part two of the question: where does this single RDD of 320 partitions live? On the master, or on one of the slave nodes? – zoe Dec 09 '16 at 10:01
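A minimal sketch illustrating bob's point (assuming the same SparkContext sc and file as above): there is one logical RDD, which is just a description held by the driver, and its partitions, each covering one HDFS block, are what get computed on the executors. mapPartitionsWithIndex makes the one-RDD-many-partitions model visible:

val textFile = sc.textFile("hdfs://input/war-and-peace.txt")

// One RDD, many partitions: count the lines held by each partition index.
val linesPerPartition = textFile
  .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
  .collect()

// Each of the 320 partitions is computed on whichever executor holds its block.
linesPerPartition.foreach { case (idx, n) => println(s"partition $idx: $n lines") }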