
I am trying to understand how partitioning in Spark works for non-key-value pair records. The command is as follows:

val FileRDD= sc.textFile("hdfs://nameservice1:8020/apps/file/outbound/terms/processed/HDFC_outbound/HDFC_EXTRACT_604/HDFC_RBI_20180205.dat");

The file size is 512 bytes, and I have not configured any partitioning (partitioner = None). Yet when I run `FileRDD.partitions.size`, I get 2 partitions. I would like to understand why there are two partitions.
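For context, `textFile` takes a second argument, `minPartitions`, which defaults to `sc.defaultMinPartitions = math.min(defaultParallelism, 2)`, so with two or more cores even a tiny file is split into at least 2 partitions. A minimal sketch of that default (the function below is a stand-in for illustration, not Spark's actual code):

```scala
// Sketch of how SparkContext picks textFile's default minimum partitions.
// In Spark: defaultMinPartitions = math.min(defaultParallelism, 2)
def defaultMinPartitions(defaultParallelism: Int): Int =
  math.min(defaultParallelism, 2)

// With, say, 4 cores available (defaultParallelism = 4),
// even a 512-byte file is asked to be split into at least 2 partitions:
defaultMinPartitions(4)  // 2
// With a single core it stays at 1:
defaultMinPartitions(1)  // 1
```

Passing an explicit minimum, e.g. `sc.textFile(path, 1)`, should yield a single partition for a file this small.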

Karthi
  • Spark tries to avoid data movement between nodes on read, which means it reads parts of files on the machines where those parts are located. Probably your file has two parts, and thus you get 2 partitions. You can check this by inspecting how many blocks the file has; [here](https://stackoverflow.com/questions/29143289/hadoop-hdfs-command-to-see-how-a-files-splits) is a guide. For bigger files there might also be more than one partition per node, I think. – Vladislav Varslavans May 15 '18 at 07:09
  • Spark uses the HDFS InputFormat APIs under the hood. The number of partitions is based on the block size, i.e. the physical division of the data. Further splitting the data within partitions is also possible: Spark provides transformations like repartition, coalesce, and repartitionAndSortWithinPartitions, which give you direct control over the number of partitions being computed. – wandermonk May 15 '18 at 08:09
  • But the total size of the file is just 512 bytes, so ideally this should be 1 split. – Karthi May 15 '18 at 08:30
  • [Default Partitioning Scheme in Spark](https://stackoverflow.com/q/34491219/9613318) – Alper t. Turker May 15 '18 at 09:35
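To connect the comments above: `textFile` goes through Hadoop's FileInputFormat, whose split size is roughly `max(minSize, min(totalSize / numSplits, blockSize))`, where `numSplits` is the `minPartitions` Spark passes in. A simplified sketch of that arithmetic (a hypothetical helper, not the actual Hadoop code), assuming a 128 MB block size:

```scala
// Simplified sketch of FileInputFormat's split sizing:
// splitSize = max(minSize, min(goalSize, blockSize)), goalSize = totalSize / numSplits
def splitSize(totalSize: Long, numSplits: Int, blockSize: Long, minSize: Long = 1L): Long = {
  val goalSize = totalSize / math.max(numSplits, 1)
  math.max(minSize, math.min(goalSize, blockSize))
}

// 512-byte file, numSplits = 2 (Spark's defaultMinPartitions), 128 MB block:
splitSize(512L, 2, 128L * 1024 * 1024)  // 256 -> two 256-byte splits, hence 2 partitions

// With numSplits = 1 the whole file fits in a single split:
splitSize(512L, 1, 128L * 1024 * 1024)  // 512
```

So the two partitions come from the requested number of splits, not from the file occupying two HDFS blocks.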

0 Answers