
I have an HDFS folder with two 250 MB Parquet files. The HDFS block size is set to 128 MB. Given the following code:

    JavaSparkContext sparkContext = new JavaSparkContext();

    SQLContext sqlContext = new SQLContext(sparkContext);
    DataFrame dataFrame = sqlContext.read().parquet("hdfs:////user/test/parquet-folder");
    LOGGER.info("Nr. of rdd partitions: {}", dataFrame.rdd().getNumPartitions());

    sparkContext.close();

I run it on the cluster with spark.executor.instances=3 and spark.executor.cores=4. I can see that the reading of the parquet files is split among 3 executors X 4 cores = 12 tasks:

   spark.SparkContext: Starting job: parquet at VerySimpleJob.java:25
   scheduler.DAGScheduler: Got job 0 (parquet at VerySimpleJob.java:25) with 12 output partitions

However, when I get the DataFrame's RDD (or create the RDD with a toJavaRDD() call), I get only 4 partitions. Is this controlled by the HDFS block size - 2 blocks for each file, hence 4 partitions?

Why exactly doesn't this match the number of partitions from the parquet (parent?) operation?
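For completeness, a minimal sketch (hypothetical class name, same Spark 1.6-style API as above) that checks both calls side by side; since toJavaRDD() just wraps the same underlying RDD, both report the same 4 partitions:

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class PartitionCheck {
        public static void main(String[] args) {
            JavaSparkContext sparkContext = new JavaSparkContext();
            SQLContext sqlContext = new SQLContext(sparkContext);

            DataFrame dataFrame = sqlContext.read().parquet("hdfs:////user/test/parquet-folder");

            // toJavaRDD() wraps the same underlying RDD, so the two counts match.
            System.out.println("rdd():       " + dataFrame.rdd().getNumPartitions());
            System.out.println("toJavaRDD(): " + dataFrame.toJavaRDD().getNumPartitions());

            sparkContext.close();
        }
    }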

cristi.calugaru
  • Answered below, but overall you're right - it's all about HDFS block size. – Zyoma Jun 28 '17 at 21:42
  • Based on @Zyoma's suggestions, I've updated the code to try to force smaller splits, which would give more input partitions for the data frame. The following configurations have been changed: **parquet.block.size, mapred.max.split.size, mapred.min.split.size**, all set to Long.toString(8 * 1024 * 1024L) (sketched just below). This *still* gives me back 4 partitions. – cristi.calugaru Jun 28 '17 at 23:05
  • Did you ever find out how to get more partitions after the toJavaRDD call? – Xiawei Zhang Mar 13 '19 at 03:36
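For reference, a rough sketch (class name and values are illustrative) of the configuration changes described in the comment above; as noted there, the read still came back with 4 partitions in these tests:

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class SplitSizeExperiment {
        public static void main(String[] args) {
            JavaSparkContext sparkContext = new JavaSparkContext();

            String eightMb = Long.toString(8 * 1024 * 1024L);
            // Split-size hints picked up by Hadoop input formats.
            sparkContext.hadoopConfiguration().set("mapred.max.split.size", eightMb);
            sparkContext.hadoopConfiguration().set("mapred.min.split.size", eightMb);
            // Parquet row-group size; this affects files being written, not reads.
            sparkContext.hadoopConfiguration().set("parquet.block.size", eightMb);

            SQLContext sqlContext = new SQLContext(sparkContext);
            DataFrame dataFrame = sqlContext.read().parquet("hdfs:////user/test/parquet-folder");
            System.out.println("Partitions: " + dataFrame.rdd().getNumPartitions());

            sparkContext.close();
        }
    }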

1 Answer


When you're reading a file with Spark, neither the number of executors nor the number of cores affects the number of tasks in any way. The number of partitions (and, as a result, the number of tasks) is determined only by the number of blocks in your input. If you had 4 files that were each smaller than the HDFS block size, that would be 4 blocks anyway and therefore 4 partitions. The formula is number_of_files * number_of_blocks_in_file. So look into your folder, count how many files it contains and check the size of each file. That should answer your question.
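To make the arithmetic concrete, here is a rough sketch (hypothetical class name, same folder as in the question) that uses the Hadoop FileSystem API to count the HDFS blocks behind each input file; by the formula above, the total is the partition count to expect from the read:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockCounter {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            long totalBlocks = 0;
            for (FileStatus status : fs.listStatus(new Path("/user/test/parquet-folder"))) {
                if (!status.isFile()) {
                    continue;
                }
                // Blocks occupied by this file, e.g. a 250 MB file with a
                // 128 MB block size spans 2 blocks.
                long blocks = (long) Math.ceil((double) status.getLen() / status.getBlockSize());
                System.out.println(status.getPath() + ": " + blocks + " block(s)");
                totalBlocks += blocks;
            }
            System.out.println("Expected read partitions: " + totalBlocks);
            fs.close();
        }
    }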

UPD: everything above holds provided you didn't manually repartition your DataFrame and your DataFrame wasn't created as the result of a join or any other shuffle operation.

UPD: fixed answer details.

Zyoma
  • My folder contains 2 files, each 250 MB. So basically you are saying that there is no way to have more partitions than the number of blocks (in this case, 4 blocks of 128 MB)? Why do I then see 12 tasks created when reading the file initially? Or is my interpretation of what those 12 tasks are wrong? Here: https://stackoverflow.com/questions/27194333/how-to-split-parquet-files-into-many-partitions-in-spark someone suggests writing the parquet file with a smaller parquet.block.size might do the trick - but I've tried setting that with no luck. – cristi.calugaru Jun 28 '17 at 21:42
  • Correct. You can always force the number of partitions using the **repartition** method (see the sketch at the end of this thread). – Zyoma Jun 28 '17 at 21:44
  • I know repartition is an option, but that triggers a shuffle, which is not optimal. I've got many more cores * executors in the cluster, which I want to make good use of, ideally by getting more partitions from the initial read operation. – cristi.calugaru Jun 28 '17 at 21:48
  • How did you set the "parquet.block.size" property? Like this: **sparkContext.hadoopConfiguration().set("parquet.block.size", size)**? – Zyoma Jun 28 '17 at 21:51
  • Anyway. It's absolutely unrelated. The parquet block size controls the block size of data in memory while writing the parquet file. It affects the compression rate and some other factors but won't affect partitions dramatically. – Zyoma Jun 28 '17 at 21:53
  • The question is rather how did you create those files? Perhaps you can adjust something during the writing process. – Zyoma Jun 28 '17 at 21:54
  • I have created the files doing: `dataFrame.write().mode(SaveMode.Overwrite).save(parquetOutputLocation);` where the dataframe was created doing `sqlContext.createDataFrame(myJavaRDD, MyBean.class)`. Before saving, I have set `sparkContext.hadoopConfiguration().set("parquet.block.size", Long.toString(8 * 1024 * 1024L));`. Another thing I found on Spark's Jira, where they advocate for the same thing: https://issues.apache.org/jira/browse/SPARK-10143?focusedCommentId=14707500&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14707500 – cristi.calugaru Jun 28 '17 at 22:09
  • I'm sorry, I've confused this value with **parquet.page.size**. Yeah, this should help. Have you tried to set smaller values for **mapred.max.split.size** and **mapred.min.split.size**? Like 8 * 1024 * 1024L for each. – Zyoma Jun 28 '17 at 22:50
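To close the loop, here is a rough sketch of the two approaches discussed in this thread (class name and output path are illustrative, same Spark 1.6-style API as the question): forcing a partition count with repartition(), which does incur a shuffle, and rewriting the data with a smaller parquet.block.size before reading it back. As noted above, the asker did not see the read-side partition count change from the configuration settings alone.

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;
    import org.apache.spark.sql.SaveMode;

    public class RepartitionExperiment {
        public static void main(String[] args) {
            JavaSparkContext sparkContext = new JavaSparkContext();
            SQLContext sqlContext = new SQLContext(sparkContext);

            // Option 1: accept the block-based read partitions, then repartition
            // to match the cluster (3 executors x 4 cores = 12). This shuffles.
            DataFrame dataFrame = sqlContext.read().parquet("hdfs:////user/test/parquet-folder");
            DataFrame widened = dataFrame.repartition(12);
            System.out.println("After repartition: " + widened.rdd().getNumPartitions());

            // Option 2 (from the comments): rewrite with a smaller row-group size
            // (parquet.block.size) and read the rewritten folder back instead.
            sparkContext.hadoopConfiguration()
                    .set("parquet.block.size", Long.toString(8 * 1024 * 1024L));
            widened.write().mode(SaveMode.Overwrite)
                    .parquet("hdfs:////user/test/parquet-folder-small-groups");

            sparkContext.close();
        }
    }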