
We have a stack consisting of Hadoop + Hive + Spark + Dremio. Because Spark writes many HDFS files for a single Hive partition (how many depends on the number of workers), Dremio fails when querying the table: the limit on the number of HDFS files is exceeded. Is there any way to solve this without manually setting a smaller number of Spark workers? We don't want to lose Spark's distributed performance and benefits.
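
For illustration, a write of this shape reproduces the behaviour (the table name, output path, and partition column below are placeholders, not the real schema): every Spark task writes its own part-file into each Hive partition it has rows for, so the more tasks the job runs with, the more small HDFS files each partition accumulates.

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("partitioned-write")
  .enableHiveSupport()
  .getOrCreate()

// Source data is spread over many tasks (one per input split / shuffle partition).
val events = spark.table("staging.events")

// Each task writes its own part-file into every "event_date" directory it touches,
// so a job with hundreds of tasks leaves hundreds of files per Hive partition.
events.write
  .partitionBy("event_date")
  .mode(SaveMode.Append)
  .parquet("hdfs:///warehouse/events")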

Luis Leal

1 Answer


You can call repartition on the same columns you pass to partitionBy; this shuffles all rows for a given partition value into a single task, so each Hive partition ends up with exactly one output file. Because the data is still spread across one task per distinct partition-value combination, the write keeps enough parallelism in your Spark job.

df.repartition($"a", $"b", $"c", $"d", $"e")
  .write
  .partitionBy("a", "b", "c", "d", "e")
  .mode(SaveMode.Append)
  .parquet(s"$location")
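
A self-contained sketch of the same idea (the SparkSession setup, source table, column names, and location are placeholders; note that the $"col" syntax requires spark.implicits._):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("compact-partition-files")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

val location = "hdfs:///warehouse/my_table"   // placeholder output path
val df = spark.table("staging.my_table")      // placeholder source table

// Shuffle so that all rows sharing the same (a, b, c, d, e) land in one task;
// partitionBy then writes exactly one file per Hive partition directory.
df.repartition($"a", $"b", $"c", $"d", $"e")
  .write
  .partitionBy("a", "b", "c", "d", "e")
  .mode(SaveMode.Append)
  .parquet(location)

One caveat of this design: since all rows for a given partition value are collapsed into a single task, write parallelism is bounded by the number of distinct partition-value combinations, and a heavily skewed partition becomes one large task and one large file.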
Jayadeep Jayaraman