
We have a stack consisting of Hadoop + Hive + Spark + Dremio. Because Spark writes many HDFS files for a single Hive partition (how many depends on the number of workers), Dremio fails when querying the table: the limit on the number of HDFS files is exceeded. Is there any way to solve this without manually setting a smaller number of Spark workers? We don't want to lose Spark's distributed performance and benefits.
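
For illustration, a write of this shape reproduces the behaviour (the table name, output path, and partition column below are placeholders, not the real schema): every Spark task writes its own part-file into each Hive partition it has rows for, so the more tasks the job runs with, the more small HDFS files each partition accumulates.

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("partitioned-write")
  .enableHiveSupport()
  .getOrCreate()

// Source data is spread over many tasks (one per input split / shuffle partition).
val events = spark.table("staging.events")

// Each task writes its own part-file into every "event_date" directory it touches,
// so a job with hundreds of tasks leaves hundreds of files per Hive partition.
events.write
  .partitionBy("event_date")
  .mode(SaveMode.Append)
  .parquet("hdfs:///warehouse/events")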

Luis Leal

1 Answer


You can call repartition on the same columns you pass to partitionBy; this shuffles all rows for a given partition value into a single task, so each Hive partition ends up with exactly one output file. Because the data is still spread across one task per distinct partition-value combination, the write keeps enough parallelism in your Spark job.

df.repartition($"a", $"b", $"c", $"d", $"e")
  .write
  .partitionBy("a", "b", "c", "d", "e")
  .mode(SaveMode.Append)
  .parquet(s"$location")
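
A self-contained sketch of the same idea (the SparkSession setup, source table, column names, and location are placeholders; note that the $"col" syntax requires spark.implicits._):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("compact-partition-files")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

val location = "hdfs:///warehouse/my_table"   // placeholder output path
val df = spark.table("staging.my_table")      // placeholder source table

// Shuffle so that all rows sharing the same (a, b, c, d, e) land in one task;
// partitionBy then writes exactly one file per Hive partition directory.
df.repartition($"a", $"b", $"c", $"d", $"e")
  .write
  .partitionBy("a", "b", "c", "d", "e")
  .mode(SaveMode.Append)
  .parquet(location)

One caveat of this design: since all rows for a given partition value are collapsed into a single task, write parallelism is bounded by the number of distinct partition-value combinations, and a heavily skewed partition becomes one large task and one large file.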
Jayadeep Jayaraman