I have a Hive table stored at an S3 location. It has about 10 columns and is partitioned by 3 columns: month, year, and city, in that order.
I am running a Spark job that creates a DataFrame (~2 billion rows) and writes it into this table.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Shuffle by the partition columns, reorder columns to the table's schema, then insert.
val partitions: Seq[Column] = Seq(col("month"), col("year"), col("city"))
df.repartition(partitions: _*).selectExpr(cs.map(_.name): _*).write.mode("overwrite").insertInto(tableName)
Here selectExpr(cs.map(_.name): _*) reorders the DataFrame's columns to match the column ordering of the table (cs holds the table's columns).
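In case it matters, this is roughly how cs is built (a sketch, not my exact code; I am assuming the table schema can simply be read back through the SparkSession, with spark and tableName being the obvious variables):

import org.apache.spark.sql.types.StructField

// Hypothetical derivation of cs: take the target table's fields in table order,
// so cs.map(_.name) yields the column names in the order Hive expects.
val cs: Seq[StructField] = spark.table(tableName).schema.fields.toSeq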
When I run the above command to insert into the table, I see many staging files, plus lots of small files created under each city partition:
s3://s3Root/tableName/month/year/city/file1.csv
s3://s3Root/tableName/month/year/city/file2.csv
...
s3://s3Root/tableName/month/year/city/file200.csv
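I notice that ~200 files per directory matches Spark's default spark.sql.shuffle.partitions of 200, which makes me suspect the shuffle introduced by repartition is what determines the number of output files. A quick way to check the setting in effect (just a diagnostic, not part of the job):

// Print the shuffle parallelism in effect; Spark's default is 200.
println(spark.conf.get("spark.sql.shuffle.partitions"))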
I am hoping to end up with a single file under each city per year per month, i.e. to coalesce the output down to one file per partition.
Expected:
s3://s3Root/tableName/month/year/city/file.csv
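For contrast, as far as I can tell coalesce(1) before the write would get me down to one file per directory, but only by pushing all 2 billion rows through a single task, so it is not workable at this scale:

// NOT workable here: coalesce(1) collapses the DataFrame into a single task,
// which then writes every partition's file one after another with no parallelism.
df.coalesce(1).selectExpr(cs.map(_.name): _*).write.mode("overwrite").insertInto(tableName)

What I am after is that same one-file-per-directory result, but produced in parallel across the (month, year, city) partitions.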
Any help is appreciated.