
I have a table created in Hive, stored in an S3 location. It has about 10 columns and is partitioned on three columns: month, year, and city, in that order.
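For reference, the table was created with a statement along these lines (the data columns below are simplified placeholders; only the partition spec and location reflect the real table):

// Simplified DDL sketch; the real table has about 10 data columns.
spark.sql(s"""
  CREATE TABLE IF NOT EXISTS $tableName (
    id BIGINT,
    amount DOUBLE
  )
  PARTITIONED BY (month STRING, year STRING, city STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION 's3://s3Root/tableName'
""")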

I am running a Spark job that creates a DataFrame (2 billion rows) and writes it into this table.

val partitions: Seq[Column] = Seq(col("month"), col("year"), col("city"))

df.repartition(partitions: _*)
  .selectExpr(cs.map(_.name): _*)
  .write
  .mode("overwrite")
  .insertInto(tableName)

selectExpr(cs.map(_.name): _*) reorders the columns in the DataFrame to align with the column ordering of the table.
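(For context: cs holds the table's columns in table order. It isn't shown above, but it is derived from the catalog, roughly like this sketch:)

// insertInto is positional: Spark maps the DataFrame's columns onto the
// table's columns by position, not by name, so the ordering must match
// the table exactly, with the partition columns (month, year, city) last.
val cs = spark.table(tableName).schema.fields.toSeq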

When I run the above command to insert into the table, I see many staging files and multiple small files created under each city.

s3://s3Root/tableName/month/year/city/file1.csv
                                      file2.csv
                                      ...
                                      file200.csv

I am hoping to get a single file under each city, per year, per month; that is, to coalesce the output down to one file per partition.

Expected:

s3://s3Root/tableName/month/year/city/file.csv

Any help is appreciated.


1 Answer


To achieve one file per partition you should use partitionBy on the writer, passing the partition column names (it takes strings, not Column objects):

val partitionCols = Seq("month", "year", "city")

df.repartition(partitionCols.map(col): _*)
  .selectExpr(cs.map(_.name): _*)
  .write
  .partitionBy(partitionCols: _*)
  .mode("overwrite")
  .saveAsTable(tableName)

Note that Spark does not allow partitionBy to be combined with insertInto, which is why this writes with saveAsTable instead.

You might hope to avoid the repartition and rely on partitionBy alone, but partitionBy only controls the directory layout: without the repartition, every task holding rows for a given (month, year, city) writes its own file into that directory, and you still end up with many small files. It is the repartition on the partition columns that collapses each partition to a single file.
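If you would rather keep insertInto so the existing Hive table definition stays untouched, the repartition alone is enough, provided dynamic partitioning is enabled. A minimal sketch, assuming Spark 2.3+ (for the partitionOverwriteMode setting) and a Hive-enabled session:

import org.apache.spark.sql.functions.col

// Allow Hive partitions to be derived from the data being written.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

// Spark 2.3+: overwrite only the partitions present in the DataFrame,
// not the whole table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

val partitionCols = Seq("month", "year", "city")

// Repartitioning on the partition columns sends every row with the same
// (month, year, city) value to one task, so each Hive partition directory
// receives exactly one output file from insertInto.
df.repartition(partitionCols.map(col): _*)
  .selectExpr(cs.map(_.name): _*)
  .write
  .mode("overwrite")
  .insertInto(tableName)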
