
I have a table created in Hive, stored in an S3 location. It has about 10 columns and is partitioned on three columns: month, year, and city, in that order.
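For reference, the table was created with a statement along these lines (the data columns below are simplified placeholders; only the partition spec and location reflect the real table):

// Simplified DDL sketch; the real table has about 10 data columns.
spark.sql(s"""
  CREATE TABLE IF NOT EXISTS $tableName (
    id BIGINT,
    amount DOUBLE
  )
  PARTITIONED BY (month STRING, year STRING, city STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION 's3://s3Root/tableName'
""")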

I am running a Spark job that creates a DataFrame (2 billion rows) and writes it into this table.

val partitions: Seq[Column] = Seq(col("month"), col("year"), col("city"))

df.repartition(partitions: _*)
  .selectExpr(cs.map(_.name): _*)
  .write
  .mode("overwrite")
  .insertInto(tableName)

selectExpr(cs.map(_.name): _*) reorders the columns in the DataFrame to align with the column ordering of the table.
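(For context: cs holds the table's columns in table order. It isn't shown above, but it is derived from the catalog, roughly like this sketch:)

// insertInto is positional: Spark maps the DataFrame's columns onto the
// table's columns by position, not by name, so the ordering must match
// the table exactly, with the partition columns (month, year, city) last.
val cs = spark.table(tableName).schema.fields.toSeq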

When I run the above command to insert into the table, I see many staging files and multiple small files created under each city.

s3://s3Root/tableName/month/year/city/file1.csv
                                      file2.csv
                                      ...
                                      file200.csv

I am hoping to get a single file under each city, per year, per month; that is, to coalesce the output down to one file per partition.

Expected:

s3://s3Root/tableName/month/year/city/file.csv

Any help is appreciated.


1 Answer


To achieve one file per partition you should use partitionBy on the writer, passing the partition column names (it takes strings, not Column objects):

val partitionCols = Seq("month", "year", "city")

df.repartition(partitionCols.map(col): _*)
  .selectExpr(cs.map(_.name): _*)
  .write
  .partitionBy(partitionCols: _*)
  .mode("overwrite")
  .saveAsTable(tableName)

Note that Spark does not allow partitionBy to be combined with insertInto, which is why this writes with saveAsTable instead.

You might hope to avoid the repartition and rely on partitionBy alone, but partitionBy only controls the directory layout: without the repartition, every task holding rows for a given (month, year, city) writes its own file into that directory, and you still end up with many small files. It is the repartition on the partition columns that collapses each partition to a single file.
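If you would rather keep insertInto so the existing Hive table definition stays untouched, the repartition alone is enough, provided dynamic partitioning is enabled. A minimal sketch, assuming Spark 2.3+ (for the partitionOverwriteMode setting) and a Hive-enabled session:

import org.apache.spark.sql.functions.col

// Allow Hive partitions to be derived from the data being written.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

// Spark 2.3+: overwrite only the partitions present in the DataFrame,
// not the whole table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

val partitionCols = Seq("month", "year", "city")

// Repartitioning on the partition columns sends every row with the same
// (month, year, city) value to one task, so each Hive partition directory
// receives exactly one output file from insertInto.
df.repartition(partitionCols.map(col): _*)
  .selectExpr(cs.map(_.name): _*)
  .write
  .mode("overwrite")
  .insertInto(tableName)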
