
I am trying to write a DataFrame to a folder in CSV format using PySpark:

add_dist.write.format("csv").option("sep", ",").mode("overwrite").save(r"C:\BigData\projects\datalake\address_op")

(Note the raw string: in a plain string literal, `\a` in `\address_op` would be interpreted as a bell character.)

The DataFrame has 25 records in total, and after writing, the folder contains 25 part files (part-00000 through part-00024). What do I do to get everything in a single file (partition)?

  • Use `repartition` as `add_dist.repartition(1).write.format("csv").option("sep",",").mode("overwrite").save` – Azhar Khan Sep 11 '22 at 12:21

1 Answer


It's more efficient to use coalesce instead of repartition in this case, since coalesce(1) only merges existing partitions and avoids the full shuffle that repartition(1) triggers.

Here is a function that might help; it also lets you choose the output file name:

def export_csv(df, fileName, filePath):

  # Spark always writes a directory of part files, so write to a
  # temporary directory first.
  filePathDestTemp = filePath + ".dir/"

  df\
    .coalesce(1)\
    .write\
    .format("csv")\
    .option("header", "true")\
    .mode("overwrite")\
    .save(filePathDestTemp)

  # Find the single part file Spark produced and copy it to the
  # desired file name. (Without .format("csv") above, the default
  # Parquet writer would run and no .csv file would ever be found.)
  listFiles = dbutils.fs.ls(filePathDestTemp)
  for subFiles in listFiles:
    if subFiles.name[-4:] == ".csv":
      dbutils.fs.cp(filePathDestTemp + subFiles.name, filePath + fileName + '.csv')

  # Clean up the temporary directory.
  dbutils.fs.rm(filePathDestTemp, recurse=True)
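Note that `dbutils` is only available on Databricks. Outside that environment, the same pattern — write to a temporary directory with coalesce(1), promote the single part file to the desired name, then delete the directory — can be sketched with the standard library. This is a local-filesystem sketch with hypothetical paths; it assumes the Spark write has already produced the part file:

```python
import shutil
from pathlib import Path

def promote_part_file(temp_dir, target_path):
    """Copy the single part-*.csv file out of a Spark output
    directory to target_path, then delete the directory."""
    temp_dir = Path(temp_dir)
    part_files = list(temp_dir.glob("part-*.csv"))
    if len(part_files) != 1:
        raise RuntimeError(f"expected one part file, found {len(part_files)}")
    shutil.copyfile(part_files[0], target_path)
    shutil.rmtree(temp_dir)

# Simulate a coalesce(1) output directory: one part file plus a
# _SUCCESS marker, as Spark would produce.
tmp = Path("address_op.dir")
tmp.mkdir(exist_ok=True)
(tmp / "part-00000-abc123.csv").write_text("id,city\n1,Pune\n")
(tmp / "_SUCCESS").write_text("")

promote_part_file(tmp, "address_op.csv")
print(Path("address_op.csv").read_text())  # → id,city / 1,Pune
```

The glob for `part-*.csv` mirrors what the `dbutils.fs.ls` loop above does; because the write used coalesce(1), exactly one part file is expected.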
Luiz Viola