
I am trying to write a DataFrame to a folder in CSV format using PySpark:

add_dist.write.format("csv").option("sep", ",").mode("overwrite").save(r"C:\BigData\projects\datalake\address_op")

(Note the raw string: in a plain string literal, `\a` in `\address_op` would be interpreted as a bell character.)

The DataFrame has 25 records in total, and after writing, the folder contains 25 part files (part-00000 through part-00024). What do I do to get everything in a single file (partition)?

  • Use `repartition` as `add_dist.repartition(1).write.format("csv").option("sep",",").mode("overwrite").save` – Azhar Khan Sep 11 '22 at 12:21

1 Answer


It's more efficient to use coalesce instead of repartition in this case, since coalesce(1) only merges existing partitions and avoids the full shuffle that repartition(1) triggers.

Here is a function that might help; it also lets you choose the output file name:

def export_csv(df, fileName, filePath):

  # Spark always writes a directory of part files, so write to a
  # temporary directory first.
  filePathDestTemp = filePath + ".dir/"

  df\
    .coalesce(1)\
    .write\
    .format("csv")\
    .option("header", "true")\
    .mode("overwrite")\
    .save(filePathDestTemp)

  # Find the single part file Spark produced and copy it to the
  # desired file name. (Without .format("csv") above, the default
  # Parquet writer would run and no .csv file would ever be found.)
  listFiles = dbutils.fs.ls(filePathDestTemp)
  for subFiles in listFiles:
    if subFiles.name[-4:] == ".csv":
      dbutils.fs.cp(filePathDestTemp + subFiles.name, filePath + fileName + '.csv')

  # Clean up the temporary directory.
  dbutils.fs.rm(filePathDestTemp, recurse=True)
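Note that `dbutils` is only available on Databricks. Outside that environment, the same pattern — write to a temporary directory with coalesce(1), promote the single part file to the desired name, then delete the directory — can be sketched with the standard library. This is a local-filesystem sketch with hypothetical paths; it assumes the Spark write has already produced the part file:

```python
import shutil
from pathlib import Path

def promote_part_file(temp_dir, target_path):
    """Copy the single part-*.csv file out of a Spark output
    directory to target_path, then delete the directory."""
    temp_dir = Path(temp_dir)
    part_files = list(temp_dir.glob("part-*.csv"))
    if len(part_files) != 1:
        raise RuntimeError(f"expected one part file, found {len(part_files)}")
    shutil.copyfile(part_files[0], target_path)
    shutil.rmtree(temp_dir)

# Simulate a coalesce(1) output directory: one part file plus a
# _SUCCESS marker, as Spark would produce.
tmp = Path("address_op.dir")
tmp.mkdir(exist_ok=True)
(tmp / "part-00000-abc123.csv").write_text("id,city\n1,Pune\n")
(tmp / "_SUCCESS").write_text("")

promote_part_file(tmp, "address_op.csv")
print(Path("address_op.csv").read_text())  # → id,city / 1,Pune
```

The glob for `part-*.csv` mirrors what the `dbutils.fs.ls` loop above does; because the write used coalesce(1), exactly one part file is expected.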
Luiz Viola