
Hi, I have files in a directory: Folder/1.csv, Folder/2.csv, Folder/3.csv.

I want to read all of these files into a PySpark DataFrame/RDD, change some column values, and write the result back to the same files. I have tried this, but it creates new files in the folder (part-0000-something) instead of writing the modified data back into 1.csv, 2.csv and 3.csv.

How can I achieve that, either with a loop that loads each file into its own DataFrame, or with an array, or any other approach?
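
Roughly what I tried for a single file looks like this (the value column and the doubling are just placeholders for my real change); instead of producing a single file, Spark creates a directory of part-xxxxx files:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read one file and change a column (placeholder transformation).
df = spark.read.csv('Folder/1.csv', header=True, inferSchema=True)
df = df.withColumn('value', F.col('value') * 2)

# This creates a directory 'Folder/1_out/' containing part-xxxxx.csv files,
# not a single plain file -- and writing straight back to 'Folder/1.csv'
# has the same problem.
df.coalesce(1).write.mode('overwrite').csv('Folder/1_out', header=True)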

Priya p

1 Answer


Let's say that, after your transformations, df_1, df_2 and df_3 are the DataFrames that should be saved back into the folder under the same file names.

Then, you can use this function:

def export_csv(df, fileName, filePath):

  # Spark always writes a directory of part files, so write to a temporary
  # directory first and copy the single part file out afterwards.
  filePathDestTemp = filePath + ".dir/"

  # Write as CSV; header=True assumes the source files have a header row.
  df\
    .coalesce(1)\
    .write\
    .mode('overwrite')\
    .format('csv')\
    .option('header', True)\
    .save(filePathDestTemp)

  # Locate the single part-xxxxx.csv file and copy it to the target name
  # (fileName is expected to include the .csv extension).
  listFiles = dbutils.fs.ls(filePathDestTemp)
  for subFiles in listFiles:
    if subFiles.name[-4:] == ".csv":
      dbutils.fs.cp(filePathDestTemp + subFiles.name, filePath + fileName)

  # Remove the temporary directory.
  dbutils.fs.rm(filePathDestTemp, recurse=True)

...and call it for each df:

export_csv(df_1, '1.csv', 'Folder/')
export_csv(df_2, '2.csv', 'Folder/')
export_csv(df_3, '3.csv', 'Folder/')
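
If all three files get the same column change, the same function can also be driven from a loop instead of three separate calls. This is just a sketch: the value column and the doubling are placeholders for your real transformation. Reading a file and then overwriting it via export_csv works here because the data is materialised into the temporary directory before the copy replaces the original:

from pyspark.sql import functions as F

for name in ['1.csv', '2.csv', '3.csv']:
  df = spark.read.csv('Folder/' + name, header=True, inferSchema=True)
  df = df.withColumn('value', F.col('value') * 2)  # placeholder transformation
  export_csv(df, name, 'Folder/')
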
Luiz Viola
  • Can you explain the logic here? Which version of Spark supports this function? Can I use it from a Scala class or in PySpark? – Priya p Jun 30 '22 at 16:46