
Hi, I have files in a directory: Folder/1.csv, Folder/2.csv, Folder/3.csv.

I want to read all of these files into a PySpark DataFrame/RDD, change some column values, and write the result back to the same files. I have tried this, but it creates new files in the folder (part-0000-something) instead of writing the modified data back into 1.csv, 2.csv and 3.csv.

How can I achieve that, either with a loop that loads each file into its own DataFrame, or with an array, or any other approach?
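
Roughly what I tried for a single file looks like this (the value column and the doubling are just placeholders for my real change); instead of producing a single file, Spark creates a directory of part-xxxxx files:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read one file and change a column (placeholder transformation).
df = spark.read.csv('Folder/1.csv', header=True, inferSchema=True)
df = df.withColumn('value', F.col('value') * 2)

# This creates a directory 'Folder/1_out/' containing part-xxxxx.csv files,
# not a single plain file -- and writing straight back to 'Folder/1.csv'
# has the same problem.
df.coalesce(1).write.mode('overwrite').csv('Folder/1_out', header=True)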

Priya p

1 Answer


Let's say that, after your transformations, df_1, df_2 and df_3 are the DataFrames that should be saved back into the folder under the same file names.

Then, you can use this function:

def export_csv(df, fileName, filePath):

  # Spark always writes a directory of part files, so write to a temporary
  # directory first and copy the single part file out afterwards.
  filePathDestTemp = filePath + ".dir/"

  # Write as CSV; header=True assumes the source files have a header row.
  df\
    .coalesce(1)\
    .write\
    .mode('overwrite')\
    .format('csv')\
    .option('header', True)\
    .save(filePathDestTemp)

  # Locate the single part-xxxxx.csv file and copy it to the target name
  # (fileName is expected to include the .csv extension).
  listFiles = dbutils.fs.ls(filePathDestTemp)
  for subFiles in listFiles:
    if subFiles.name[-4:] == ".csv":
      dbutils.fs.cp(filePathDestTemp + subFiles.name, filePath + fileName)

  # Remove the temporary directory.
  dbutils.fs.rm(filePathDestTemp, recurse=True)

...and call it for each df:

export_csv(df_1, '1.csv', 'Folder/')
export_csv(df_2, '2.csv', 'Folder/')
export_csv(df_3, '3.csv', 'Folder/')
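
If all three files get the same column change, the same function can also be driven from a loop instead of three separate calls. This is just a sketch: the value column and the doubling are placeholders for your real transformation. Reading a file and then overwriting it via export_csv works here because the data is materialised into the temporary directory before the copy replaces the original:

from pyspark.sql import functions as F

for name in ['1.csv', '2.csv', '3.csv']:
  df = spark.read.csv('Folder/' + name, header=True, inferSchema=True)
  df = df.withColumn('value', F.col('value') * 2)  # placeholder transformation
  export_csv(df, name, 'Folder/')
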
Luiz Viola
  • Can you explain the logic here? Which version of Spark supports this function? Can I use it from a Scala class or in PySpark? – Priya p Jun 30 '22 at 16:46