How to write .csv File in ADLS Using Pyspark

Question

I am reading a JSON file from ADLS and writing it back to ADLS with the extension changed to .csv, but Spark creates files with random names in ADLS (I am writing the script in Azure Synapse).

It generates one _SUCCESS file and a file with a random name such as part-000-***.csv.

I want the file to keep its name. For example, sfmc.json should be written to ADLS as sfmc.csv.

score 0 · Answer 1 · answered Nov 07 '22 at 06:17

That is how Spark persists data: each partition is written as its own part file. You can use the Databricks file system utility (dbutils.fs) to rename the file.

I have written a small utility function that gathers all the data on one partition, persists it as Parquet, and renames the only data file in the folder. You can adapt it for JSON or CSV. The utility accepts the folder path and file name, writes to a "tmp" subfolder, and then copies the data file to the desired folder under the desired name:

def export_spark_df_to_parquet(df, dir_dbfs_path, parquet_file_name):
  """Write df as a single Parquet file named parquet_file_name in dir_dbfs_path."""
  tmp_parquet_dir_dbfs_path = dir_dbfs_path + "/tmp"
  parquet_file_dbfs_path = dir_dbfs_path + "/" + parquet_file_name

  # Gather all data on one partition so Spark writes a single data file
  df.repartition(1).write.mode("overwrite").parquet(tmp_parquet_dir_dbfs_path)

  # Copy the only .parquet part file to the target path under the desired name
  for _file in dbutils.fs.ls(tmp_parquet_dir_dbfs_path):
    if _file.name.endswith(".parquet"):
      dbutils.fs.cp(_file.path, parquet_file_dbfs_path)
      break

Usage:

export_spark_df_to_parquet(df, "dbfs:/my_folder", "my_df.parquet")
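
For the CSV case in the question, the same idea might look like the sketch below. It is not part of the original utility: the names find_part_file and export_spark_df_to_csv are made up for illustration, and the fs parameter is an assumption, expected to be an object with ls, mv and rm methods such as dbutils.fs (or mssparkutils.fs in a Synapse notebook). Writing a header row is also an assumption.

```python
import posixpath


def find_part_file(names, extension=".csv"):
    """Return the first Spark part file name with the given extension, or None."""
    for name in names:
        if name.startswith("part-") and name.endswith(extension):
            return name
    return None


def export_spark_df_to_csv(df, dir_path, csv_file_name, fs):
    """Write df as one CSV file named csv_file_name inside dir_path.

    fs is assumed to expose ls(path), mv(src, dst) and rm(path, recurse),
    as dbutils.fs and mssparkutils.fs do.
    """
    tmp_dir = posixpath.join(dir_path, "tmp")
    target = posixpath.join(dir_path, csv_file_name)

    # One partition means Spark writes exactly one part-*.csv file
    df.repartition(1).write.mode("overwrite").option("header", True).csv(tmp_dir)

    part = find_part_file([entry.name for entry in fs.ls(tmp_dir)])
    if part is None:
        raise FileNotFoundError("no part-*.csv file found in " + tmp_dir)

    # Move the part file to its final name, then drop the temporary folder
    fs.mv(posixpath.join(tmp_dir, part), target)
    fs.rm(tmp_dir, True)
    return target
```

Usage would then be along the lines of export_spark_df_to_csv(df, "dbfs:/my_folder", "sfmc.csv", dbutils.fs). Passing the file system utility in as a parameter keeps the function independent of which notebook environment provides it.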

score 0 · Answer 2 · answered Nov 07 '22 at 06:17

Spark does not let you choose the name of the output file. It generates part files with random file names. When I used df.write (where df is a Spark dataframe), I got a randomly generated filename.

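If you want to generate a file with a specific name, one option is to use pandas. Convert the Spark dataframe to a pandas dataframe using toPandas() and then save the file using the to_csv() method (assuming CSV is the required file format). Note that toPandas() collects all the data on the driver, so this approach suits data that fits in driver memory.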

pdf = df.toPandas()
pdf.to_csv("abfss://data@datalk0711.dfs.core.windows.net/output/output.csv")

Running the above code produced a single file with the required file name.
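
To keep the source file's name, as the question asks (sfmc.json becoming sfmc.csv), the target path can be derived from the input path. The sketch below is only an illustration: csv_path_for and save_as_named_csv are hypothetical helper names, and passing index=False (to leave out the pandas row index) is an assumption about the desired output.

```python
import posixpath


def csv_path_for(source_path, output_dir=None):
    """Map .../sfmc.json to .../sfmc.csv, optionally placing it in output_dir."""
    directory, file_name = posixpath.split(source_path)
    stem, _ext = posixpath.splitext(file_name)
    return posixpath.join(output_dir or directory, stem + ".csv")


def save_as_named_csv(df, source_path, output_dir=None):
    """Save a Spark dataframe through pandas under the source file's base name."""
    target = csv_path_for(source_path, output_dir)
    # toPandas() brings every row to the driver; index=False is an assumption
    df.toPandas().to_csv(target, index=False)
    return target
```

For example, calling save_as_named_csv(df, json_path, "abfss://data@datalk0711.dfs.core.windows.net/output"), where json_path is the path the JSON file was read from, would write sfmc.csv into that output folder when json_path ends in sfmc.json.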
