
I am trying to export data from a Spark dataframe to a .csv file:

df.coalesce(1)\
  .write\
  .format("com.databricks.spark.csv")\
  .option("header", "true")\
  .save(output_path)

It creates a file named "part-r-00001-512872f2-9b51-46c5-b0ee-31d626063571.csv".

I want the filename to be "part-r-00000.csv" or "part-00000.csv"

As the file is being created on AWS S3, I am limited in how I can use os.system commands.

How can I set the file name while keeping the header in the file?

Thanks!

  • Does it matter what it's called on HDFS if you can pull it down and name it whatever you like? `hadoop fs -cat path/to/output/part-r-* > path/to/local/file.csv` will dump all the parts from Hadoop into one file on your local disk. – pault Feb 06 '18 at 21:17
  • I'm trying to build an automated pyspark etl job, hence I'm avoiding all sorts of manual and local system interventions. my ask is simple 'when exporting data from a spark data frame to .csv file, the file name should be 'part-00000'. appreciate your response. – Naresh Y Feb 06 '18 at 21:59
  • I've never seen this type of filename before. Maybe the file already exists and this long name was made to avoid overwriting the file? – pault Feb 06 '18 at 22:00
  • no, I did cleanup the target folder and reran the job. that didn't help. – Naresh Y Feb 06 '18 at 22:09
  • AFAIK, the file name is `part-#{partition number}-#{random uuid}-#{something}`. AFAIK, the UUID is to allow multiple executors to write to the same directory without worrying about trying to write to the same file. – Narfanator Dec 28 '18 at 01:33

1 Answer


Well, though my question got a -3 rating, here I'm posting the solution that helped me address the problem. Being a techie, I always care more about the code and logic than the grammar; at least for me, a small amount of context should be enough to understand the problem.

Coming to the solution:

When we create a .csv file from a Spark dataframe, the output file is by default named part-x-yyyyy, where:

1) x is either 'm' or 'r', depending on whether the job was a map-only job or a reduce job
2) yyyyy is the mapper or reducer task number, which can be either 00000 or a random number

To rename the output file, running an HDFS command through os.system should do:

import os

output_path_stage = "..."  # set the source (staging) folder path here
output_path = "..."        # set the target folder path here

# Build the system command line, then execute it.
cmd2 = "hdfs dfs -mv " + output_path_stage + "part-*" + " " + output_path + "new_name.csv"
os.system(cmd2)
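
Since the file here lands on AWS S3, where a "rename" is really a copy plus delete, the same move can also be done without shelling out. Below is a minimal sketch using boto3; the bucket name and prefix are hypothetical, and it assumes a single part file from coalesce(1):

import boto3

s3 = boto3.client("s3")

bucket = "my-bucket"   # hypothetical bucket name
prefix = "output/"     # hypothetical output folder prefix

# Find the single part file written by coalesce(1).
listing = s3.list_objects_v2(Bucket=bucket, Prefix=prefix + "part-")
part_key = listing["Contents"][0]["Key"]

# S3 has no real rename: copy to the desired key, then delete the original.
s3.copy_object(Bucket=bucket, Key=prefix + "new_name.csv",
               CopySource={"Bucket": bucket, "Key": part_key})
s3.delete_object(Bucket=bucket, Key=part_key)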

FYI: if we use rdd.saveAsTextFile, the file is created with no header. If we use coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(output_path), the file is created with a random part-x name. The solution above lets us create a .csv file with the header and delimiter we need, along with the required file name.
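
Alternatively, for a fully automated PySpark ETL job (as the question's comments ask for), the rename can stay inside Spark by calling the JVM's Hadoop FileSystem API through py4j. A minimal sketch, assuming an active SparkSession named `spark` and that `output_path` ends with a trailing slash; with s3a:// paths this works on S3 as well:

# Get a FileSystem handle for the output path from the Spark JVM gateway.
Path = spark.sparkContext._jvm.org.apache.hadoop.fs.Path
conf = spark.sparkContext._jsc.hadoopConfiguration()
fs = Path(output_path).getFileSystem(conf)

# Locate the single part file written by coalesce(1) and rename it in place.
part_file = fs.globStatus(Path(output_path + "part-*"))[0].getPath()
fs.rename(part_file, Path(output_path + "new_name.csv"))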

  • Hey dude, sorry people dinged you for grammar. That's bad behavior on their part. – Narfanator Dec 28 '18 at 01:31
  • Hmm. Shoot. I came across your question and was hoping to be able to set it, rather than use commands to rename on the filesystem. Time to go digging! – Narfanator Dec 28 '18 at 01:36
  • It's not looking like it's possible with the libraries as-is, but people point to this for more info about adding this functionality: https://wiki.apache.org/hadoop/FAQ#How_do_I_change_final_output_file_name_with_the_desired_name_rather_than_in_partitions_like_part-00000.2C_part-00001.3F – Narfanator Dec 28 '18 at 01:44
  • Would you be able to locate the output file and rename it? e.g. use `glob` to get the file and then rename or create a copy of it with the intended filename? – AK91 Apr 09 '21 at 08:17