
I am migrating my PySpark code from Spark 1.6 to 2.x. In 1.6 I wrote the output with this syntax:

input_df.repartition(number_of_files) \
    .write.mode(file_saveMode) \
    .format(file_format) \
    .option("header", "true") \
    .save(nfs_path)

This produced part files named like this:

part-00000

part-00001

...

When I ran the same code in PySpark 2.2, the part files got different names:

part-00000-2feefae7-47d7-4f1a-ade6-7dbd07f42f54-c000.csv

part-00001-2feefae7-47d7-4f1a-ade6-7dbd07f42f54-c000.csv

I then changed the code to the 2.x style:

input_df.repartition(number_of_files) \
    .write.mode(file_saveMode) \
    .option("header", "true") \
    .csv(nfs_path)

But the file names still have the same long format:

part-00000-2feefae7-47d7-4f1a-ade6-7dbd07f42f54-c000.csv

Can anyone explain why this is happening?

SB07
  • Exact name of the output file was never guaranteed, so it is not a breaking change. Why is it a problem for you? – Alper t. Turker Jun 02 '18 at 17:54
  • My other processes depend on these exact output file names. This was not an issue in Spark 1.6, where I got the same names on every run. If this is the default behaviour in Spark 2.2, I need to change the dependent processes. – SB07 Jun 02 '18 at 18:02
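
If the dependent processes cannot be changed right away, one possible workaround is to rename the part files once the write has finished. The sketch below is only an illustration: `legacy_part_names` is a hypothetical helper, not part of Spark, and it assumes that `nfs_path` is a directory reachable through the local filesystem and that the files follow the naming pattern shown in the question (`part-NNNNN-<uuid>-c000.csv`). It uses only the Python standard library.

```python
import os
import re

# Matches Spark 2.x part files as shown in the question, for example
# part-00000-2feefae7-47d7-4f1a-ade6-7dbd07f42f54-c000.csv
# (assumption: other Spark versions or formats may use a different pattern).
PART_FILE = re.compile(r"^(part-\d{5})-[0-9a-f\-]{36}(?:-c\d+)?(?:\.\w+)*$")


def legacy_part_names(output_dir):
    """Rename long part file names in output_dir to the short part-NNNNN form.

    Hypothetical helper. Returns a dict mapping each old name to its new name.
    """
    renamed = {}
    for name in sorted(os.listdir(output_dir)):
        match = PART_FILE.match(name)
        if match is None:
            # Leave _SUCCESS, hidden checksum files, and short names untouched.
            continue
        target = match.group(1)
        target_path = os.path.join(output_dir, target)
        if os.path.exists(target_path):
            # Two files map to the same short name; stop instead of overwriting.
            raise FileExistsError(target_path)
        os.rename(os.path.join(output_dir, name), target_path)
        renamed[name] = target
    return renamed
```

It would be called on the driver after `.save(nfs_path)` or `.csv(nfs_path)` returns, for example `legacy_part_names(nfs_path)`. The short names drop the `.csv` extension to match the 1.6 output shown above; as the first comment points out, the exact file names are not guaranteed, so the pattern may need adjusting for other versions.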

0 Answers