-2

Can anyone help me understand how Apache Spark is able to preserve the order of lines in the output when reading from a text file? Consider the code snippet below:

sparkContext.textFile(<inputTextFilePath>)
        .coalesce(1)
        .saveAsTextFile(<outputTextFilePath>)

The text file is several GB in size. I can see that the data is read in parallel by the worker nodes and written to the destination folder as a single file (since the partition count is set to 1). When I open the output file, all the lines are in order. How does Spark achieve this ordering?

Anoop Deshpande

1 Answer

0

There is no guarantee in general.

coalesce has optimization logic based on partition locality. A large file is split into many partitions, several of which may sit on the same worker, and to reduce shuffling coalesce may group them by location rather than by their original sequence. So the order may be preserved in some cases, but not always.

For Parquet and ORC, other considerations apply, but you state that this is a text file.
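
If the line order matters, one option is to make it explicit instead of relying on how coalesce happens to group partitions. The following is a minimal sketch, not a definitive implementation. It assumes a single input file whose partitions are numbered in file-offset order, so that zipWithIndex (which numbers elements by partition index and then by position within the partition) yields the original line number. The object name OrderedCopy and the two command-line arguments are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OrderedCopy {
  def main(args: Array[String]): Unit = {
    // Hypothetical arguments: the input file path and the output directory.
    val Array(inputPath, outputPath) = args

    // The master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("OrderedCopy"))
    try {
      sc.textFile(inputPath)
        // Pair every line with its index (assumed to equal its line number).
        .zipWithIndex()
        // Sort by that index into one partition, replacing coalesce(1).
        .sortBy(_._2, ascending = true, numPartitions = 1)
        // Drop the index again so only the original text is written.
        .map(_._1)
        .saveAsTextFile(outputPath)
    } finally {
      sc.stop()
    }
  }
}
```

The sort adds a shuffle, which is the cost of not depending on the locality-based grouping described above.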

thebluephantom