How can Apache Spark preserve the order of lines in the output text file?

Question

Can anyone help me understand how Apache Spark is able to preserve the order of lines in the output when the input is read from a text file? Consider the code snippet below:

sparkContext.textFile(<inputTextFilePath>)
        .coalesce(1)
        .saveAsTextFile(<outputTextFilePath>)

The text file is several GB in size, and I can see that the data is read in parallel by the worker nodes and written to the destination folder as a single file (since the partition count is set to 1). When I open the output file, all the lines are in order. How does Spark achieve this ordering?

Any idea if answer is incorrect? – thebluephantom, Jul 19 '21 at 09:05

score 0 · Answer 1 · answered Jul 16 '21 at 15:24

There is no guarantee in general.

coalesce has optimization logic based on partition locality. A large file is split into many partitions, several of which may be on the same worker, and to reduce shuffling coalesce may group them by location rather than by their position in the file. So there is no guarantee that order is preserved. It may happen to hold in some cases, but not always.

For Parquet and ORC other considerations apply, but you state this is a text file.
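If the order matters, one option is to make it explicit instead of relying on how coalesce happens to group partitions. Below is a minimal sketch, not a definitive implementation: it tags each line with its position using zipWithIndex, sorts on that index into a single partition, and then drops the index before writing. The object name OrderedCopy and the two command-line paths are assumptions for illustration; they are not from the question. It also assumes the job is launched with spark-submit, which supplies the master URL.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver object; the name and argument layout are illustrative only.
object OrderedCopy {
  def main(args: Array[String]): Unit = {
    require(args.length >= 2, "usage: OrderedCopy <inputPath> <outputPath>")
    val inputPath  = args(0) // assumed: path of the input text file
    val outputPath = args(1) // assumed: output directory that does not exist yet

    val sc = new SparkContext(new SparkConf().setAppName("ordered-copy"))
    try {
      sc.textFile(inputPath)
        // Pair every line with a Long index, assigned by partition index
        // first and then by position within each partition.
        .zipWithIndex()
        // Sort on the index and request one output partition, so the
        // single output file is written in index order.
        .sortBy(_._2, ascending = true, numPartitions = 1)
        // Drop the index again so only the original line is written.
        .map(_._1)
        .saveAsTextFile(outputPath)
    } finally {
      sc.stop()
    }
  }
}
```

The sort adds a shuffle, and the final stage still runs as a single task, so for a file of several GB this trades some run time for an ordering that does not depend on partition placement.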
