Can anyone help me understand how Apache Spark
is able to preserve the order of lines in its output when the input is read from a text file? Consider the code snippet below:
sparkContext.textFile(<inputTextFilePath>)
.coalesce(1)
.saveAsTextFile(<outputTextFilePath>)
The text file is several GB in size, and I can see that the data is read in parallel by the worker nodes and written to the destination folder as a single file (since the partition count is set to 1). When I open the output file, all the lines are in order. How does Spark achieve this ordering?
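
To make the question concrete, here is a minimal plain-Java sketch of one way parallel work and ordered output can coexist. It is only a model, not Spark's code. It assumes each partition keeps an index matching its position in the input file, and that the single output is produced by visiting partitions in index order. The names PartitionOrderSketch, processPartition and mergeInPartitionOrder are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class PartitionOrderSketch {

    // Hypothetical stand-in for the work done on one partition: it just copies the lines.
    static List<String> processPartition(List<String> lines) {
        return new ArrayList<>(lines);
    }

    // Processes partitions on multiple threads, then joins the results by partition index.
    static List<String> mergeInPartitionOrder(List<List<String>> partitions) {
        // Partitions may be processed in any order on different threads, but
        // collecting an ordered stream keeps the results in index order.
        List<List<String>> processed = IntStream.range(0, partitions.size())
                .parallel()
                .mapToObj(i -> processPartition(partitions.get(i)))
                .collect(Collectors.toList());

        // Concatenate partition 0, then 1, then 2, ... into a single output list.
        return processed.stream()
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumption: the input file was split into three consecutive chunks.
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("line1", "line2"),
                Arrays.asList("line3"),
                Arrays.asList("line4", "line5"));

        System.out.println(mergeInPartitionOrder(partitions));
    }
}
```

In this model the order comes from how the pieces are combined, not from when each piece finishes. Whether Spark behaves the same way for textFile followed by coalesce(1) is exactly what I would like confirmed.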