16

I'm using Spark to read in records (in this case in csv files) and process them. The files are already in some order, but this order isn't reflected by any column (think of it as a time series, but without any timestamp column -- each row is just in a relative order within the file). I'd like to use this ordering information in my Spark processing, to do things like comparing a row with the previous row. I can't explicitly order the records, since there is no ordering column.

Does Spark maintain the order of records it reads in from a file? Or, is there any way to access the file-order of records from Spark?

Jason Evans
  • 1,197
  • 1
  • 13
  • 22

2 Answers

16

Yes, when reading from a file, Spark maintains the order of the records. But once the data is shuffled, that order is not preserved. So to keep the original ordering, you either need to write your job so that no shuffling occurs, or you assign sequence numbers to the records and use those numbers while processing.

In a distributed framework like Spark, where data is spread across a cluster for fast processing, shuffling is bound to occur. So the best solution is to assign a sequential number to each row and use that number for ordering, for example as sketched below.
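A minimal sketch of that approach, runnable as a small standalone Spark app (the file path and app name are placeholders): `zipWithIndex` attaches each record's file-order position before any shuffle can disturb it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("sequential-ids").setMaster("local[*]")
val sc   = new SparkContext(conf)

// Lines come back in file order within each input split.
val lines = sc.textFile("data/records.csv")   // hypothetical path

// Attach a file-order index to every record *before* any shuffle:
// zipWithIndex: RDD[T] => RDD[(T, Long)]
val indexed = lines.zipWithIndex()

// After any later shuffle, the original order can be restored by
// sorting on the index.
indexed.sortBy { case (_, idx) => idx }.take(5).foreach(println)

sc.stop()
```

Note that `zipWithIndex` itself triggers a Spark job when the RDD has more than one partition, since it must first count the records in each partition to compute the offsets.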

Ram Ghadiyaram
  • 28,239
  • 13
  • 95
  • 121
Ramesh Maharjan
  • 41,071
  • 6
  • 69
  • 97
  • 1
    When reading a large HDFS file with multiple parallel tasks, resulting in multiple partitions, how can you expect to have any concept of *order*?? OK, you can "guess" that you are processing the first split (which makes it possible to skip the header when present) or not, but for the sequential row numbering, what kind of trick would you use...? – Samson Scharfrichter Aug 22 '17 at 19:25
  • If the input data is already partitioned, then it's not possible to expect ordering, as in the case of the Hadoop file system. In that case we should have sequential numbers set before we store a file in HDFS. – Ramesh Maharjan Aug 22 '17 at 23:29
  • @RameshMaharjan If you're reading in a dataset from many files, which then go to one partition each, I assume file-order is maintained within each partition, but that there are no order guarantees across partitions / files? – Jason Evans Sep 30 '17 at 13:19
  • So is the answer correct or not? Reading the second comment, the contents are ordered btw. – thebluephantom Aug 24 '19 at 06:47
  • 1
    @SamsonScharfrichter So is the answer correct or not? I assume zipWithIndex is the trick, but ... – thebluephantom Aug 24 '19 at 07:14
  • @RameshMaharjan Spark does not preserve ordering in case of partitioned files, not even within a single partition. Check this JIRA issue, https://issues.apache.org/jira/browse/SPARK-20144 – Sangram Gaikwad Feb 13 '20 at 14:58
  • I am using mapPartitions and zipWithIndex so that each file is processed in one partition and zipWithIndex generates indexes for each partition (each file) separately, and I find this to be working in my case. @SangramGaikwad Is there any other alternative? – kjsr7 May 17 '20 at 03:08
7

Order is not preserved when the data is shuffled. You can, however, enumerate the rows before doing your computations. If you are using an RDD, there's a function called zipWithIndex (RDD[T] => RDD[(T, Long)]) that does exactly what you are looking for.
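To tie this back to the question's use case of comparing a row with the previous one, here is a hypothetical sketch: it assumes an existing SparkContext `sc` (e.g. the one provided by spark-shell) and a placeholder path, indexes the rows with zipWithIndex, and then joins each row's index against index + 1 to pair it with its predecessor.

```scala
// Assumes an existing SparkContext `sc`; the path is a placeholder.
val rows = sc.textFile("data/records.csv")

// Pair every row with its position in the file: RDD[(String, Long)].
val withIdx = rows.zipWithIndex()

// Key current rows by their own index, and re-key a second copy by
// index + 1, so that row i meets row i - 1 in the join.
val current  = withIdx.map { case (row, i) => (i, row) }
val previous = withIdx.map { case (row, i) => (i + 1, row) }

// (index, (currentRow, previousRow)); row 0 drops out of the inner
// join because it has no predecessor.
val pairs = current.join(previous)
```

The join shuffles the data, but that no longer matters: the file order is already baked into the index.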

Ram Ghadiyaram
  • 28,239
  • 13
  • 95
  • 121
Miguel
  • 1,201
  • 2
  • 13
  • 30
  • If there are multiple CSV files to be read, then mapPartitions and zipWithIndex need to be used. – kjsr7 May 17 '20 at 03:11
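A sketch of the per-file numbering described in that comment, under the assumption that each input file lands in exactly one partition (which holds for small or unsplittable files); the glob path is a placeholder and `sc` is an existing SparkContext.

```scala
// Assumes each input file maps to exactly one partition.
val lines = sc.textFile("data/*.csv")   // hypothetical glob

// Number the rows independently inside each partition, i.e. inside
// each file under the one-file-per-partition assumption above.
val perFile = lines.mapPartitionsWithIndex { (partId, it) =>
  // Scala's Iterator.zipWithIndex counts from 0 within this partition only.
  it.zipWithIndex.map { case (row, i) => (partId, i.toLong, row) }
}
```

Each output triple carries the partition (file) id plus the row's position within that file, so downstream logic can reason about order per file without assuming any global ordering across files.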