I would like to understand where exactly the job is consuming more time?
Most certainly. Using a massively parallel big data framework just to work single-threadedly on a locked primitive (even if AtomicLong is fast at its job) is quite slow. Moreover, to get there you go through a few time-consuming steps, such as collecting the data (which may not even work, since there may be more of it than fits in the driver's memory).
All in all, as you have very well guessed, this is not the right approach.
One very important point, on top of that, is that your use of an atomic counter is not valid in the Spark programming model:
val count = new java.util.concurrent.atomic.AtomicInteger(0)
val modsetChange = afterOffsetRDD.map({
  row => (ss.value(count.incrementAndGet), row.getInt(1))
}).toDF()
That is because, in the Spark programming model, RDD manipulations occur on "worker" nodes, while the main program is executed by the "driver" node. Here, the count variable is therefore held by the driver, while the row => ... code is executed by the workers, which are probably not even on the same machine. Spark makes this seamless by shipping a copy of count to each worker, meaning that each worker ends up with its own count. This happens to work here because, by coalescing the partitions down to one, you have only one worker; had this not been the case, you would have gotten unpredictable results.
So, the rule of thumb is: never modify driver-side objects inside RDD operations. The shared-variable constructs Spark does provide are broadcast variables (read-only) and accumulators (write-only aggregation), which you can check in the documentation.
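If all you actually need is a count of rows, the Spark construct for that kind of worker-to-driver aggregation is an accumulator. A minimal sketch, assuming Spark 2.x and the sc and afterOffsetRDD already present in your code:
// Accumulators are Spark's write-only shared variables: each task updates
// its own local copy and Spark merges the results back on the driver.
val rowCount = sc.longAccumulator("rowCount")

afterOffsetRDD.foreach { _ =>
  rowCount.add(1) // safe inside an action; Spark merges the per-task updates
}

// The merged total is only meaningful on the driver, once the action has run.
println(rowCount.value)
Note that even an accumulator cannot hand out per-row positions in a deterministic order, which is what your code really wants; zipWithIndex, shown further down, is the tool for that.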
Also is there any way to achieve the above transformation in distributed manner in spark by maintaining the same order?
Yes, there is. But first: do you realize that the code you have shown does not actually maintain any order? (Or at least, that there is no guarantee that it does?) That is because your select clauses have no order by. Therefore, the execution engine is free to reorder the data in any way it wants (e.g. not in line order). Nothing resembling your goal is achievable through a SQL API as long as no ORDER BY is involved.
Without one, Spark does not guarantee that RDD elements are ordered. Under certain circumstances you may happen to observe the sorted behaviour; it is just not provided "by default".
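To make that concrete, here is a small sketch of what an explicit ordering looks like through the SQL / DataFrame API; the SparkSession spark, the table name tableA and the columns id, value and event_time are placeholders you would have to adapt to your schema:
// Without an explicit sort, Spark SQL is free to return rows in any order.
val ordered = spark.sql("SELECT id, value FROM tableA ORDER BY event_time")

// Equivalent DataFrame form:
val orderedDf = spark.table("tableA").orderBy("event_time")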
If your original input is text files, though, you can always do:
sc.textFile(yourDataFile)
and be assured that lines are in order.
Here is how I would do it, provided you have resolved the ordering issue above.
Using the pure RDD API:
Let's say we have tables A and B as ordered RDDs:
scala> val rddA = sc.parallelize(Seq((101, "xxx"), (102, "aa"), (103, "bb")))
scala> val rddB = sc.parallelize(Seq((101, 22), (102, 23), (213, 34)))
Then, what I want to do is use zipWithIndex to add a row number to each element of both RDDs, and then ask Spark to join the two RDDs by matching on those line numbers.
val rddAWithPosition = rddA.zipWithIndex.map(_.swap)
val rddBWithPosition = rddB.zipWithIndex.map(_.swap)
val joinRDD = rddAWithPosition.join(rddBWithPosition)
// What joinRDD looks like:
scala> joinRDD.take(1)
res2: Array[(Long, ((Int, String), (Int, Int)))] = Array((0,((101,xxx),(101,22))))
You can see how each RDD element now bundles 1) the line number, 2) the tableA element and 3) the tableB element (as a nested pair). You can now re-arrange it as you see fit.
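For example, one way to re-arrange it is to sort by the line number and flatten the nested tuples into one record per line (the final shape here is just an illustration):
val merged = joinRDD
  .sortByKey() // order by the position that zipWithIndex assigned
  .map { case (_, ((idA, name), (idB, value))) => (idA, name, idB, value) }

// merged.collect() should give:
// Array((101,xxx,101,22), (102,aa,102,23), (103,bb,213,34))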
Also, can we achieve the above using Hive?
Yes, provided that, once more, you can define an order, in which case you can use the row_number() function. Then, as before, create a new table with the row number for tableA and for tableB, and join on the row number.
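A sketch of that route, run here through Spark SQL with Hive support enabled; the ordering column sort_key and the other column names are placeholders, because this again only works if your data contains a column that defines the order:
// row_number() requires an ORDER BY inside the OVER clause: no ordering
// column, no deterministic row numbers.
val joined = spark.sql("""
  SELECT a.id, a.name, b.amount
  FROM (SELECT *, row_number() OVER (ORDER BY sort_key) AS rn FROM tableA) a
  JOIN (SELECT *, row_number() OVER (ORDER BY sort_key) AS rn FROM tableB) b
    ON a.rn = b.rn
""")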