I would like to understand where exactly the job is consuming more time?
Most certainly. Using a massively parallel big data framework just to work single-threadedly on a locked primitive (even if AtomicLong is fast at its job) is quite slow. Moreover, to get there you go through a few time-consuming steps, such as collecting the data (which may not even work, since there may be more of it than fits in the driver's memory).
All in all, as you have very well guessed, this is not the right approach.
One very important point, on top of that, is that your use of an atomic counter is not valid in the Spark programming model:
val count = new java.util.concurrent.atomic.AtomicInteger(0)
val modsetChange = afterOffsetRDD.map({
  row => (ss.value(count.incrementAndGet), row.getInt(1))
}).toDF()
That is because, in the Spark programming model, RDD manipulations occur on "worker" nodes, while the main program is executed by the "driver" node. Here, the count variable is therefore held by the driver, while the row => ... code is executed by the workers, which are probably not even on the same machine. Spark makes this seamless by shipping a copy of count to each worker, meaning that each worker ends up with its own count. This happens to work here because, by coalescing the partitions down to one, you have only one worker; had this not been the case, you would have gotten unpredictable results.
So, the rule of thumb is: never modify driver-side objects inside RDD operations. The shared-variable constructs Spark does provide are broadcast variables (read-only) and accumulators (write-only aggregation), which you can check in the documentation.
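If all you actually need is a count of rows, the Spark construct for that kind of worker-to-driver aggregation is an accumulator. A minimal sketch, assuming Spark 2.x and the sc and afterOffsetRDD already present in your code:
// Accumulators are Spark's write-only shared variables: each task updates
// its own local copy and Spark merges the results back on the driver.
val rowCount = sc.longAccumulator("rowCount")

afterOffsetRDD.foreach { _ =>
  rowCount.add(1) // safe inside an action; Spark merges the per-task updates
}

// The merged total is only meaningful on the driver, once the action has run.
println(rowCount.value)
Note that even an accumulator cannot hand out per-row positions in a deterministic order, which is what your code really wants; zipWithIndex, shown further down, is the tool for that.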
Also is there any way to achieve the above transformation in distributed manner in spark by maintaining the same order?
Yes, there is. But first: do you realize that the code you have shown does not actually maintain any order? (Or at least, that there is no guarantee that it does?) That is because your select clauses have no order by. Therefore, the execution engine is free to reorder the data in any way it wants (e.g. not in line order). Nothing resembling your goal is achievable through a SQL API as long as no ORDER BY is involved.
Without one, Spark does not guarantee that RDD elements are ordered. Under certain circumstances you may happen to observe the sorted behaviour; it is just not provided "by default".
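To make that concrete, here is a small sketch of what an explicit ordering looks like through the SQL / DataFrame API; the SparkSession spark, the table name tableA and the columns id, value and event_time are placeholders you would have to adapt to your schema:
// Without an explicit sort, Spark SQL is free to return rows in any order.
val ordered = spark.sql("SELECT id, value FROM tableA ORDER BY event_time")

// Equivalent DataFrame form:
val orderedDf = spark.table("tableA").orderBy("event_time")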
If your original input is text files, though, you can always do:
sc.textFile(yourDataFile)
and be assured that lines are in order.
Here is how I would do it, provided you have resolved the ordering issue above.
Using the pure RDD API:
Let's say we have tables A and B as ordered RDDs:
scala> val rddA = sc.parallelize(Seq((101, "xxx"), (102, "aa"), (103, "bb")))
scala> val rddB = sc.parallelize(Seq((101, 22), (102, 23), (213, 34)))
Then, what I want to do is use zipWithIndex to add a row number to each element of both RDDs, and then ask Spark to join the two RDDs by matching on those line numbers.
val rddAWithPosition = rddA.zipWithIndex.map(_.swap)
val rddBWithPosition = rddB.zipWithIndex.map(_.swap)
val joinRDD = rddAWithPosition.join(rddBWithPosition)
// What joinRDD looks like:
scala> joinRDD.take(1)
res2: Array[(Long, ((Int, String), (Int, Int)))] = Array((0,((101,xxx),(101,22))))
You can see how each RDD element now bundles 1) the line number, 2) the tableA element and 3) the tableB element (as a nested pair). You can now re-arrange it as you see fit.
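For example, one way to re-arrange it is to sort by the line number and flatten the nested tuples into one record per line (the final shape here is just an illustration):
val merged = joinRDD
  .sortByKey() // order by the position that zipWithIndex assigned
  .map { case (_, ((idA, name), (idB, value))) => (idA, name, idB, value) }

// merged.collect() should give:
// Array((101,xxx,101,22), (102,aa,102,23), (103,bb,213,34))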
Also, can we achieve the above using Hive?
Yes, provided that, once more, you can define an order, in which case you can use the row_number() function. Then, as before, create a new table with the row number for tableA and for tableB, and join on the row number.
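A sketch of that route, run here through Spark SQL with Hive support enabled; the ordering column sort_key and the other column names are placeholders, because this again only works if your data contains a column that defines the order:
// row_number() requires an ORDER BY inside the OVER clause: no ordering
// column, no deterministic row numbers.
val joined = spark.sql("""
  SELECT a.id, a.name, b.amount
  FROM (SELECT *, row_number() OVER (ORDER BY sort_key) AS rn FROM tableA) a
  JOIN (SELECT *, row_number() OVER (ORDER BY sort_key) AS rn FROM tableB) b
    ON a.rn = b.rn
""")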