
I have a Dataset in Spark (v2.1.1) with 3 columns (shown below) containing hierarchical data.

  • My objective is to assign an incremental number to each row based on the parent-child hierarchy. Graphically, the hierarchical data is a collection of trees.
  • As shown in the table below, the rows are already grouped by 'Global_ID'. Now I would like to generate the 'Value' column in incremental order, based on the hierarchy encoded in the 'Parent' and 'Child' columns.

Tabular Representation (Value is the desired output):

    +-----------+--------+-------+         +-----------+--------+-------+-------+
    |      Current Dataset       |         |      Desired Dataset (Output)      |
    +-----------+--------+-------+         +-----------+--------+-------+-------+
    | Global_ID | Parent | Child |         | Global_ID | Parent | Child | Value |
    +-----------+--------+-------+         +-----------+--------+-------+-------+
    |       111 |    111 |   123 |         |       111 |    111 |   111 |     1 |
    |       111 |    135 |   246 |         |       111 |    111 |   123 |     2 |
    |       111 |    123 |   456 |         |       111 |    123 |   789 |     3 |
    |       111 |    123 |   789 |         |       111 |    123 |   456 |     4 |
    |       111 |    111 |   111 |         |       111 |    111 |   135 |     5 |
    |       111 |    135 |   468 |         |       111 |    135 |   246 |     6 |
    |       111 |    135 |   268 |         |       111 |    135 |   468 |     7 |
    |       111 |    268 |   321 |         |       111 |    135 |   268 |     8 |
    |       111 |    138 |   139 |         |       111 |    268 |   321 |     9 |
    |       111 |    111 |   135 |         |       111 |    111 |   138 |    10 |
    |       111 |    111 |   138 |         |       111 |    138 |   139 |    11 |
    |       222 |    222 |   654 |         |       222 |    222 |   222 |    12 |
    |       222 |    654 |   721 |         |       222 |    222 |   987 |    13 |
    |       222 |    222 |   222 |         |       222 |    222 |   654 |    14 |
    |       222 |    721 |   127 |         |       222 |    654 |   721 |    15 |
    |       222 |    222 |   987 |         |       222 |    721 |   127 |    16 |
    |       333 |    333 |   398 |         |       333 |    333 |   333 |    17 |
    |       333 |    333 |   498 |         |       333 |    333 |   398 |    18 |
    |       333 |    333 |   333 |         |       333 |    333 |   498 |    19 |
    |       333 |    333 |   598 |         |       333 |    333 |   598 |    20 |
    +-----------+--------+-------+         +-----------+--------+-------+-------+

Tree Representation (Desired value is represented next to each node):

                      +-----+                                           +-----+
                   1  | 111 |                                       17  | 333 |
                      +--+--+                                           +--+--+
                         |                                                 |
         +---------------+--------+-----------------+           +----------+----------+
         |                        |                 |           |          |          |
      +--v--+                  +--v--+           +--v--+     +--v--+    +--v--+    +--v--+
   2  | 123 |                5 | 135 |        10 | 138 |     | 398 |    | 498 |    | 598 |
      +--+--+                  +--+--+           +--+--+     +--+--+    +--+--+    +--+--+  
   +-----+-----+         +--------+--------+        |          18         19         20
   |           |         |        |        |        |  
+--v--+     +--v--+   +--v--+  +--v--+  +--v--+  +--v--+ 
| 789 |     | 456 |   | 246 |  | 468 |  | 268 |  | 139 |                 +-----+
+-----+     +-----+   +-----+  +-----+  +--+--+  +-----+             12  | 222 |
   3           4         6        7      8 |        11                   +--+--+
                                        +--v--+                             |
                                        | 321 |                      +------+-------+
                                        +--+--+                      |              |
                                           9                      +--v--+        +--v--+
                                                               13 | 987 |    14  | 654 |
                                                                  +--+--+        +--+--+
                                                                                    |
                                                                                 +--v--+
                                                                             15  | 721 |
                                                                                 +--+--+
                                                                                    |
                                                                                 +--v--+
                                                                             16  | 127 |
                                                                                 +--+--+
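The numbering in the trees above is a preorder depth-first traversal of each tree, with the counter carried over from one tree to the next. As a point of reference (this is a sketch, not the asker's code; the class and method names are made up), a minimal plain-Java version of that traversal for a single tree could look like the following. Sibling order simply follows the order in which the edges appear in the input, so it may differ from the desired table where siblings are listed in another order:

```java
import java.util.*;

public class PreorderNumbering {

    // Assigns preorder-DFS numbers (1, 2, 3, ...) to the nodes of one tree.
    // Rows are (Global_ID, Parent, Child); a self-edge (Parent == Child) marks the root.
    public static Map<Integer, Integer> number(int[][] rows) {
        Map<Integer, List<Integer>> children = new LinkedHashMap<>();
        int root = -1;
        for (int[] r : rows) {
            if (r[1] == r[2]) { root = r[2]; continue; } // self-edge identifies the root
            children.computeIfAbsent(r[1], k -> new ArrayList<>()).add(r[2]);
        }
        Map<Integer, Integer> value = new LinkedHashMap<>();
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(root);
        int counter = 0;
        while (!stack.isEmpty()) {
            int node = stack.pop();
            value.put(node, ++counter);
            List<Integer> kids = children.getOrDefault(node, Collections.emptyList());
            for (int i = kids.size() - 1; i >= 0; i--) {
                stack.push(kids.get(i)); // reversed push preserves left-to-right sibling order
            }
        }
        return value;
    }

    public static void main(String[] args) {
        int[][] rows = {
            {111, 111, 123}, {111, 135, 246}, {111, 123, 456}, {111, 123, 789},
            {111, 111, 111}, {111, 135, 468}, {111, 135, 268}, {111, 268, 321},
            {111, 138, 139}, {111, 111, 135}, {111, 111, 138}
        };
        // Matches the desired table up to sibling order within a parent
        System.out.println(number(rows));
    }
}
```

For the forest case, the same traversal would be run once per Global_ID group while carrying the counter forward instead of resetting it to zero.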

Code Snippet:

Dataset<Row> myDataset = spark
                .sql("select Global_ID, Parent, Child from RECORDS");

JavaPairRDD<Row, Long> finalDataset = myDataset.groupBy(new Column("Global_ID"))
    .agg(functions.sort_array(functions.collect_list(new Column("Parent"))).as("parent_col"),
         functions.sort_array(functions.collect_list(new Column("Child"))).as("child_col"))
    .orderBy(new Column("Global_ID"))
    .withColumn("vars", functions.explode(<Spark UDF>))
    .select(new Column("vars"), new Column("parent_col"), new Column("child_col"))
    .javaRDD().zipWithIndex();


// Sample UDF (TODO: Actual Implementation)   
spark.udf().register("computeValue",
                (<Column Names>) -> <functionality & implementation>,
                DataTypes.<xxx>);

After a lot of research and going through many suggestions in blogs, I have tried the approaches below, but to no avail for my scenario.

Tech Stack :

  • Apache Spark (v2.1.1)

  • Java 8

  • AWS EMR Cluster (Spark App Deployment)


Data Volume:

  • Approximately ~20 million rows in the Dataset

Approaches Tried:

  1. Spark GraphX + GraphFrames:

  2. Spark GraphX Pregel API:


Any suggestions for alternative approaches (or modifications to the current ones) would be really helpful, as I am totally lost in figuring out a solution for this use case.

Appreciate your help! Thank you!

Sridher

1 Answer


Note: The solution below is in Scala Spark. You can easily translate it to Java.

Check this out. I tried doing it using Spark SQL so you can get the idea. The basic idea is to sort the Child, Parent, and Global_ID columns while aggregating and grouping them. Once grouped and sorted by Global_ID, explode the rest. You will get an ordered result table, to which you can later apply zipWithIndex to add the rank (Value).

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions._

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val t = Seq((111,111,123), (111,111,111), (111,123,789), (111,268,321), (222,222,654), (222,222,222), (222,721,127), (333,333,398), (333,333,333), (333,333,598))
    val ddd = sc.parallelize(t).toDF
    val zip = udf((xs: Seq[Int], ys: Seq[Int]) => xs zip ys)
    val dd1 = ddd
      .groupBy($"_1")
      .agg(sort_array(collect_list($"_2")).as("v"),
           sort_array(collect_list($"_3")).as("w"))
      .orderBy(asc("_1"))
      .withColumn("vars", explode(zip($"v", $"w")))
      .select($"_1", $"vars._1", $"vars._2").rdd.zipWithIndex

    dd1.collect

Output

    res24: Array[(org.apache.spark.sql.Row, Long)] = Array(([111,111,111],0), ([111,111,123],1), ([111,123,321],2),
    ([111,268,789],3), ([222,222,127],4), ([222,222,222],5), ([222,721,654],6), ([333,333,333],7), ([333,333,398],8), ([333,333,598],9))
Sumeet Sharma
  • This seems like a very feasible solution. Thank you for the code! I will definitely give this a try but just a little concerned with 'collect as list' since my data is around ~20 million rows but nevertheless it should be good. It would be great if you can provide a Java equivalent of the same Scala code since I am a complete newbie to Scala. Thanks again! – Sridher Dec 27 '17 at 20:11
  • @Sridher I think you can easily convert it to Java code. The main thing to note here is the Spark logic, which you can replicate in Java. – Sumeet Sharma Dec 30 '17 at 11:32
  • I converted most of the code to its java equivalent, but was facing some issue with regards to Spark UDF that you have used. Could you please help me out on the UDF? Kindly refer to my edited post for code snippet. – Sridher Jan 02 '18 at 01:20
  • https://stackoverflow.com/questions/39735864/spark-java-udf-returning-multiple-columns this might help – Sumeet Sharma Jan 03 '18 at 06:34
  • @Sridher were you able to do it? – Sumeet Sharma Jan 10 '18 at 10:46
  • Thank you for the references & I was able to convert the Scala code to its Java equivalent. However, the result was not as I expected: the ordering remains good for a single tree of hierarchical data, but my use case is a collection of trees, and the incremental ID for the next tree should start from the previous one as described in the graphical representation. Ultimately, I had to write a customized implementation of the DFS algorithm using a pure Java approach and then convert that collection back to a Dataset in Spark. – Sridher Jan 10 '18 at 21:10
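For anyone stuck on the same step as the comments above: in Spark's Java API, the Scala `zip` UDF would typically be wrapped in an `org.apache.spark.sql.api.java.UDF2` and registered with an array-of-struct return type. The plain-Java core of that UDF is just a pairwise zip of the two sorted arrays. Below is a sketch of that core logic, written Spark-free so it runs standalone (the `ZipColumns` class name is made up for illustration):

```java
import java.util.*;

public class ZipColumns {

    // Pairwise zip of two lists: the same logic the Scala `zip` UDF applies
    // to the two sorted collect_list arrays in the answer above.
    public static List<int[]> zip(List<Integer> xs, List<Integer> ys) {
        int n = Math.min(xs.size(), ys.size());
        List<int[]> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            out.add(new int[]{xs.get(i), ys.get(i)});
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> pairs = zip(Arrays.asList(111, 111, 123), Arrays.asList(111, 123, 789));
        for (int[] p : pairs) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}
```

Inside a UDF2, `xs` and `ys` would arrive as Spark `WrappedArray`/`List` values and the pairs would be returned as an array of structs; the looping logic stays the same.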