
I have two tables. Table t1 defines the metadata, i.e. the attribute values an ideal transaction should contain. It also defines the order of importance of the attributes by the order of the records in the array: the first record is the most important and has a weight of 1, the second 0.9, the third 0.8, the fourth 0.7, and so on; anything beyond the tenth is of least importance. I need to measure the quality of the data filled into the transaction table t2: find the percentage of attributes filled, and their quality rank.

t1

 ------------------------------------
|  a_id  |    attribute_values      |    
------------------------------------
|  12345 | ["a1", "a2", "a3", "a5"] |
|  6789  | ["b1", "b4", "b7"]       |
 ------------------------------------
 

t2

------------------------------------
| b_id  |  a_id   | attribute_values|
------------------------------------
| B123  |  12345  | ["a2", "a5"]    |
| B456  |  6789   | ["b1", "b7"]    |
-------------------------------------

I am looking for way to calculate the quality rank for my t2 records as below

------------------------------------------
| b_id | percent_complete | quality_rank |
------------------------------------------
| B123 |    50            |    0.4       |
| B456 |   66.66          |    0.6       |
------------------------------------------ 

B123 - 2 out of 4 filled, 50% complete; quality rank = (0.9 + 0.7) / 4 = 0.4

B456 - 2 out of 3 filled, 66.66% complete; quality rank = (1 + 0.8) / 3 = 0.6
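The scoring above can be sketched in plain Scala, independent of any SQL engine. Flooring the weight at 0 for positions beyond the tenth is my assumption from the "anything above 10" rule; the names `QualityRank`, `weight`, and `score` are illustrative:

```scala
object QualityRank {
  // Weight for the attribute at 0-based position `pos`:
  // 1.0, 0.9, 0.8, ... floored at 0.0 past the tenth position
  // (assumption, from "anything above 10 is of least importance").
  def weight(pos: Int): Double = math.max(1.0 - 0.1 * pos, 0.0)

  // Returns (percent_complete, quality_rank) for one t2 record, given
  // the ideal attribute list from t1 and the filled list from t2.
  def score(ideal: Seq[String], filled: Seq[String]): (Double, Double) = {
    val filledSet = filled.toSet
    // Weights of the ideal attributes that the transaction actually has.
    val matchedWeights =
      ideal.zipWithIndex.collect { case (a, i) if filledSet(a) => weight(i) }
    val pct  = matchedWeights.size * 100.0 / ideal.size
    val rank = matchedWeights.sum / ideal.size
    (math.round(pct * 100) / 100.0, math.round(rank * 100) / 100.0)
  }
}

// B123: 2 of 4 filled -> (50.0, 0.4); B456: 2 of 3 -> (66.67, 0.6)
println(QualityRank.score(Seq("a1", "a2", "a3", "a5"), Seq("a2", "a5")))
println(QualityRank.score(Seq("b1", "b4", "b7"), Seq("b1", "b7")))
```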

Mathan

1 Answer


Solved it by exploding both tables: I calculated the weight and the record count for the first table, then joined it with the exploded second table. I was not able to do it in a single SQL statement, though.

scala> val t1 = Seq((12345, List("a1", "a2", "a3", "a5")), (6789, List("b1", "b5", "b7"))).toDF("a_id", "attribute_values")

scala> val t2 = Seq(("B123", 12345, List("a2", "a5")), ("B456", 6789, List("b1", "b7"))).toDF("b_id","a_id", "attribute_values")

scala> val t1_1 = t1.select($"a_id", posexplode($"attribute_values"))

scala> t1_1.show

+-----+---+---+
| a_id|pos|col|
+-----+---+---+
|12345|  0| a1|
|12345|  1| a2|
|12345|  2| a3|
|12345|  3| a5|
| 6789|  0| b1|
| 6789|  1| b5|
| 6789|  2| b7|
+-----+---+---+

scala> t1_1.createOrReplaceTempView("tab_t1_1")

scala> spark.sql("select *, 1 - (pos * 0.1) as calc_weight, count(col) over (partition by a_id) as rec_count from tab_t1_1").show

+-----+---+---+-----------+---------+
| a_id|pos|col|calc_weight|rec_count|
+-----+---+---+-----------+---------+
| 6789|  0| b1|        1.0|        3|
| 6789|  1| b5|        0.9|        3|
| 6789|  2| b7|        0.8|        3|
|12345|  0| a1|        1.0|        4|
|12345|  1| a2|        0.9|        4|
|12345|  2| a3|        0.8|        4|
|12345|  3| a5|        0.7|        4|
+-----+---+---+-----------+---------+

scala> val t1_2 = spark.sql("select *, 1 - (pos * 0.1) as calc_weight, count(col) over (partition by a_id) as rec_count from tab_t1_1")

scala> t1_2.createOrReplaceTempView("tab_t1_2")

scala> val t2_1 = t2.select($"b_id", $"a_id", explode($"attribute_values"))

scala> t2_1.show

+----+-----+---+
|b_id| a_id|col|
+----+-----+---+
|B123|12345| a2|
|B123|12345| a5|
|B456| 6789| b1|
|B456| 6789| b7|
+----+-----+---+

scala> t2_1.createOrReplaceTempView("tab_t2_1")

scala> spark.sql("Select b_id, t1.a_id, round(count(t2.col)*100/max(t1.rec_count),2) as percent_complete, round(sum(t1.calc_weight)/ max(t1.rec_count),2) as quality_rank from tab_t1_2 t1, tab_t2_1 t2 where t1.a_id = t2.a_id and t1.col = t2.col group by b_id, t1.a_id").show

+----+-----+----------------+------------+
|b_id| a_id|percent_complete|quality_rank|
+----+-----+----------------+------------+
|B123|12345|            50.0|        0.40|
|B456| 6789|           66.67|        0.60|
+----+-----+----------------+------------+
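For what it's worth, the two intermediate views can be folded into one statement with subqueries. This is an untested sketch, assuming the original DataFrames are themselves registered as `tab_t1` and `tab_t2` (e.g. via `t1.createOrReplaceTempView("tab_t1")`):

```sql
SELECT e.b_id,
       w.a_id,
       round(count(e.col) * 100 / max(w.rec_count), 2) AS percent_complete,
       round(sum(w.calc_weight) / max(w.rec_count), 2) AS quality_rank
FROM (SELECT a_id, pos, col,
             1 - (pos * 0.1) AS calc_weight,
             count(col) OVER (PARTITION BY a_id) AS rec_count
      FROM (SELECT a_id, posexplode(attribute_values) FROM tab_t1) x) w
JOIN (SELECT b_id, a_id, explode(attribute_values) AS col
      FROM tab_t2) e
  ON w.a_id = e.a_id AND w.col = e.col
GROUP BY e.b_id, w.a_id
```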