0

I have a scenario find the lowest level child nodes from hierarchy table having parent_node_id and child_node_id as below. Source table is in Hive and Impala database. Please suggest hive/impala query to find out the lowest level child nodes for each parent node in source table.

I tried using CTE recursive query in Impala but I guess it is not supported.

Thank you in Advance!

Source Table:

+-------------+--------------+
|child_node_id|parent_node_id|
+-------------+--------------+
|     C1      |      P1      |
+-------------+--------------+
|     C2      |      P2      |   
+-------------+--------------+
|     C11     |      C1      |
+-------------+--------------+
|     C12     |      C11     |
+-------------+--------------+
|     123     |      C12     |
+-------------+--------------+

Expected Output:

+-------------+--------------+
|parent_node  |lowest_l_child|
+-------------+--------------+
|     P1      |      123     | 
+-------------+--------------+
|      P2     |       C2     |
+-------------+--------------+
|     C1      |      123     |
+-------------+--------------+
|     C11     |      123     |
+-------------+--------------+
|     C12     |      123     |
+-------------+--------------+
A Saraf
  • 315
  • 4
  • 20

1 Answers1

2

Since hive does not support recursive CTE query.

Please refer to [https://blog.pythian.com/recursion-in-hive/][1] for one option.

Other option would be to use shell script to loop and query to find lowest child for all the parents.

Steps - 1> Intialization (One time run)

create temporary table hier_temp as select * from Hier;
create table if not exists res as select * from hier where false;

2) query to find lowest level child

insert into table res 
select
H1.parent, H1.Child
from hier_temp H1 left outer join hier_temp H2
on H1.Child=H2.Parent where H2.Child is null;

3) Overwrite temp table with next level of child

insert overwrite table hier_temp 
select
H1.Parent Parent, coalesce(H3.child, H2.child) as child
from hier_temp H1 left outer join hier_temp H2 on H1.Child=H2.Parent
left outer join res H3 on H2.child=H3.parent
 where H2.Child is not null;

Create a shell script which will execute steps 2 and 3 in sequence in loop (Conditional statement with break and continue will do the job) till we don't have any data in hier_temp table.

Below are results from res and hier_temp table for the given test data.

hive> select * from res;
OK
Time taken: 0.131 seconds
hive> select * from hier_temp;
OK
C1      C11
C11     C12
C12     123
P1      C1
P2      C2
Time taken: 0.108 seconds, Fetched: 5 row(s)

Results after loop1 for queries mentioned in step 2 and 3

hive> select * from res;
OK
C12     123
P2      C2
Time taken: 0.137 seconds, Fetched: 2 row(s)
hive> select * from hier_temp;
OK
P1      C11
C1      123
C11     123
Time taken: 0.155 seconds, Fetched: 3 row(s)

Results after loop2 for queries mentioned in step 2 and 3

hive> select * from res;
OK
C12     123
P2      C2
C1      123
C11     123
Time taken: 0.11 seconds, Fetched: 4 row(s)
hive> select * from hier_temp;
OK
P1      123
Time taken: 0.111 seconds, Fetched: 1 row(s)

Results loop3 for queries mentioned in step 2 and 3

hive> select * from res;
OK
P1      123
C12     123
P2      C2
C1      123
C11     123
Time taken: 0.115 seconds, Fetched: 5 row(s)
hive> select * from hier_temp;
OK
Time taken: 0.16 seconds

This will get you the desired results, but you might have to consider time taken for execution.

Hope this helps

Vijiy
  • 1,187
  • 6
  • 21
  • That you for the response! I will try this solution. – A Saraf Feb 04 '19 at 14:51
  • In output we are getting C1 --> C12 but the child node for C12 is 123. I am expecting C1--> 123 in the output. Please help to correct the query. – A Saraf Feb 05 '19 at 17:40
  • please help in modifying this query to get the desired result. – A Saraf Feb 06 '19 at 23:29
  • Can you check now, I have added one more join, Better to check with some more examples to make sure that there are no issues with the query – Vijiy Feb 07 '19 at 04:24
  • This works on sample data. I will try to run this approach on actual data set. Thank you so much!! – A Saraf Feb 07 '19 at 15:34