1

The query below is taking about a minute. I believe the poor performance is caused by the two "IN (SELECT..." clauses. I have a table of terms where one may be connected to another via the term_relationship table. These relationships can be recursive, e.g. dog is a type of mammal, mammal is a type of animal. This recursion could be any depth but probably not more than ~10 levels. I am trying to select all terms which have a (potentially recursive) relationship with type A and have a (potentially recursive) relationship with type B. I think that replacing the two "IN (SELECT..." clauses with restrictions on the outer query would improve performance but cannot figure out how to do this using the CONNECT BY clauses. Can anyone help with this?

SELECT term_name
FROM term
WHERE term_id IN 
   (SELECT term_id 
    FROM term_relationship 
    START WITH related_term_id = 123
    CONNECT BY NOCYCLE PRIOR term_id = related_term_id)
AND term_id IN 
   (SELECT term_id 
    FROM term_relationship 
    START WITH related_term_id = 456
    CONNECT BY NOCYCLE PRIOR term_id = related_term_id)
Dan Newman
  • 43
  • 7

1 Answers1

2

Instead of doing the same CONNECT BY query twice with only differing start values how about you do it once by providing both start values to one instance of the subquery. This change will get you all the term_ids related to either of your starting values, however, you want only those term_ids related to both of your starting values. To get that, you then need to group the results by by term_id and limit to those results having a count of more than one:

SELECT term_name
  FROM term
 WHERE term_id IN 
    (SELECT term_id 
       FROM term_relationship 
      START WITH related_term_id in (123, 456)
    CONNECT BY NOCYCLE PRIOR term_id = related_term_id
      group by term_id having count(*) >= 2)

Edit
With the above code I made an assumption about your data that may not be correct. I assumed a tree like structure where you were starting with nodes on a branch and traveling up towards the root like in diagram A, however, if your data looks like diagram B, then the above query will fail if you start at nodes 7 and 9 as node 7 has two paths back to node 1, and the above query would return node 1 twice, thereby misidentifying it as a common node.

A)   -(1)-                    B)   -(1)-
    /  |  \     (8)               /  |  \     (8)
  (2)  |  (3)    |              (2)  |  (3)    |
   |  (4)  |    (9)              |  (4)  |    (9)
  (5)     (6)                   (5)     (6)
           |                      \     /
          (7)                      -(7)-

The query below corrects for this and will correctly identify that for starting nodes 7 and 9 no nodes are in common, however, with starting nodes 7 and 4 node 1 is identified as a common node:

SELECT term_name
  FROM term
 WHERE term_id IN 
    (SELECT term_id 
     FROM term_relationship 
    START WITH related_term_id in (123, 456)
  CONNECT BY NOCYCLE PRIOR term_id = related_term_id
    group by term_id
   having count(distinct connect_by_root related_term_id) >= 2)
Sentinel
  • 6,379
  • 1
  • 18
  • 23
  • Thanks for the reply. The data looks like diagram B. Unfortunately providing both start values to one instance of the subquery decreases performance. I tested it with two real start values. The original query took 26 seconds and the query with a single subquery took 33 seconds. I think this is because the query optimizer is running the subquery that will remove the most rows before running the other subquery. – Dan Newman Feb 11 '19 at 10:28