Convert subselect to a join

Question

I seem to understand that Join is preferred to sub-select. I'm unable to see how to turn the 3 sub-selects to joins.

My sub-selects fetch the first row only

I'm perfectly willing to leave this alone if it is not offensive SQL.

This is my query, and yes, those really are the table and column names

select x1.*, x2.KTNR, x3.J6NQ
from 
    (select D0HONB as HONB, D0HHNB as HHNB, 
        (
            select DHHHNB 
            from ECDHREP 
            where DHAOEQ = D0ATEQ and DHJRCD = D0KNCD 
            order by DHEJDT desc 
            FETCH FIRST 1 ROW ONLY
        ) as STC_HHNB,
        (
            select FIQ9NB 
            from DCFIREP 
            where FIQ7NB = D0Q7NB 
              AND FIBAEQ = D0ATEQ 
              and FISQCD = D0KNCD 
              and FIGZSZ in ('POS', 'ACT', 'MAN', 'HLD') 
            order by FIYCNB desc 
            FETCH FIRST 1 ROW ONLY
        ) as BL_Q9NB,
        (
            select AAKPNR 
            from C1AACPP 
            where AACEEQ = D0ATEQ and AARCCE = D0KNCD and AARDCE = D0KOCD 
            order by AAHMDT desc, AANENO desc 
            FETCH FIRST 1 ROW ONLY 
        ) as NULL_KPNR  
        from ECD0REP
    ) as x1 
left outer join (
        select AAKPNR as null_kpnr, max(ABKTNR) as KTNR 
        from C1AACPP 
          left outer join C1ABCPP on AAKPNR = ABKPNR 
        group by AAKPNR
    ) as X2 on x1.NULL_KPNR = x2.null_KPNR 
left outer join (
        select ACKPNR as KPNR, count(*) as J6NQ 
        from C1ACCPP 
        WHERE ACJNDD = 'Y' 
        group by ACKPNR
    ) as X3 on x1.NULL_KPNR = x3.KPNR

I don't find it to be particularly "offensive". And I can't imagine you'd gain much in terms of performance by converting it to use joins vs. sub-selects. If you just want to experiment with a slightly different approach, you could declare and set variables in place of the sub-selects, then just reference them in the main query. I had to look for the syntax to do that in db2/400 - see example [here](https://stackoverflow.com/questions/6216717/declare-a-variable-in-db2-sql?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa). — dgg, Mar 29 '18 at 15:07

Charles · Accepted Answer · 2018-03-29T23:58:53.747

You've got a combination of correlated subselects and nested table expressions (NTE).

Personally, I'd call it offensive if I had to maintain it. ;)

Consider common table expressions & joins...without your data and tabvle structure, I can't give you the real statement, but the general form would look like

with 
   STC_HHNB as (
     select DHHHNB, DHAOEQ, DHJRCD, DHEJDT
      from ECDHREP )
,  BL_Q9NB as ( <....>
               where FIGZSZ in ('POS', 'ACT', 'MAN', 'HLD'))
<...>
select <...>
from stc_hhb
     join blq9nb on <...>

Two important reasons to favor CTE over NTE...the results of a CTE can be reused Also it's easy to build a statement with CTE's incrementally.

By re-used, I mean you can have

with 
  cte1 as (<...>)
  , cte2 as (select <...> from cte1 join <...>)
  , cte3 as (select <...> from cte1 join <...>)
  , cte4 as (select <...> from cte2 join cte3 on <...>)
select * from cte4;

The optimizer can choose to build a temporary results set for cte1 and use it multiple times. From a building standpoint, you can see I'm builing on each preceding cte.

Here's a good article https://www.mcpressonline.com/programming/sql/simplify-sql-qwithq-common-table-expressions

Edit

Let's dig into your first correlated sub-query.

select D0HONB as HONB, D0HHNB as HHNB, 
      (
          select DHHHNB 
          from ECDHREP 
          where DHAOEQ = D0ATEQ and DHJRCD = D0KNCD 
          order by DHEJDT desc 
          FETCH FIRST 1 ROW ONLY
        ) as STC_HHNB
from ECD0REP

What you asking the DB to do is for every row read in ECD0REP, go out and get a row from ECDHREP. If you're unlucky, the DB will have to read lots of records in ECDHREP to find that one row. Generally, consider that with correlated sub-query the inner query would need to read every row. So if there's M rows in the outer and N rows in the inner...then you're looking at MxN rows being read.

I've seen this before, especially on the IBM i. As that's how an RPG developer would do it

read ECD0REP;
dow not %eof(ECD0REP);
  //need to get DHHHNB from ECDHREP 
  chain (D0ATEQ, D0KNCD) ECDHREP;
  STC_HHNB = DHHHNB;
  read ECD0REP;
enddo;

But that's not the way to do it in SQL. SQL is (supposed to be) set based.

So what you need to do is think of how to select the set of records out of ECDHREP that will match up to the set of record you want from ECD0REP.

with cte1 as (
  select DHHHNB, DHAOEQ, DHJRCD
  from ECDHREP
)
select D0HONB as HONB
     , D0HHNB as HHNB
     , DHHHBN as STC_HHNB
from ECD0REP join cte1 
      on DHAOEQ = D0ATEQ and DHJRCD = D0KNCD

Now maybe that's not quite correct. Perhaps there's multiple rows in ECDHREP with the same values (DHAOEQ, DHJRCD); thus you needed the FETCH FIRST in your correlated sub-query. Fine you can focus on the CTE and figure out what needs to be done to get that 1 row you want. Perhaps MAX(DHHHNB) or MIN(DHHHNB) would work. If nothing else, you could use ROW_NUMBER() to pick out just one row...

with cte1 as (
  select DHHHNB, DHAOEQ, DHJRCD
         , row_number() over(partition by DHAOEQ, DHJRCD
                             order by DHAOEQ, DHJRCD)
            as rowNbr  
  from ECDHREP
), cte2 as (
   select DHHHNB, DHAOEQ, DHJRCD
     from cte1
    where rowNbr = 1
)
select D0HONB as HONB
     , D0HHNB as HHNB
     , DHHHBN as STC_HHNB
from ECD0REP join cte2
      on DHAOEQ = D0ATEQ and DHJRCD = D0KNCD

Now you're dealing with sets of records, joining them together for your final results.

Worse case, the DB has to read M + N records.

It's not really about performance, it's about thinking in sets.

Sure with a simple statement using a correlated sub-query, the optimizer will probably be able to re-write it into a join.

But it's best to write the best code you can, rather then hope the optimizer can correct it.

I've seen and rewritten queries with 100's of correlated & regular sub-queries....in fact I've seen a query that had to be broken into 2 because there were two many sub-queries. The DB has a limit of 256 per statement.

My offensive query has been running slowly. I was thinking on these lines somehow and the "with" syntax being new to me, I can now see how. You example illustrates and solidifies what and how. Thank you eh... you will just have to suck up your offense at my correlated sub-queries... I ain't apologizing. :-D — Christopher Smith, Mar 30 '18 at 06:24
lol...so's Mark...so am I for that matter. But don't get me wrong. I still use RPG as needed. It's just set based SQL is often a better choice for data access. — Charles, Mar 30 '18 at 14:54
Just be careful, it (and `FETCH FIRST`) often becomes a band aid to cover data issue or table design issues. Long term better to fix the problem. If `(DHAOEQ, DHJRCD)` is unique except for a few bad rows. Fix the rows and add a constraint. Alternately, normalize the data, pull `(DHAOEQ, DHJRCD)` and other duplicated columns into their own table. — Charles, Mar 30 '18 at 15:16
Dude... This application has over 1000 tables and 20,000 RPG programs. It is a giant ball of mud. It is over 25 years old and belongs to a client . There are a plethora data issue or table design issues. And my team is converting it to a Java Web application with JDBC SQL replacing RPG IO. Thank you for the good thoughts. — Christopher Smith, Mar 30 '18 at 19:32
I finished the Query, now I have to use it to make a view I'm not sure if it is any less offensive or funs much better — Christopher Smith, Apr 04 '18 at 16:11
@ChristopherSmith do you really need every field from BKA9REP? And why do you need them all twice? BOOK_EQUIPMENT_FOR_LOG & BOOK_EQUIPMENT_FOR_BID? You can reference the same table (or CTE) twice in an SQL statement.. — Charles, Apr 04 '18 at 17:30
No, I don't need every field. I was just too fed up and lazy to cut them down. I used it twice to get a easier to read, friendly name so I could keep track each one. 2E RPG naming of tables, views and columns is tiring. I still have not figured out how to create the view. It keeps complaining — Christopher Smith, Apr 04 '18 at 20:45
here is the final query that creates a view https://pastebin.com/iFTPGBtQ . In my app, it selects about 100 rows from the view in about 16.6 seconds. The old query took about 6 seconds. any thoughts about why? — Christopher Smith, Apr 06 '18 at 18:01
@ChristopherSmith Run both through Visual Explain to see the plans...that should give you an idea. — Charles, Apr 06 '18 at 19:33

jmarkmurphy · Answer 2 · 2018-03-30T12:05:45.047

I'm going to have to differ with Charles here if the FETCH FIRST 1 ROW ONLY clauses are necessary. In this case you likely can't pull those sub-selects out into a CTE because that CTE would only have a single row in it. I suspect you could pull the outer sub-select into a CTE, but you would still need the sub-selects in the CTE. Since there appears to be no sharing, I would call this personal preference. BTW, I don't think pulling the sub-selects into a join will work for you either, in this case, for the same reason.

What is the difference between a sub-select and a CTE?

with mycte as (
  select field1, field2 
  from mytable
  where somecondition = true)
select * 
from mycte

vs.

select * 
from (select field1, field2 
      from mytable
      where somecondition = true) a

It's really just a personal preference, though depending on the specific requirements, a CTE can be used multiple times within the SQL statement, but a sub-select will be more correct in other cases like the FETCT FIRST clause in your question.

EDIT
Let's look at the first sub-query. With the appropriate index:

(
    select DHHHNB 
    from ECDHREP 
    where DHAOEQ = D0ATEQ and DHJRCD = D0KNCD 
    order by DHEJDT desc 
    FETCH FIRST 1 ROW ONLY
) as STC_HHNB,

only has to read one record per row in the output set. I don't think that is terribly onerous. This is the same for the third correlated sub-query as well.

That index on the first correlated sub-query would be:

create index ECDHREP_X1 
  on ECDHREP (DHAOEQ, DHJRCD, DHEJDT);

The second correlated sub-query might need more than one read per row, just because of the IN predicate, but it is far from needing a full table scan.

I've seen a lot of `FETCH FIRST 1 ROW` used when not really needed. especially in correlated sub-queries. And there's a potentially huge performance hit with a correlated sub-queries. If the DB can't rewrite it, your asking the DB to read another table for every row in the top-level table.... — Charles, Mar 29 '18 at 20:39
Yes there might need to be a lot of individual reads, but it shouldn't need to be new tables. — jmarkmurphy, Mar 29 '18 at 21:56
yes it's the same table, but could still be lots of disk I/O. Assuming an index exists and the row in question can be returned with a single I/O each row may require that I/O. Unless you get lucky and multiple rows are in the same block. In a correlated sub-select without the `FETCH FIRST` and without an index, you're conceivably looking at a full table scan of the inner table for each row in the outer table. — Charles, Mar 29 '18 at 22:02
Deal with performance later. First make it work. Maybe the `fetch first` isn't necessary. Maybe it is. — jmarkmurphy, Mar 29 '18 at 22:08
usually yes...but in this case no. Step back from record by record processing and try to think in sets. — Charles, Mar 29 '18 at 22:10

Convert subselect to a join

2 Answers2