2

I am having two pretty large schema identical tables in MSSQL (100 million+ of records) and I need to create a SP that returns a unioned result set with a query such as the one defined below. The result can in some cases be just a few records and in other cases many millions depending on the input to the SP.

I also need to sort it on the "S"-value which is a Guid-like value i.e. non sequential.

When loading from one table alone (i.e. without UNION ALL) the sort can take advantage of the index and sort efficiently, but how is that affected by a UNION ALL?

The client application consuming the values want to have a stream of data i.e. we don't want to wait for all records to be read/loaded to be able to return the first set of "S"-matches.

-- Creates #distinctMatches temporary table etc. above..

CREATE CLUSTERED INDEX idx ON #distinctMatches (s)  

SELECT 
  'C' AS [source]
  ,P.[Id]
  ,P.[A]
  ,P.[B]
  ,P.[C]
  ,P.[D]
  ,P.[E]
  ,C.[S] AS [sortValue]
FROM 
    [dbo].[data_current] AS P
    INNER JOIN #distinctMatches AS C ON P.[s] = C.[s]

UNION ALL   

SELECT 
  'A' AS [source]
  ,P.[Id]
  ,P.[A]
  ,P.[B]
  ,P.[C]
  ,P.[D]
  ,P.[E]
  ,C.[S] AS [sortValue]
FROM 
    [dbo].[data_archive] AS P
    INNER JOIN #distinctMatches AS C ON P.[s] = C.[s]
ORDER BY [sortValue]

I have not been able to verify if this order by operation can take advantage of the index or not?

I have tried to read the query plans but fail to interpret them around this. Any ideas or alternative suggestions?

This is a legacy application so big schema/data changes or other major redesigns of the architecture is not possible at this point.

UPDATE based information from the answer below:

By changing

FROM 
    [dbo].[data_archive] AS P
    INNER JOIN #distinctMatches AS C ON P.[s] = C.[s]
ORDER BY [sortValue]

to

FROM 
    [dbo].[data_archive] AS P
WHERE EXISTS (SELECT C.[s] FROM #distinctMatches AS C WHERE P.[s] = C.[s])
ORDER BY [sortValue]

I was able to get a merge join instead of a concatenation. (See answer below).

jmw
  • 2,864
  • 5
  • 27
  • 32
  • Note that the column names/tables are just example names of the scenario i.e. it is not called a, b, c etc. – jmw Jun 02 '17 at 06:52
  • Are you aware that you only order the second result set? You would have to do a subquery like SELECT * FROM (Select... union Select...) order by ... In this case index will not be used. There are a few types of ordering algorithms in sql server, which can then be seen in the query plan. I suggest you should do some reading on those. – PacoDePaco Jun 02 '17 at 09:11
  • 1
    @PawełKucharski that is actually not correct, using UNION ALL the ORDER BY applies to the full set, not only to the last one. I will look into the ordering algorithms, thanks. – jmw Jun 02 '17 at 09:25
  • 1
    you are right, sorry. Also, the query execution plan is the same with a subquery and without it. I ran the union all query with an order by on clustered index column which resulted in a merge join in the order by operation, which is a very efficient algorithm. Hope it will work this way in your case as well. – PacoDePaco Jun 02 '17 at 09:54
  • why can't you sort the record in client apllication ?When there is one table involve then there is less record involve and optimizer uses index.Showing 100 million+ of records at one go is itself wrong.you have to apply paging mechanism. – KumarHarsh Jun 02 '17 at 10:43
  • 1
    A fundamental skill: learn to get and read [the execution plan](https://stackoverflow.com/q/7359702/562459). Every SQL dbms has one or more ways to do this. – Mike Sherrill 'Cat Recall' Jun 02 '17 at 10:56
  • @KumarHarsh because I don' want to load potentially millions of records in the client application to sort them, I need to get a stream of records that are read and then disposed. MikeSherrill'CatRecall' I agree with you completely, I do of course look at the execution plans, I just don't know how to read it regarding this aspect (the ordering/sort part of it). – jmw Jun 02 '17 at 20:54

1 Answers1

1

You said "When loading from one table alone (i.e. without UNION ALL) the sort can take advantage of the index and sort efficiently" so there is a PK on the table with leading key column S or at least a covering index with leading key column S, and this is true for both tables. This means all joins in your query are MERGE joins (the join to temporary table is on the sorting column as well), so there will not be any additional sort in the plan

sepupic
  • 8,409
  • 1
  • 9
  • 20
  • We have indexes that that contains the sort column (S) so as far as I understood it it should be enough to be able to start doing Key lookups as soon as the query optimizer knows that the first set of "S"-matches are found. Like described here: http://use-the-index-luke.com/sql/sorting-grouping/indexed-order-by I.e. the SORT operation is not even shown in the query plan, but I dont know how to achieve that using UNION ALL – jmw Jun 02 '17 at 21:45
  • No, it is not enough just to have sort colunm in index. It should be the leading column of the index, if it is not it will be sort operator in the plan – sepupic Jun 03 '17 at 02:28
  • You are right, that's what I actually meant, thanks for clairifying. – jmw Jun 03 '17 at 06:01