Joining two tables with different sizes takes hours to execute

Question

Here is the scenario: we are using Azure Synapse analytics with 100 DTU, we have several tables. All tables are round robin distributed and clustered columnstore because data is loaded once in a day and we want it to be quick. All these tables are big having 50+ columns and millions of records (around 30m) There is one load_table table which contains information about last load of all the tables.

SQL Query1:

SELECT load_time FROM load_table WHERE table = 't2';

when I execute this query on table, this gives me one timestamp as output for that table. format is like this "YYYY-MM-DD TT". This query normally runs fine within 5-10 secs.

SQL Query2:

SELECT top 100 * FROM t2;

t2 is big fact table containing millions of records having 50+ cols. When I run this query it takes 2-3 mins.

Problem occurs when I join these two tables and it takes 2-3 hrs to execute. I want the result like loadtime as first column and all columns from second fact table.

joining is like below


SELECT t1.load_time, t2.* FROM
(SELECT load_time FROM load_table WHERE table = 't2') as t1 INNER JOIN t2
on 1 = 1;

Output look like below.

How can I achieve this result another way?

I am expecting below results

I have checked statistics of fact table those are updated I have checked skew and it is fine

why do use condition 1=1 ?why do you load whole data?can you reduce rows of load data and columns of t2?why do you use t2.col1,col2,col3,col4,col5,col7,col8 instead of t2.*?(if you only need 8 cols) — abolfazl sadeghi, May 07 '23 at 16:47
Why not put load time into variable and select that together with your big table. — siggemannen, May 07 '23 at 16:55
For query performance questions, we need at a minimum: the table *and index* definitions, and please share the query plan via https://brentozar.com/pastetheplan. The question is not answerable otherwise. — Charlieface, May 07 '23 at 16:57
Since it's small, LOAD_TABLE should be REPLICATED not ROUND_ROBIN. See Replicated Table Guidance: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/design-guidance-for-replicated-tables — David Browne - Microsoft, May 07 '23 at 19:21
@abolfazlsadeghi i used condition 1=1 because i want all the data. also it is just a view. and enduser frequently does top 10, top 100 even that also takes hours to execute. — veer kumar rathore, May 08 '23 at 06:41
@abolfazlsadeghi yes i am already doing t2.*, its me who did not put the query properly in question. and i also want all cols — veer kumar rathore, May 08 '23 at 06:50
@siggemannen i will definitely try this and let you know the results — veer kumar rathore, May 08 '23 at 06:51
@Charlieface table is clustered columnstore indexed. sure i will share the query plan. — veer kumar rathore, May 08 '23 at 06:53
@DavidBrowne-Microsoft yes it a small table. i will check the distribution of this table if it is ROUND ROBIN then i will change it to RELICATED and give it a try. — veer kumar rathore, May 08 '23 at 06:56
is there any way to check the current Data Warehouse units in Synapse without connecting to master? (as i don't have access to master). i have checked the tempdb allocation and found pdw_node_id and there are 7 pdw node id's, does this mean size is DW3000c ? if so then why not 6 ids are showing. and also there is one pdw node 7th having more tempdb allocated than others — veer kumar rathore, May 09 '23 at 17:39

score 1 · Answer 1 · answered May 07 '23 at 18:42

1

your join-condition is a bit...special. If your load_table only returns one value per table ('t2'), then you can just try this:

SELECT TOP 100 
t1.load_time,
t2.[fact.col1],[fact.col2],...,[fact.col8]
FROM load_table as t1 
CROSS APPLY t2 
WHERE t1.table='t2'

and if you use a stored procedure, you can first save the value to a variable:

DECLARE @TableTimestamp datetime;
SELECT TOP 1 @TableTimestamp = load_time from load_table
;
SELECT TOP 100
@TableTimeStamp as LoadTime
,t2.[fact.col1],[fact.col2],...,[fact.col8]
FROM t2

answered May 07 '23 at 18:42

devale

41
4

thank you for answer. i will implement your answer and let you know the resutls. – veer kumar rathore May 08 '23 at 06:44
i tried these two methods but results are still the same. query is still taking more than 2 hrs – veer kumar rathore May 08 '23 at 15:03

KumarHarsh · Answer 2 · 2023-05-08T15:38:46.857

when I execute this query on table, this gives me one timestamp as output for that table

when it return only one row and one column then why not take this in variable. Declare @load_time datetime

SELECT @load_time=load_time FROM load_table WHERE table = 't2';

Now,

SELECT @load_time as load_time, t2.* FROM t2 (NOLOCK)

i) DO not use * and mention all the column name which you need in your output.

ii) Use nolock only when there is no chance of READ UNCommitted or it do not matter.

iii) Why you will use million of data,why not limit your resultset by using Pagination or top clause.

iv) Of course , we need at a minimum: the table and index definition for further investigation.also share your real query

if you simply use

select * from t2

which has millions of rows it will always take time. Also, select top 100 * from t2 will also take time because it is without order by clause,sql optimizer is not sure to order on which column to get those 100 rows.

So you have to share table scheme along with data type.Also if there is any index on table or unique key define.

Then accordingly we can do Paging to get only selected rows.

i did what you suggested, but query still takes huge time – veer kumar rathore May 08 '23 at 15:04 — veer kumar rathore, May 08 '23 at 15:04
@veerkumarrathore,Read my edited carefully – KumarHarsh May 08 '23 at 15:39 — KumarHarsh, May 08 '23 at 15:39

Joining two tables with different sizes takes hours to execute

2 Answers2