
I 'think' this problem is related to the query optimization that Azure Data Lake Analytics performs; but let's see...

I have 2 separate queries (TVFs) doing aggregations, and then a final Query to join the 2 together for final results. So ...

Table >  Header Query
Table >  Detail Query
Result = Header Query + Detail Query

To test the whole logic out, I ran the smaller queries separately with a filter, stored the results to files, and then used those files as sources for the final query. These are the total durations (minutes):

Header Query  1.4  (408 rows)
Detail Query  0.9  (3298 rows)
Final Query   0.9  (408 rows)

So I know that, at worst, I can get my result in around 3.5 minutes. However, I don't really want to create new intermediary files. I want to use the TVFs directly to feed the final query.
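For context, the file-based test looked roughly like this sketch (the paths, column types, and the outputHeader/skipFirstNRows options are illustrative, not my exact script):

```sql
// Stage 1: run each TVF separately and persist the result to a file.
@HeaderStage = SELECT * FROM [Play].[getCycles3]("") AS X;

OUTPUT @HeaderStage
TO "/Staging/Header.tsv"
USING Outputters.Tsv(outputHeader : true);

// Stage 2 (separate job): read the persisted file back in as the source.
// Column names come from the query above; the types here are assumed.
@Header =
    EXTRACT [CTNNumber] string,
            [CTNCycleNo] int,
            [SeqStart] int,
            [SeqEnd] int
    FROM "/Staging/Header.tsv"
    USING Extractors.Tsv(skipFirstNRows : 1);
```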

With the TVFs in the final query, the Job Graph gets to around 97% progress within about 1.5 minutes. But then all hell breaks loose! The last node is an Aggregate with 2,500 vertices that shows 16 minutes of compute time. So my question... WHY??

Is this a case of me not understanding some fundamental concept of how Azure works?

So, can anyone explain what's going on? Any help appreciated.

Final Query:

@Header =
SELECT [CTNNumber],
       [CTNCycleNo],
       [SeqStart],
       [SeqEnd],
       [StartUTC],
       [EndUTC],
       [StartLoc],
       [StartType],
       [EndLoc],
       [EndType],
       [Start Step],
       [Start Ctn Status],
       [Start Fill Status],
       [EndStep],
       [End Ctn Status],
       [End Fill Status]
FROM [Play].[getCycles3]
     ("") AS X;


@Detail =
SELECT [CTNNumber],
       [SeqNo] AS [SeqNo],
       [LocationType],
       [LocationID],
       [BizstepDescription],
       [ContainerStatus],
       [FillStatus],
       [UTCTimeStampforEvent]
FROM [Play].[getRaw]
     ("") AS Z;

@result =
    SELECT
        H.[CTNNumber], H.[CTNCycleNo], H.[SeqStart], H.[SeqEnd]
        ,COUNT([D].[SeqNo]) AS [SeqCount]
        //, COUNT(DISTINCT [LocationID]) AS [#Locations]
    FROM 
        @Header AS [H]
        INNER JOIN
        @Detail AS [D]
        ON 
        [H].[CTNNumber] == [D].[CTNNumber] 
    WHERE 
        [D].[SeqNo] >= [H].[SeqStart] AND
        [D].[SeqNo] <= [H].[SeqEnd]  
    GROUP BY 
        H.[CTNNumber], H.[CTNCycleNo], H.[SeqStart], H.[SeqEnd]
    ;

[Screenshot: job graph showing the final Aggregate node]

SimonB
  • Pretty quiet on replies... which means I've either asked a dumb question or explained it all wrong. What can I do to gain some interest? – SimonB Sep 29 '17 at 10:18

2 Answers


My apologies for the late reply. I was travelling and this just slipped off the radar.

The problem is most likely that the optimizer produced a wrong estimate and decided to over-partition. If you read the data from a file, the optimizer has more precise information available at compile time.

Can you add an OPTION(ROWCOUNT=x) hint to the queries that are generating the input for the join to get a better plan?
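Applied to the two rowset assignments in the question, that would look roughly like this (a sketch; the row counts are taken from the staged runs described above):

```sql
// Hint the optimizer with the approximate cardinality of each rowset,
// so it doesn't over-partition the downstream join and aggregate.
@Header =
    SELECT [CTNNumber], [CTNCycleNo], [SeqStart], [SeqEnd] /* ... remaining columns ... */
    FROM [Play].[getCycles3]("") AS X
    OPTION(ROWCOUNT = 408);    // ~408 rows observed in the staged run

@Detail =
    SELECT [CTNNumber], [SeqNo] /* ... remaining columns ... */
    FROM [Play].[getRaw]("") AS Z
    OPTION(ROWCOUNT = 3298);   // ~3298 rows observed in the staged run
```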

Michael Rys

So I entered this one as a ticket with Microsoft. Here is their response, which I implemented successfully.

From: #######@microsoft.com
Subject: RE: ########### USQL Job displays bizarre job plan and runtime

So the issue here is cardinality. When the script gets broken into parts the U-SQL compiler has a new input to work with at each intermediate step, and it knows the actual data size and distribution of that new input, so the optimizer is able to accurately allocate resources.

However, when the script is put all into one, the optimizer's estimation is very different, because it doesn't know what the exact size and distribution of those intermediate steps are going to be. So we expect there to be some difference in estimations between a script that has been broken into parts and a script that is all in one.

Your best defense against optimization differences like these is to CREATE STATISTICS on table columns. In this case you should CREATE STATISTICS on the CTNNumber column and see if that improves performance on the single job script.

Here is some documentation surrounding CREATE STATISTICS: https://learn.microsoft.com/en-us/sql/t-sql/statements/create-statistics-transact-sql
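A sketch of what that looks like in U-SQL. The database/schema/table names are placeholders, since the actual managed tables behind the TVFs aren't shown here; substitute the tables your TVFs read from:

```sql
// Build statistics on the join key so the optimizer gets accurate
// cardinality estimates for the CTNNumber join in the single-script job.
CREATE STATISTICS IF NOT EXISTS [stat_CTNNumber]
ON [Play].[dbo].[Cycles] ([CTNNumber])   // hypothetical table name
WITH FULLSCAN;
```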

SimonB