I 'think' that this problem is relating to the query optimization that Azure Data Lake Analytics does; but let's see...
I have 2 separate queries (TVFs) doing aggregations, and then a final Query to join the 2 together for final results. So ...
Table > Header Query
Table > Detail Query
Result = Header Query + Detail Query
To test the whole logic out, I run the minor queries separately with a filter, storing the results to file, and then use the hard files as sources for the final query; these are the total durations (minutes).
Header Query 1.4 (408 rows)
Detail Query 0.9 (3298 rows)
Final Query 0.9 (408 rows)
So I know that as a maximum, I can get my result in around 3.5 minutes. However, I don't really want to create new intermediary files. I want to use the TDFs directly to feed the final query.
With TDFs in the final query, the Job Graph gets to around 97% progress within about 1.5 minutes. But then, all hell breaks loose ! The last node is a Aggregate with 2,500 Vertices that says 16 minutes compute time. So my question ... WHY ??
Is this a case of me not understanding some fundamental concepts of how Azure works ?
So, can anyone explain what's going on? Any help appreciated.
Final Query:
@Header =
SELECT [CTNNumber],
[CTNCycleNo],
[SeqStart],
[SeqEnd],
[StartUTC],
[EndUTC],
[StartLoc],
[StartType],
[EndLoc],
[EndType],
[Start Step],
[Start Ctn Status],
[Start Fill Status],
[EndStep],
[End Ctn Status],
[End Fill Status]
FROM [Play].[getCycles3]
("") AS X;
@Detail =
SELECT [CTNNumber],
[SeqNo] AS [SeqNo],
[LocationType],
[LocationID],
[BizstepDescription],
[ContainerStatus],
[FillStatus],
[UTCTimeStampforEvent]
FROM [Play].[getRaw]
("") AS Z;
@result =
SELECT
H.[CTNNumber], H.[CTNCycleNo], H.[SeqStart], H.[SeqEnd]
,COUNT([D].[SeqNo]) AS [SeqCount]
//, COUNT(DISTINCT [LocationID]) AS [#Locations]
FROM
@Header AS [H]
INNER JOIN
@Detail AS [D]
ON
[H].[CTNNumber] == [D].[CTNNumber]
WHERE
[D].[SeqNo] >= [H].[SeqStart] AND
[D].[SeqNo] <= [H].[SeqEnd]
GROUP BY
H.[CTNNumber], H.[CTNCycleNo], H.[SeqStart], H.[SeqEnd]
;