
I am using Azure Data Factory V2 to transfer some CSV files from Azure Data Lake to Azure Synapse.

I have a loop to find all files in a specific folder on my Data Lake.

[screenshot]

After that, I have a Data Flow to transfer data from the staging table to the main table.

[screenshot]

In my for-each loop, I first clean my staging table with a stored procedure, and then I read the data from each CSV file (one by one). To transfer the data from the CSV file to my staging table I am using a Copy Data activity. I read all columns as varchar, and all columns in the staging table are varchar as well (there is no casting here).

[screenshot]

Each file has about 20 columns and about 216 rows.
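
For reference, my staging table looks roughly like the sketch below: all columns are varchar. (The lengths here are simplified, and only the columns used in the Data Flow query further down are shown.)

    -- Rough sketch of the all-varchar staging table (lengths simplified for illustration).
    -- Column names match the ones used in the Data Flow query below.
    CREATE TABLE [stg].[....e]
    (
        [MMSI]        VARCHAR(50),
        [IMO]         VARCHAR(50),
        [SHIP_ID]     VARCHAR(50),
        [LAT]         VARCHAR(50),
        [LON]         VARCHAR(50),
        [SPEED]       VARCHAR(50),
        [HEADING]     VARCHAR(50),
        [COURSE]      VARCHAR(50),
        [STATUS]      VARCHAR(50),
        [TIMESTAMP]   VARCHAR(50),
        [DSRC]        VARCHAR(50),
        [UTC_SECONDS] VARCHAR(50)
    );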

I am wondering why my pipeline takes so much time for just three files.

[screenshot]

This is my task for cleaning the staging table.

[screenshot]

This is my Synapse SQL pool scale and usage.

I ran my pipeline right after resuming my Synapse service; it is the only pipeline and the only service working with my Synapse instance.

[screenshot]

This is my stored procedure:

CREATE PROCEDURE [stg].[...._Truncate]
AS
    TRUNCATE TABLE [stg].[....e]
GO

This is my Data Flow:

[screenshots]

    SELECT
        CONVERT(INT,           S.[MMSI])        AS [MMSI],
        CONVERT(INT,           S.[IMO])         AS [IMO],
        CONVERT(INT,           S.[SHIP_ID])     AS [SHIP_ID],
        CONVERT(NUMERIC(8, 5), S.[LAT])         AS [LAT],
        CONVERT(NUMERIC(8, 5), S.[LON])         AS [LON],
        CONVERT(INT,           S.[SPEED])       AS [SPEED],
        CONVERT(INT,           S.[HEADING])     AS [HEADING],
        CONVERT(INT,           S.[COURSE])      AS [COURSE],
        CONVERT(INT,           S.[STATUS])      AS [STATUS],
        CONVERT(DATETIME,      S.[TIMESTAMP])   AS [TIMESTAMP],
        CONVERT(VARCHAR(5),    S.[DSRC])        AS [DSRC],
        CONVERT(INT,           S.[UTC_SECONDS]) AS [UTC_SECONDS],
        'M....._Simple' AS [ETL_CREATED_BY],
        GETUTCDATE()    AS [ETL_CREATED_DATE],
        CONVERT(BIGINT, REPLACE(CONVERT(VARCHAR, GETDATE(), 112), '/', '')
                      + REPLACE(CONVERT(VARCHAR, GETDATE(), 108), ':', '')) AS [ETL_PROCESS_ID]
    FROM [stg].[....e] AS S
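
(Side note: the [ETL_PROCESS_ID] expression just encodes the current date and time as a yyyymmddhhmmss number. Since style 112 already contains no separators, the first REPLACE is redundant; an equivalent standalone form is shown below.)

    -- Equivalent form of the ETL_PROCESS_ID expression: style 112 = yyyymmdd (no separators),
    -- style 108 = hh:mi:ss, so only the colons need stripping. Produces e.g. 20200502184500.
    SELECT CONVERT(BIGINT,
               CONVERT(VARCHAR(8), GETDATE(), 112)
             + REPLACE(CONVERT(VARCHAR(8), GETDATE(), 108), ':', '')
           ) AS [ETL_PROCESS_ID];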

These are my derived columns:

[screenshot]

This is the final mapping in my data flow:

[screenshot]

Should I do something here?

[screenshot]

Ardalan Shahgholi
    It looks like the Stored Procedure is your bottleneck, not the ADF pipeline. Can you share the code of the Stored Procedure? Also, the distribution/partition scheme of the Synapse table? – Joel Cochran Apr 30 '20 at 21:17
  • @JoelCochran I don’t see how you came to the conclusion it’s the sproc. Looks like the mapping data flow spin up time is the culprit. I assume the author has already set TTL on the integration runtime so that the mapping data flow compute can be reused in each loop? https://stackoverflow.com/a/57815081/5070440 – GregGalloway May 02 '20 at 01:28
  • I came to that conclusion because the original image he posted [prior to the current edit] showed the Stored procedure taking 13:41 inside a foreach loop that took 13:43. https://stackoverflow.com/posts/61532258/revisions - When you combine that with the DWU spike, I think the conclusion was reasonable. – Joel Cochran May 02 '20 at 13:38
  • @GregGalloway - based on the new image, you would appear to be correct as the first DF takes longer than the subsequent DF runs. – Joel Cochran May 02 '20 at 13:40
  • @JoelCochran I edited my post and put sp code. – Ardalan Shahgholi May 02 '20 at 18:45

1 Answer


I think the question is getting a bit muddled here, so I'm going to attempt an answer. Understand that there are a lot of potential factors and my response is not intended to be an exhaustive review of Data Flow performance techniques.

First, let me sum up the project as I understand it. For each file in an ADLS folder, it appears you have the following:

  1. Stored Procedure to truncate a Synapse staging table.
  2. Copy activity to copy data from ADLS to the Synapse staging table.
  3. Data Flow to read the data from the Synapse staging table, process it, and write it back to a different Synapse table.
  4. Execute another pipeline to archive the file.

From what I gather, it seems that this process is working; the question is about the execution time of the Data Flow.

GENERAL performance guidelines to consider:

  • Since you are running multiple DFs in succession, use a Custom Integration Runtime with ComputeOptimized type, 4 Cores, and Time To Live (TTL) greater than 0. [Not too long though, as you will pay for the environment to be active during the TTL time.] NOTE: last I was aware, DF requires the Region to be "Auto Resolve".

[screenshot]

  • Any time you WRITE to Synapse, make sure to define a Polybase Staging Storage Account:

[screenshot]

  • Be aware of cross-region operations: network latency can be a real killer AND costs you money. For fastest performance, the Storage, Data Factory, and Synapse resources should all be in the same data center.

  • Source and Sink partitioning CAN help with very large data sets and complex scenarios, but is a fairly tricky topic and (most likely) would not help in your scenario.

SPECIFIC to your posted scenario, I would consider redesigning the workflow. From a high level, you are doing the following for every (small) file:

  1. Clearing a Synapse table
  2. Writing from blob to Synapse
  3. Reading the data you just wrote from Synapse
  4. Writing the data back to Synapse [after some minimal processing].

My personal rule of thumb is to not use Data Flow to cross the Synapse boundary: if the operation is from Synapse to the same Synapse, then run the logic in Synapse. In other words, since the Source and Sink are both in Synapse, I would replace the Data Flow with a Stored Procedure.
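
For illustration, such a stored procedure might look something like the sketch below, reusing the SELECT from your question. The procedure name and the target table name/column list are placeholders, since the real names are redacted in the post.

    -- Hypothetical sketch: replace the Data Flow with a Synapse stored procedure that
    -- converts and loads the staging rows in a single INSERT...SELECT.
    -- [stg].[Load_MainTable] and [dbo].[MainTable] are made-up placeholder names.
    CREATE PROCEDURE [stg].[Load_MainTable]
    AS
    BEGIN
        INSERT INTO [dbo].[MainTable]
            ([MMSI], [IMO], [SHIP_ID], [LAT], [LON], [SPEED], [HEADING], [COURSE],
             [STATUS], [TIMESTAMP], [DSRC], [UTC_SECONDS],
             [ETL_CREATED_BY], [ETL_CREATED_DATE], [ETL_PROCESS_ID])
        SELECT
            CONVERT(INT,           S.[MMSI]),
            CONVERT(INT,           S.[IMO]),
            CONVERT(INT,           S.[SHIP_ID]),
            CONVERT(NUMERIC(8, 5), S.[LAT]),
            CONVERT(NUMERIC(8, 5), S.[LON]),
            CONVERT(INT,           S.[SPEED]),
            CONVERT(INT,           S.[HEADING]),
            CONVERT(INT,           S.[COURSE]),
            CONVERT(INT,           S.[STATUS]),
            CONVERT(DATETIME,      S.[TIMESTAMP]),
            CONVERT(VARCHAR(5),    S.[DSRC]),
            CONVERT(INT,           S.[UTC_SECONDS]),
            'M....._Simple',
            GETUTCDATE(),
            CONVERT(BIGINT, CONVERT(VARCHAR(8), GETDATE(), 112)
                          + REPLACE(CONVERT(VARCHAR(8), GETDATE(), 108), ':', ''))
        FROM [stg].[....e] AS S;
    END
    GO

The ForEach loop would then call this procedure with a Stored Procedure activity instead of executing the Data Flow.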

Even better, I would consolidate the Copy activity and the Data Flow into a single Data Flow. The Data Flow can read from ADLS as a source, transform the data, and INSERT it into Synapse. This would remove the staging table from the process. The Data Flow can even archive the files after the operation (see the Data Flow Source options tab): [screenshot]

Finally, I would seek to limit the number of Data Flow executions. Is there a reason you have to process the files one by one? The Data Flow source could easily process the entire folder, even capturing the file name as a column if you have logic based on that value [see image above]. This would remove the need to run the Data Flow many times and could dramatically improve overall performance.

Joel Cochran