
I am using Azure Data Factory V2 to transfer some CSV files from Azure Data Lake to Azure Synapse.

I have a loop to find all files in a specific folder on my Data Lake.

[screenshot]

After that, I have a Data Flow to transfer data from the staging table to the main table.

[screenshot]

In my for-each loop, I first clean my staging table with a stored procedure, and then I read the data from each CSV file (one by one). To transfer the data from the CSV file to my staging table I am using a Copy Data activity. I read all columns as varchar, and all columns in the staging table are varchar as well (there is no casting here).

[screenshot]

Each file has about 20 columns and about 216 rows.
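
For reference, my staging table looks roughly like the sketch below: all columns are varchar. (The lengths here are simplified, and only the columns used in the Data Flow query further down are shown.)

    -- Rough sketch of the all-varchar staging table (lengths simplified for illustration).
    -- Column names match the ones used in the Data Flow query below.
    CREATE TABLE [stg].[....e]
    (
        [MMSI]        VARCHAR(50),
        [IMO]         VARCHAR(50),
        [SHIP_ID]     VARCHAR(50),
        [LAT]         VARCHAR(50),
        [LON]         VARCHAR(50),
        [SPEED]       VARCHAR(50),
        [HEADING]     VARCHAR(50),
        [COURSE]      VARCHAR(50),
        [STATUS]      VARCHAR(50),
        [TIMESTAMP]   VARCHAR(50),
        [DSRC]        VARCHAR(50),
        [UTC_SECONDS] VARCHAR(50)
    );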

I am wondering why my pipeline takes so much time for just three files.

[screenshot]

This is my task for cleaning the staging table.

[screenshot]

This is my Synapse SQL pool scale and usage.

I ran my pipeline right after resuming my Synapse service; it is the only pipeline and the only service working with my Synapse instance.

[screenshot]

This is my stored procedure:

CREATE PROCEDURE [stg].[...._Truncate]
AS
    TRUNCATE TABLE [stg].[....e]
GO

This is my Data Flow:

[screenshots]

    SELECT
        CONVERT(INT,           S.[MMSI])        AS [MMSI],
        CONVERT(INT,           S.[IMO])         AS [IMO],
        CONVERT(INT,           S.[SHIP_ID])     AS [SHIP_ID],
        CONVERT(NUMERIC(8, 5), S.[LAT])         AS [LAT],
        CONVERT(NUMERIC(8, 5), S.[LON])         AS [LON],
        CONVERT(INT,           S.[SPEED])       AS [SPEED],
        CONVERT(INT,           S.[HEADING])     AS [HEADING],
        CONVERT(INT,           S.[COURSE])      AS [COURSE],
        CONVERT(INT,           S.[STATUS])      AS [STATUS],
        CONVERT(DATETIME,      S.[TIMESTAMP])   AS [TIMESTAMP],
        CONVERT(VARCHAR(5),    S.[DSRC])        AS [DSRC],
        CONVERT(INT,           S.[UTC_SECONDS]) AS [UTC_SECONDS],
        'M....._Simple' AS [ETL_CREATED_BY],
        GETUTCDATE()    AS [ETL_CREATED_DATE],
        CONVERT(BIGINT, REPLACE(CONVERT(VARCHAR, GETDATE(), 112), '/', '')
                      + REPLACE(CONVERT(VARCHAR, GETDATE(), 108), ':', '')) AS [ETL_PROCESS_ID]
    FROM [stg].[....e] AS S
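
(Side note: the [ETL_PROCESS_ID] expression just encodes the current date and time as a yyyymmddhhmmss number. Since style 112 already contains no separators, the first REPLACE is redundant; an equivalent standalone form is shown below.)

    -- Equivalent form of the ETL_PROCESS_ID expression: style 112 = yyyymmdd (no separators),
    -- style 108 = hh:mi:ss, so only the colons need stripping. Produces e.g. 20200502184500.
    SELECT CONVERT(BIGINT,
               CONVERT(VARCHAR(8), GETDATE(), 112)
             + REPLACE(CONVERT(VARCHAR(8), GETDATE(), 108), ':', '')
           ) AS [ETL_PROCESS_ID];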

These are my derived columns:

[screenshot]

This is the final mapping in my data flow:

[screenshot]

Should I do something here?

[screenshot]

Ardalan Shahgholi
    It looks like the Stored Procedure is your bottleneck, not the ADF pipeline. Can you share the code of the Stored Procedure? Also, the distribution/partition scheme of the Synapse table? – Joel Cochran Apr 30 '20 at 21:17
  • @JoelCochran I don’t see how you came to the conclusion it’s the sproc. Looks like the mapping data flow spin up time is the culprit. I assume the author has already set TTL on the integration runtime so that the mapping data flow compute can be reused in each loop? https://stackoverflow.com/a/57815081/5070440 – GregGalloway May 02 '20 at 01:28
  • I came to that conclusion because the original image he posted [prior to the current edit] showed the Stored procedure taking 13:41 inside a foreach loop that took 13:43. https://stackoverflow.com/posts/61532258/revisions - When you combine that with the DWU spike, I think the conclusion was reasonable. – Joel Cochran May 02 '20 at 13:38
  • @GregGalloway - based on the new image, you would appear to be correct as the first DF takes longer than the subsequent DF runs. – Joel Cochran May 02 '20 at 13:40
  • @JoelCochran I edited my post and put sp code. – Ardalan Shahgholi May 02 '20 at 18:45

1 Answer


I think the question is getting a bit muddled here, so I'm going to attempt an answer. Understand that there are a lot of potential factors and my response is not intended to be an exhaustive review of Data Flow performance techniques.

First, let me sum up the project as I understand it. For each file in an ADLS folder, it appears you have the following:

  1. Stored Procedure to truncate a Synapse staging table.
  2. Copy activity to copy data from ADLS to the Synapse staging table.
  3. Data Flow to read the data from the Synapse staging table, process it, and write it back to a different Synapse table.
  4. Execute another pipeline to archive the file.

From what I gather, it seems that this process is working; the question is about the execution time of the Data Flow.

GENERAL performance guidelines to consider:

  • Since you are running multiple DFs in succession, use a Custom Integration Runtime with ComputeOptimized type, 4 Cores, and Time To Live (TTL) greater than 0. [Not too long though, as you will pay for the environment to be active during the TTL time.] NOTE: last I was aware, DF requires the Region to be "Auto Resolve".

[screenshot]

  • Any time you WRITE to Synapse, make sure to define a Polybase Staging Storage Account:

[screenshot]

  • Be aware of cross-region operations: network latency can be a real killer AND costs you money. For fastest performance, the Storage, Data Factory, and Synapse resources should all be in the same data center.

  • Source and Sink partitioning CAN help with very large data sets and complex scenarios, but is a fairly tricky topic and (most likely) would not help in your scenario.

SPECIFIC to your posted scenario, I would consider redesigning the workflow. From a high level, you are doing the following for every (small) file:

  1. Clearing a Synapse table
  2. Writing from blob to Synapse
  3. Reading the data you just wrote from Synapse
  4. Writing the data back to Synapse [after some minimal processing].

My personal rule of thumb is to not use Data Flow to cross the Synapse boundary: if the operation is from Synapse to the same Synapse, then run the logic in Synapse. In other words, since the Source and Sink are both in Synapse, I would replace the Data Flow with a Stored Procedure.
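
For illustration, such a stored procedure might look something like the sketch below, reusing the SELECT from your question. The procedure name and the target table name/column list are placeholders, since the real names are redacted in the post.

    -- Hypothetical sketch: replace the Data Flow with a Synapse stored procedure that
    -- converts and loads the staging rows in a single INSERT...SELECT.
    -- [stg].[Load_MainTable] and [dbo].[MainTable] are made-up placeholder names.
    CREATE PROCEDURE [stg].[Load_MainTable]
    AS
    BEGIN
        INSERT INTO [dbo].[MainTable]
            ([MMSI], [IMO], [SHIP_ID], [LAT], [LON], [SPEED], [HEADING], [COURSE],
             [STATUS], [TIMESTAMP], [DSRC], [UTC_SECONDS],
             [ETL_CREATED_BY], [ETL_CREATED_DATE], [ETL_PROCESS_ID])
        SELECT
            CONVERT(INT,           S.[MMSI]),
            CONVERT(INT,           S.[IMO]),
            CONVERT(INT,           S.[SHIP_ID]),
            CONVERT(NUMERIC(8, 5), S.[LAT]),
            CONVERT(NUMERIC(8, 5), S.[LON]),
            CONVERT(INT,           S.[SPEED]),
            CONVERT(INT,           S.[HEADING]),
            CONVERT(INT,           S.[COURSE]),
            CONVERT(INT,           S.[STATUS]),
            CONVERT(DATETIME,      S.[TIMESTAMP]),
            CONVERT(VARCHAR(5),    S.[DSRC]),
            CONVERT(INT,           S.[UTC_SECONDS]),
            'M....._Simple',
            GETUTCDATE(),
            CONVERT(BIGINT, CONVERT(VARCHAR(8), GETDATE(), 112)
                          + REPLACE(CONVERT(VARCHAR(8), GETDATE(), 108), ':', ''))
        FROM [stg].[....e] AS S;
    END
    GO

The ForEach loop would then call this procedure with a Stored Procedure activity instead of executing the Data Flow.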

Even better, I would consolidate the Copy activity and the Data Flow into a single Data Flow. The Data Flow can read from ADLS as a source, transform the data, and INSERT it into Synapse. This would remove the staging table from the process. The Data Flow can even archive the files after the operation (see the Data Flow Source options tab): [screenshot]

Finally, I would seek to limit the number of Data Flow executions. Is there a reason you have to process the files one by one? The Data Flow source could easily process the entire folder, even capturing the file name as a column if you have logic based on that value [see image above]. This would remove the need to run the Data Flow many times and could dramatically improve overall performance.

Joel Cochran