4

I am trying to load data from my external tables into SQL DW internal tables. The data is stored in a compressed format in Blob Storage, and the external tables point to that Blob Storage location.

I have around 24 files, about 22 GB in total, and I am trying to load the data from the external table into an internal table at 300 DWU, using a service/user account in the largerc resource class.

My INSERT INTO statement (which is very straightforward) has been running for more than 10 hours:

insert into Trxdata.Details_data select * from Trxdata.Stage_External_Table_details_data;

I also tried the statement below, which has likewise been running for more than 10 hours:

CREATE TABLE Trxdata.Details_data12
WITH ( DISTRIBUTION = ROUND_ROBIN )
AS SELECT * FROM Trxdata.Stage_External_Table_details_data ;

I see that both SQL statements show an ACTIVE status in sys.dm_pdw_exec_requests. [I thought it might be a concurrency slot issue and that the queries had not been granted concurrency slots to run, but that is not the case.]
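For reference, this is roughly how I am checking the request and the concurrency-slot waits (a minimal sketch; the LIKE filter on the command text is just my way of finding my own statement):

-- Show the load request and how long it has been running
SELECT request_id, status, submit_time, start_time,
       total_elapsed_time, resource_class, command
FROM sys.dm_pdw_exec_requests
WHERE command LIKE '%Stage_External_Table_details_data%'
ORDER BY submit_time DESC;

-- Check whether the request is waiting on concurrency slots or other objects
SELECT w.session_id, w.type, w.object_type, w.object_name, w.state
FROM sys.dm_pdw_waits w
JOIN sys.dm_pdw_exec_requests r ON w.request_id = r.request_id
WHERE r.command LIKE '%Stage_External_Table_details_data%';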

I was hoping that scaling up the DWU might improve performance, but looking at the DWU usage in portal.azure.com, I am not convinced I should increase it, because the DWU usage chart shows less than 50 DWU for the last 12 hours.

[DWU usage chart]
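If I do decide to scale up, my understanding is that it is a single service-objective change along these lines (a sketch only; MySqlDw is a placeholder for my actual database name, and the statement is run against the logical server's master database):

-- Scale the warehouse from DW300 to DW400 (placeholder database name)
ALTER DATABASE MySqlDw MODIFY ( SERVICE_OBJECTIVE = 'DW400' );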

So I am trying to understand: how can I find out what is taking so long, and how can I improve the performance of my data load?

Aravind
  • Another quick stat to share: processing 7 compressed [LZ4 format] files, 6.63 GB in size and containing 455 million records, completed in 115 minutes [400 DWU, largerc resource class, INSERT statement run through SSMS from a Windows Azure Virtual Machine in the same Azure region as the Blob Storage and SQL DW]. – Aravind Aug 20 '16 at 03:39

3 Answers

2

I suspect your problem lies with the file(s) being compressed. Many Azure documents state that you will only get one reader per compressed file. As a test, I would suggest you decompress your data, try the load, and see whether decompress-plus-load is faster than the 10+ hours you are currently seeing for the compressed data. I also have better luck with several files rather than one large file, if that is an option for your system.
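As a rough sketch of what that test could look like, assuming your external data source already exists and the decompressed copies are placed in a separate folder (the object names, folder path, and the single placeholder column below are made up; reuse the column list and data source from your existing external table):

-- File format with no DATA_COMPRESSION option, i.e. plain delimited text
CREATE EXTERNAL FILE FORMAT UncompressedTextFormat
WITH ( FORMAT_TYPE = DELIMITEDTEXT,
       FORMAT_OPTIONS ( FIELD_TERMINATOR = '|' ) );

-- External table over the folder holding the decompressed files
CREATE EXTERNAL TABLE Trxdata.Stage_External_Uncompressed
( Col1 int  -- placeholder; use the same column list as Stage_External_Table_details_data
)
WITH ( LOCATION = '/details_data_uncompressed/',
       DATA_SOURCE = MyBlobStorageSource,  -- your existing external data source
       FILE_FORMAT = UncompressedTextFormat );

-- Same CTAS as before, just pointed at the uncompressed external table
CREATE TABLE Trxdata.Details_data_test
WITH ( DISTRIBUTION = ROUND_ROBIN )
AS SELECT * FROM Trxdata.Stage_External_Uncompressed ;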

1

Please have a look at the blog post below from the SQL CAT team on data-loading optimizations: https://blogs.msdn.microsoft.com/sqlcat/2016/02/06/azure-sql-data-warehouse-loading-patterns-and-strategies/

Based on the info provided, a couple of things to consider are:

1) Locality of the blob files compared to the DW instance. Make sure they are in the same region.

2) Clustered columnstore is on by default. If you are loading 22 GB of data, a HEAP load may perform better (though I am not sure about the row count either). So:

CREATE TABLE Trxdata.Details_data12 
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM Trxdata.Stage_External_Table_details_data ;

If the problem still persists, please file a support ticket: https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-get-started-create-support-ticket/

  • Hi Ron. 1. Yes, the Blob Storage and SQL DW are in the same region (US West). 2. The 22 GB of compressed files are expected to contain approximately 1.1 billion records. I have initiated the load now; let me see how that goes. – Aravind Aug 20 '16 at 03:22
  • I also tried running CREATE TABLE WITH HEAP, without much luck: INSERT INTO (into a columnstore table) took 4 hr 27 min, whereas CREATE WITH HEAP took 4 hr 50 min. – Aravind Aug 20 '16 at 11:14
0

You mention that the data is in a compressed format. How many compressed files does the data reside in? For compressed files, you'll achieve more parallelism, and thus better performance, when the data is spread across many files. Uncompressed files do not need to be split across multiple files to achieve better performance, so another way to test whether this is your performance issue is to uncompress your files.

Sonya Marshall
  • Hi Sonya, there is one file per compressed file (in LZ4 compressed format). Altogether my 24 files contain 1.1B records. Decompressing every file is somewhat time consuming, as I have 512 files in total (I started with 24 files to evaluate the performance); each file is approximately 1 GB compressed, and the uncompressed size of each file will be around 10 GB. – Aravind Aug 20 '16 at 03:29
  • It might help to look at the article https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-manage-monitor/ to verify the long-running step, which presumably will be the HadoopRoundRobinMoveOperation, and to verify that the DMS readers and writers (sys.dm_pdw_dms_workers) are all taking the same amount of time, i.e. that you don't have some sort of processing skew (a sketch of these checks appears after the comments). If you can, it would be a good test to uncompress the files to see if you get dramatically different performance. – Sonya Marshall Aug 20 '16 at 15:58
  • Another thing which can affect performance is defining columns much wider than they need to be. https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-best-practices/#use-the-smallest-possible-column-size – Sonya Marshall Aug 20 '16 at 16:02
  • Thanks. The data types/precision are not much wider than needed. For now, we are continuing to load the LZ4 files directly into the SQL DW internal table (which is taking a long time), but a question I am still curious about is: why does the DWU usage chart not go higher even when I am loading a huge amount of data and the load is taking a long time? – Aravind Aug 26 '16 at 17:54
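For completeness, a minimal sketch of the step and DMS-worker checks described in the comments above, assuming the request id (for example 'QID1234', a placeholder) has first been looked up in sys.dm_pdw_exec_requests:

-- Find the longest-running step of the load request
SELECT step_index, operation_type, status, row_count, total_elapsed_time
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID1234'   -- placeholder request id
ORDER BY total_elapsed_time DESC;

-- Compare DMS readers and writers for the slow step to spot processing skew
SELECT type, pdw_node_id, distribution_id, rows_processed,
       bytes_processed, total_elapsed_time
FROM sys.dm_pdw_dms_workers
WHERE request_id = 'QID1234'   -- placeholder request id
ORDER BY total_elapsed_time DESC;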