
What would cause PolyBase performance to degrade when querying larger datasets in order to insert records into Azure Data Warehouse from Blob storage?

For example: a few thousand compressed (.gz) CSV files with headers, partitioned into a few hours per day across 6 months' worth of data. Querying these files through an external table in SSMS is not exactly optimal; it's extremely slow.

The objective is to load the data through PolyBase in order to transfer it into Azure Data Warehouse. Except that, with large datasets, PolyBase seems to be pretty slow.

What options are available to optimize PolyBase here? Should I just wait out the query, or load the data incrementally after each upload to blob storage?
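The external objects are set up roughly like this (the names, paths and key below are placeholders, not the real ones):

```sql
-- Placeholder credential and data source pointing at the blob container
CREATE DATABASE SCOPED CREDENTIAL BlobCredential
WITH IDENTITY = 'user', SECRET = '<storage-account-key>';

CREATE EXTERNAL DATA SOURCE AzureBlob
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<container>@<account>.blob.core.windows.net',
    CREDENTIAL = BlobCredential
);

-- Gzipped, comma-delimited text with a header row
CREATE EXTERNAL FILE FORMAT GzipCsv
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"', FIRST_ROW = 2), -- FIRST_ROW skips the header row, where supported
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);

-- External table over the partition folders (placeholder schema)
CREATE EXTERNAL TABLE dbo.ext_Events
(
    EventId   BIGINT,
    EventTime DATETIME2,
    Payload   NVARCHAR(4000)
)
WITH (
    LOCATION = '/events/',
    DATA_SOURCE = AzureBlob,
    FILE_FORMAT = GzipCsv
);
```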

Fastidious
  • When you say degrade, do you mean it gets slower over time, or is it just slow in general? – Murray Foxcroft Feb 20 '17 at 11:47
  • What resource class are you using? Consider using [largerc](https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-best-practices#use-larger-resource-class-to-improve-query-performance) to improve performance, at the cost of reduced concurrency. If you are connected as the default admin user then their resource class will be small by default and can't be changed. DWU400 is pretty low for doing anything, why not 1000, 2000 or 6000 temporarily, then lower it when your CTAS is done? This is one of the really useful features of Azure SQL Data Warehouse, along with pause. – wBob Feb 21 '17 at 14:00
  • We're on xlarge. We found out the admin was small the hard way (why is that?). We're limited to 400 DWU as part of the free credit. We're trying to revert this, but we now have to recreate to do pay-as-you-go. We're trying that next to see if scaling to 6000 DWU has any impact. – Fastidious Feb 22 '17 at 17:54
  • Some other best practices: don't put multiple files into one zip file, and make sure the number of files is greater than or equal to the total number of readers for your service level objective. At DWU400 the max external readers is 32; at DWU6000, the max external readers is 480. You might also experiment with mediumrc for potentially increased concurrency. Please report back any findings you have as they will be useful for the thread! – wBob Feb 23 '17 at 12:13

1 Answer


In your scenario, PolyBase has to connect to the files in the external source, uncompress them, ensure they fit your external table definition (schema), and then expose the contents to the query. When you are processing a large number of text files in a one-off import fashion, there is nothing to really cache either, since it is dealing with new content every time. In short, your scenario is compute heavy.

Azure Blob Storage will (currently) max out at around 1,250 MB/sec, so if your throughput is not close to that limit, the best way to improve performance is to upgrade the DWU on your SQL Data Warehouse. In the background, this spreads your workload over a bigger cluster (more servers). SQL Data Warehouse DWU can be scaled up or down in a matter of minutes.
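For example, the scale operation can be driven from T-SQL against the master database (the database name and target service objective below are placeholders):

```sql
-- Scale the warehouse up before the load, then back down afterwards
ALTER DATABASE MyDataWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW1000');

-- Monitor the scale operation from the master database
SELECT operation, state_desc, percent_complete, start_time
FROM sys.dm_operation_status
WHERE major_resource_id = 'MyDataWarehouse'
ORDER BY start_time DESC;
```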

If you have huge volumes and are maxing the storage, then use multiple storage accounts to spread the load.

Another alternative is to relieve PolyBase of the unzip work as part of your upload or staging process. Do this from within Azure, where the network bandwidth within a data center is lightning fast.

You could also consider using Azure Data Factory to do the work. See the Data Factory documentation for supported file formats; GZip is supported. Use the Copy Activity to copy from Blob storage into SQL DW.

Also look into:

  1. CTAS (Create Table As Select), the fastest way to move data from external tables into internal storage in Azure Data Warehouse.
  2. Creating statistics on your external tables if you are going to query them repeatedly. SQL Data Warehouse does not create statistics automatically the way SQL Server does, so you need to do this yourself (a sketch of both follows below).
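A rough sketch of both, reusing the placeholder names from the question:

```sql
-- CTAS: land the external data in a distributed internal table
CREATE TABLE dbo.Events
WITH
(
    DISTRIBUTION = HASH(EventId),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT EventId, EventTime, Payload
FROM dbo.ext_Events;

-- Statistics on the external table only pay off if you keep querying it directly
CREATE STATISTICS stat_ext_Events_EventTime ON dbo.ext_Events (EventTime);
```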
Murray Foxcroft
  • Maybe I'm not fully understanding. To load data into Azure, I need to add the files to Blob storage. I do this with AzCopy. Once in storage, I set up the API keys in Azure and create an external table on the files. Then the only way to load the data is to query the external table into an internal table, correct? Is there any other way? I assume my only option here is to scale DWU, which I didn't know impacted PolyBase. And I was told Gzip was faster to query, but it sounds like that may not be true for large datasets. – Fastidious Feb 20 '17 at 13:40
  • My data warehouse was at 400 DWU at the time of this query, for reference. – Fastidious Feb 20 '17 at 13:41
  • Gzip adds a layer of overhead: it has to be unpacked before it is processed. Storage is cheap and compute is expensive, so I would upload to Azure as plain CSV (comma or pipe delimited). These are 'easier' for PolyBase to deal with and you will see improved performance. Remember that external tables are just files being read by the data warehouse engine; don't expect them to perform as well as native tables. You can write your own code to insert directly from on-prem into Azure DW, or use SSIS; many of the older tools work fine and see DW just fine using a standard connection string. – Murray Foxcroft Feb 20 '17 at 16:35
  • Also see my addition to the answer on using CTAS https://msdn.microsoft.com/en-us/library/mt204041.aspx – Murray Foxcroft Feb 20 '17 at 16:37
  • The data is in a bucket from another platform. It's not coming from an on-prem data warehouse, for example, and it's already compressed. I can try decompressing the data once it's on the network. I guess there is no way to decompress the data in the blob once it's there, right? – Fastidious Feb 20 '17 at 16:46
  • Seems I would have to download, unzip and re-upload. Sigh. Thanks for the info though! I'll approve this as the answer, which is to up the DWU and decompress the data to reduce overhead. – Fastidious Feb 20 '17 at 16:51
  • Thanks. If you do have huge volumes and slow internet, it could be easier to stand up a VM in Azure and do the decompress there. – Murray Foxcroft Feb 20 '17 at 20:44
  • Yeah, that's what I did here to get the data in from another cloud bucket that was not Amazon S3: download to an Ubuntu VM, AzCopy it over, and load all the data. But I guess it's too large a dataset for 400 DWU. – Fastidious Feb 21 '17 at 02:08
  • It's unlikely compute is your bottleneck. You only get 1 reader per compressed file. There is some compute overhead, but unless your DW instance is already compute bound, your bottleneck is likely the number of readers. As noted above, DWU400 gives you 32 readers. That means a few thousand files, compressed or not, are funneled through 32 pipes. Almost certainly not pegging compute. Test the same set of files on the highest DWU and RC you can spare and see how that plays out before you start preprocessing the files. – SQLmojoe Feb 24 '17 at 21:51
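For what it's worth, a rough way to see how the external read work is being split up is the DMS DMVs; a sketch, assuming the standard sys.dm_pdw_dms_external_work and sys.dm_pdw_exec_requests views:

```sql
-- External work items (roughly one per compressed input file) per load request
SELECT w.request_id, r.command, COUNT(*) AS external_work_items
FROM sys.dm_pdw_dms_external_work AS w
JOIN sys.dm_pdw_exec_requests AS r
    ON r.request_id = w.request_id
GROUP BY w.request_id, r.command;
```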