
I generate an ORC table (compressed with Snappy) with Spark (Databricks) on an Azure Storage Account (with the ADLS Gen2 feature enabled). This ORC data represents about 12 GB (1.2 billion rows) across 32 columns.

Once it's generated, I load this file into an internal table in Synapse Analytics using Polybase.
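For reference, my setup looks roughly like the sketch below; all object names, the ABFSS path, and the column widths are placeholders (the real table has 32 columns), but the pattern is the same: a Managed Service Identity scoped credential, an external data source, an ORC + Snappy external file format, an external table, and a SELECT INTO.

-- Sketch only: placeholder names, path and widths. Assumes a database master key already exists.
CREATE DATABASE SCOPED CREDENTIAL msi_cred
WITH IDENTITY = 'Managed Service Identity';

CREATE EXTERNAL DATA SOURCE adls_orc
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://mycontainer@myaccount.dfs.core.windows.net',
    CREDENTIAL = msi_cred
);

CREATE EXTERNAL FILE FORMAT orc_snappy
WITH (
    FORMAT_TYPE = ORC,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

CREATE EXTERNAL TABLE ext.my_orc_table
(
    col1 NVARCHAR(4000),   -- placeholder columns; the real table has 32
    col2 NVARCHAR(4000),
    col3 NVARCHAR(4000)
)
WITH (
    LOCATION = '/folder1/',
    DATA_SOURCE = adls_orc,
    FILE_FORMAT = orc_snappy
);

-- The load itself: SELECT INTO from the external table into an internal table.
SELECT *
INTO dbo.my_internal_table
FROM ext.my_orc_table;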

Here are my results with different configurations:

  • DW100c / smallrc = 3h52
  • DW400c / smallrc = 1h50
  • DW400c / xlargerc = 1h58
  • DW1000c / xlargerc = 0h50
  • DW1500c / xlargerc = 0h42

When I look at Storage Account ingress/egress, I see activity for a few minutes (maybe while the ORC files are copied to the Synapse nodes), then the Synapse resources start to be stressed: CPU activity for a while, then memory increasing slowly, very slowly...

Here is an example of memory (red) and max CPU % (blue):

[screenshot]

Do I need to scale up again? I don't think this is a network-throughput problem. Or maybe a configuration problem? Regarding Polybase, I don't understand why this is so slow. Polybase is supposed to ingest TBs of ORC data quickly!

BR, A.

Edit: DWU usage

[screenshot]

alxsbn
  • Can you post your code? – GregGalloway Apr 14 '20 at 01:38
  • @GregGalloway it's a SELECT INTO FROM an external table. We are using Managed Service Identity (database scoped credential) on an EXTERNAL DATA SOURCE. My EXTERNAL FILE FORMAT uses ORC with Snappy – alxsbn Apr 14 '20 at 06:38
  • so no CAST functions? What are data types? Did you choose the minimum string widths possible? – GregGalloway Apr 14 '20 at 06:42
  • We are using NVARCHAR with max values since this is unstructured data – alxsbn Apr 14 '20 at 14:28
  • meaning you are importing JSON data or something that may be arbitrary length and you will parse it later in a stored proc? It may be worth a test to profile the max actual length and see if shrinking the width of the string columns will improve loading performance. I'm worried that the internal data movement buffers will be sized to allow wide data and will perform worse than regularly sized columns. At least that is a common best practice in Azure DW that is well documented (see the sketch just below these comments). – GregGalloway Apr 14 '20 at 14:35
  • It's arbitrary. This is an unstructured big-data use case. Do you have a reference about performance linked to column length? – alxsbn Apr 14 '20 at 18:13
  • https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-best-practices#use-the-smallest-possible-column-size – GregGalloway Apr 14 '20 at 19:21
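To illustrate GregGalloway's suggestion, here is a rough sketch, reusing the placeholder objects from the setup sketch in the question; the widths shown are made up, not profiled values:

-- Profile the real maximum lengths first (this scans the ORC data via PolyBase, so it is not free).
SELECT MAX(LEN(col1)) AS max_col1,
       MAX(LEN(col2)) AS max_col2
FROM ext.my_orc_table;                  -- placeholder external table from the question

-- Then declare the narrowest widths that fit, instead of maximum-width NVARCHAR columns.
CREATE EXTERNAL TABLE ext.my_orc_table_sized
(
    col1 NVARCHAR(200),                 -- hypothetical widths: replace with your profiled maximums
    col2 NVARCHAR(1000),
    col3 NVARCHAR(400)                  -- remaining columns omitted for brevity
)
WITH (
    LOCATION = '/folder1/',
    DATA_SOURCE = adls_orc,             -- same placeholder data source / file format as before
    FILE_FORMAT = orc_snappy
);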

1 Answer


There are a couple of things you can try. Azure Synapse Analytics (formerly known as Azure SQL Data Warehouse) has a concept of readers and writers tied to the DWU. I can't find a current version of this documentation, but some old Gen1 docs I have indicate that DWU1500 has 120 readers. This strongly suggests you should split your one big file up into many files.

I would do some experiments, starting at 10 (i.e. 10 files of 1.2 GB each) and work up until you find an optimal setting for your workload. That said, I have not tested this with ORC files, nor is it clear to me whether the ORC format is already inherently partitioned. Try it and see.

You can also try CTAS (CREATE TABLE AS SELECT) if you're not already using it. This also takes advantage of Synapse's ability to parallelise work.
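For example, a minimal CTAS sketch; the names, distribution and index choices below are placeholders, not recommendations specific to your data:

-- Minimal CTAS sketch: placeholder names; pick distribution/index for your workload.
CREATE TABLE dbo.my_internal_table
WITH (
    DISTRIBUTION = ROUND_ROBIN,           -- often used for an initial landing table
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM ext.my_orc_table                     -- the external (PolyBase) table over the ORC files
OPTION (LABEL = 'load: orc -> internal'); -- a label makes the request easy to find in the DMVs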

There is also a new feature currently in preview called COPY INTO. According to the documentation it is compatible with ORC files and does not require you to split them:

What is the file splitting guidance for the COPY command loading Parquet or ORC files? There is no need to split Parquet and ORC files because the COPY command will automatically split files. Parquet and ORC files in the Azure storage account should be 256MB or larger for best performance.

https://learn.microsoft.com/en-us/sql/t-sql/statements/copy-into-transact-sql?view=azure-sqldw-latest#what-is-the-file-splitting-guidance-for-the-copy-command-loading-parquet-or-orc-files

COPY INTO test_orc
FROM 'https://yourAccount.blob.core.windows.net/yourBlobcontainer/folder1/*.orc'
WITH (
    FILE_FORMAT = yourFileFormat,  -- an EXTERNAL FILE FORMAT defined for ORC (e.g. Snappy compressed)
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<Your_SAS_Token>')
)

Work out whether you are DWU-bound by viewing the DWU usage in the portal: see if it's maxed out / flatlined, which I suspect it isn't.
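If the portal chart isn't conclusive, a quick way to check from T-SQL whether the load is actually running or sitting in a queue, and under which resource class, is the sys.dm_pdw_exec_requests DMV; the label below is the hypothetical one from the CTAS sketch above, so adjust it to whatever you use:

-- Find the load request and check its status and resource class.
SELECT request_id,
       [status],
       resource_class,
       submit_time,
       start_time,
       total_elapsed_time,
       command
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'load: orc -> internal'    -- hypothetical label; adjust to your own
   OR command LIKE '%my_internal_table%'
ORDER BY submit_time DESC;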

wBob
  • Looking at this (old but still open), plus the fact that Polybase doesn't handle PARTITION BY, you could really think that Polybase isn't Hadoop compatible (which is explicitly written in the documentation). I will try some tests and report my results. Thanks ;) – alxsbn Apr 13 '20 at 15:14
  • Oh, and the DWU usage is not flat (same process / time window) and follows the memory usage pattern; I added the screenshot to my post – alxsbn Apr 13 '20 at 16:12