Using CTAS we can leverage the parallelism that PolyBase provides to load data into a new table in a highly scalable and performant way.

Is there a way to use a similar approach to load data into an existing table? The table might even be empty.

What about creating an external table and using INSERT INTO ... SELECT * FROM ...? I would assume that this goes through the control node and is therefore not executed in parallel.

I know that I could also drop the table and use CTAS to recreate it but then I have to deal with all the metadata again (column names, data types, distributions, ...).
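For reference, the INSERT-based approach in question would look something like this (the table and schema names are placeholders, not from an actual workload):

-- Sketch of the INSERT-based load being asked about; dbo.existing_table
-- and ext.staged_data are hypothetical names. PolyBase reads the external
-- table in parallel, but the open question is whether the INSERT itself
-- runs distributed or is funnelled through the control node.
INSERT INTO dbo.existing_table
SELECT *
FROM ext.staged_data;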

Gerhard Brueckl
  • have you tried using partitioned tables? That would do what you're looking for but it might not be feasible depending on how much data you have. – SQLmojoe Nov 30 '16 at 23:37
  • I thought INSERT in Azure SQL DW was fully parallel unlike INSERT in APS. What kind of a performance difference are you seeing vs. CTAS? – GregGalloway Dec 02 '16 at 00:06
  • hey Greg, parallel inserts on a single machine (using all CPUs) are different from parallel/distributed inserts from multiple nodes, which is what I want to achieve. CTAS fully scales with your DWUs/compute nodes, but I think having an INSERT statement limits me back to the single control node. I did not run any tests yet; I just wanted to know whether it's in general possible to get the same performance for INSERTs as for CTAS – Gerhard Brueckl Dec 02 '16 at 14:50

1 Answer


You could use partition switching to do this, although remember not to use too many partitions with Azure SQL Data Warehouse. See 'Partition Sizing Guidance' here.

Bear in mind that check constraints are not supported, so the source table has to use the same partition scheme as the target table.

Full example with partitioning and switch syntax:

-- Assume we have a file with the values 1 to 100 in it.

-- Create an external table over it; it will contain all 100 records
IF NOT EXISTS ( SELECT * FROM sys.schemas WHERE name = 'ext' )
EXEC ( 'CREATE SCHEMA ext' )
GO


-- DROP EXTERNAL TABLE ext.numbers
IF NOT EXISTS ( SELECT * FROM sys.external_tables WHERE object_id = OBJECT_ID('ext.numbers') )
CREATE EXTERNAL TABLE ext.numbers (
    number          INT             NOT NULL
    )
WITH (
    LOCATION = 'numbers.csv',
    DATA_SOURCE = eds_yourDataSource, 
    FILE_FORMAT = ff_csv
);
GO

-- Create a partitioned, internal table with the records 1 to 50
IF OBJECT_ID('dbo.numbers') IS NOT NULL DROP TABLE dbo.numbers

CREATE TABLE dbo.numbers
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED INDEX ( number ), 
    PARTITION ( number RANGE LEFT FOR VALUES ( 50, 100, 150, 200 ) )
    )
AS 
SELECT * 
FROM ext.numbers
WHERE number BETWEEN 1 AND 50;
GO

-- DBCC PDW_SHOWPARTITIONSTATS ('dbo.numbers')


-- CTAS the second half of the external table, records 51-100, into an internal one.
-- As check constraints are not available in SQL Data Warehouse, ensure the switch table
-- uses the same partition scheme as the original table.
IF OBJECT_ID('dbo.numbers_part2') IS NOT NULL DROP TABLE dbo.numbers_part2

CREATE TABLE dbo.numbers_part2
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED INDEX ( number ),
    PARTITION ( number RANGE LEFT FOR VALUES ( 50, 100, 150, 200 ) )
    )
AS 
SELECT *
FROM ext.numbers
WHERE number > 50;
GO


-- Partition switch it into the original table
ALTER TABLE dbo.numbers_part2 SWITCH PARTITION 2 TO dbo.numbers PARTITION 2;


SELECT *
FROM dbo.numbers
ORDER BY 1;
wBob
  • Introducing partition switching would mean that I have to create a partition scheme/function for each table, which involves even more complexity – Gerhard Brueckl Dec 01 '16 at 13:21
  • You do not create partition schemes or functions for Azure SQL Data Warehouse. You simply specify them in the table `CREATE` statement, whether that be a simple `CREATE` or `CREATE TABLE AS (CTAS)`. – wBob Dec 01 '16 at 14:02
  • What I would ideally want to do is insert into an existing table using PolyBase, where PolyBase derives the column structure from the existing table without having to recreate a dummy [external] table manually – Gerhard Brueckl Dec 01 '16 at 14:16
  • I suppose you could try Azure Data Factory, which has a "use Polybase" option, using the new Copy Wizard it will take a guess at the data types for you. Worth a look. – wBob Dec 01 '16 at 14:22
  • we are using ADF at the moment to load our SQL DW and, to be honest, I did not want to use it for the purpose of simply copying data from SQL DW to SQL DW, as it could be quite complex and maybe also expensive to do this for 200+ tables. Hence we were looking for a simple manual process to do this using PolyBase only, without introducing any further complexity – Gerhard Brueckl Dec 02 '16 at 14:31
  • Have you checked the 'Allow Polybase' checkbox? It will use Polybase under the hood so you keep the parallelism you are looking for. – wBob Dec 02 '16 at 15:52
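To illustrate wBob's point above about partitioning in Azure SQL Data Warehouse: the boundary values are declared inline in the CREATE TABLE statement, so there are no separate partition function or scheme objects to manage. A minimal sketch (table and column names are hypothetical):

-- Hypothetical table showing inline partition declaration in Azure SQL
-- Data Warehouse: no CREATE PARTITION FUNCTION / SCHEME statements,
-- the boundaries live in the WITH clause of the table itself.
CREATE TABLE dbo.sales
(
    sale_date   DATE          NOT NULL,
    amount      DECIMAL(18,2) NOT NULL
)
WITH (
    DISTRIBUTION = HASH ( sale_date ),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( sale_date RANGE RIGHT FOR VALUES ( '20160101', '20170101' ) )
);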