
Everywhere in the docs, Microsoft recommends using CTAS whenever possible.

I am building a data ingestion process into Azure SQL Data Warehouse (ADW). This process may receive data that we want to delete from the destination table. However, in order to use CTAS we have to rename the table.
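
For context, here is a minimal sketch of the CTAS + rename pattern I mean, with placeholder table and column names:

    -- Hypothetical sketch of the CTAS + rename pattern (placeholder names).
    -- Rebuild the table without the rows to delete, then swap the names.
    CREATE TABLE [dbo].[FactSales_New]
    WITH
    (
        DISTRIBUTION = HASH([SalesKey]),
        CLUSTERED COLUMNSTORE INDEX
    )
    AS
    SELECT *
    FROM [dbo].[FactSales]
    WHERE [SalesKey] NOT IN (SELECT [SalesKey] FROM [dbo].[FactSales_Deletes]);

    RENAME OBJECT [dbo].[FactSales] TO FactSales_Old;
    RENAME OBJECT [dbo].[FactSales_New] TO FactSales;

    DROP TABLE [dbo].[FactSales_Old];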

But what if there is a long-running query on that table? I have noticed that the rename takes a lock and waits until the query ends.

That makes things difficult: on a 1 billion row table a SQL DELETE performs terribly, and a rename stays blocked until the query finishes.

Is there a better way to work around these problems?

Flavio Pegas
  • Have you considered partition switching? – wBob Dec 05 '19 at 08:36
  • Yes, but in my scenario the updates and deletes could be almost anywhere in the table. So I wouldn't know which partition to switch, and most of the time I would have to switch many partitions. – Flavio Pegas Dec 05 '19 at 15:27
  • So I decided to try a full table switch, but I ran into the same locking problem I faced with the rename, which makes sense. In my case, where I would have to switch many partitions, partition switching would also have to wait for the locks on the table, wouldn't it? – Flavio Pegas Dec 05 '19 at 15:34

2 Answers


None of the options Microsoft recommends (rename, table switch, and partition switch) met my requirements.

This is mainly because I want the data warehouse to contain the important information as soon as possible, and I can't wait for every query running against the table to complete before finally doing the ingestion. In production, lots of long-running queries are expected.

Because the DW has very limited options for the actual DELETE operation, I decided to work around those limitations by changing the tables so that the DELETE operation could meet my requirements.

To accomplish that, I did the following:

  1. Add a new column to every table that holds a hash of the table's "PK" columns (the ones that together make the rows unique)
  2. Create all tables distributed on that hash column
  3. On the clustered columnstore tables, order the clustered columnstore index by the "PK" columns (a DDL sketch of steps 1–3 appears after this list)
  4. Write all delete operations in the ingestion process just like the statement below:

    DELETE FROM [{schemaName}].[{tableName}]
    WHERE [{hashColumnName}] IN (SELECT [{hashColumnName}] FROM {stagingTableName})
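
To make steps 1–3 concrete, here is a hypothetical DDL sketch; the table, column, and staging names are placeholders, and the hash is computed with HASHBYTES over the business-key columns while loading the staging table:

    -- Hypothetical example of steps 1-3 (placeholder names).
    CREATE TABLE [dbo].[FactSales]
    (
        [OrderId] int            NOT NULL,
        [LineId]  int            NOT NULL,
        [Amount]  decimal(18, 2) NULL,
        [RowHash] binary(32)     NOT NULL  -- step 1: hash of the "PK" columns
    )
    WITH
    (
        DISTRIBUTION = HASH([RowHash]),                          -- step 2: distribute on the hash column
        CLUSTERED COLUMNSTORE INDEX ORDER ([OrderId], [LineId])  -- step 3: ordered CCI on the "PK" columns
    );

    -- The hash itself can be computed while loading staging, for example:
    -- CAST(HASHBYTES('SHA2_256', CONCAT([OrderId], '|', [LineId])) AS binary(32)) AS [RowHash]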

With these changes I managed to achieve satisfactory ingestion speed for the daily incoming data. Of course, it is much slower than any of the Microsoft recommendations when ingesting large loads of data.

In my tests I achieved 450k rows per minute on DW300, and there are still lots of improvements to be made in the ingestion code.

Hope this helps!

Flavio Pegas

After a while, I found this link: UPDATE FROM in Azure SQL DW?

I didn't know that I could write the delete as shown below. With this, the solution ends up much simpler because I no longer need to create a hash column.

    DELETE a
    WHERE EXISTS (
            SELECT TOP 1 1
            FROM b AUX
            WHERE AUX.b2 = a.a2
            )
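
Applied to the ingestion scenario above, this would look something like the sketch below (placeholder table and column names, and no hash column needed):

    -- Hypothetical application of the same pattern to the ingestion delete
    -- (placeholder names); rows whose keys exist in staging are removed.
    DELETE [FactSales]
    WHERE EXISTS (
            SELECT TOP 1 1
            FROM [FactSales_Staging] AS stg
            WHERE stg.[OrderId] = [FactSales].[OrderId]
              AND stg.[LineId]  = [FactSales].[LineId]
            );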

I don't know how well this will perform.

Flavio Pegas