
I'm using an Azure Synapse Serverless SQL database.

I have a first copy activity that uses Azure SQL tables as sources and an Azure Data Lake Storage Gen2 account, where I store .parquet files, as sink. From these .parquet files, I use CETAS to create external tables in my Serverless SQL DB (for context: I'm working with several Azure SQL databases, so these external tables allow me to write cross-db queries). In other words, this Serverless SQL DB is my ODS database.

I then have a second copy activity that identifies incremental changes in the source tables (using the SYS_CHANGE_VERSION of the corresponding CHANGETABLE of the source DB). This second copy activity also outputs .parquet files.

In the end, I have 2 parquet files: one with the full content of the source table, and one with the content to insert or update. External tables in a Serverless SQL DB are only metadata, so there is no way to run DML operations on them. So my question is: is there a way to "merge" my 2 parquet files into a single one (without duplicates, of course) that I could use to recreate an updated external table?
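For illustration, the merge logic I'm after is just "keep the most recent row per primary key". Below is a minimal pure-Python sketch of that logic; the PK and LastUpdate column names are placeholders, and in practice the rows would be read from the two parquet files (e.g. with pyarrow in a Synapse notebook) rather than hard-coded:

```python
def merge_latest(full_rows, delta_rows, key="PK", version="LastUpdate"):
    """Keep one row per key, preferring the row with the highest version value."""
    merged = {}
    for row in full_rows + delta_rows:
        best = merged.get(row[key])
        if best is None or row[version] > best[version]:
            merged[row[key]] = row
    return sorted(merged.values(), key=lambda r: r[key])

# Illustrative data: a full extract plus one incremental (updated) record.
full = [
    {"PK": 1, "name": "Karikala", "LastUpdate": "2023-05-01"},
    {"PK": 4, "name": "Kalyani", "LastUpdate": "2023-05-01"},
    {"PK": 7, "name": "Sindhu", "LastUpdate": "2023-05-01"},
]
delta = [{"PK": 1, "name": "Aadhi", "LastUpdate": "2023-06-01"}]

merged = merge_latest(full, delta)
# merged has one row per PK; PK=1 now carries the updated name from the delta
```

The deduplicated result could then be written back as a single parquet file and used to recreate the external table.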

Alternatively, I see that I can choose the "Upsert" copy method in the copy activity sink and provide a key column (the PK of my table), but it fails with: "Message=INSERT operation is not allowed for this type of table." (which seems normal, as the associated sink dataset points at my external table, which is read-only)


Any idea on how to solve this? Thanks!

  • This doesn’t translate completely but may help if only a little bit … https://stackoverflow.com/questions/70569760/azure-synapse-serverless-delete-files-created-by-a-create-external-table-as-sele/70572424#70572424 – Skin Aug 26 '23 at 03:37

1 Answer


Since external tables in Azure Synapse Serverless SQL database are read-only, you cannot use the Upsert copy method to update the external table directly.

  • If the full-load file and the incremental files are in the same folder in ADLS, then while creating the external table you can give the wildcard ** in place of a filename.

Sample external table script:

CREATE EXTERNAL TABLE [dbo].[external_table] (
    [PK] nvarchar(4000),
    [name] nvarchar(4000),
    [ingestion_time] nvarchar(4000)
)
WITH (
    LOCATION = '<folder-name>/**',
    DATA_SOURCE = <datasource-name>,
    FILE_FORMAT = <fileformatname>
)

This makes the external table read all files under that folder, combining the full-load and incremental data.

  • When a record from the initial full load is later replaced by new data, only the latest version of the record should be kept.

For example, File1 data:

PK,name,ingestion_time
1,Karikala,2023-05-01
4,Kalyani,2023-05-01
7,Sindhu,2023-05-01

File2 data:

PK,name,ingestion_time
1,Aadhi,2023-06-01

Here, File2 has the updated record for PK=1. When the query select * from external_table is executed, all four records are returned. Therefore, create a view on top of this external table that returns only the latest record for each primary key.

Sample Query:

with cte as (
    SELECT *,
           RANK() OVER (PARTITION BY PK ORDER BY [ingestion_time] DESC) AS [Rank]
    FROM [dbo].[external_table]
)
select PK, name, [ingestion_time] from cte where [Rank] = 1

Result:

PK  name     ingestion_time
1   Aadhi    2023-06-01
4   Kalyani  2023-05-01
7   Sindhu   2023-05-01

This query returns, for each PK value in external_table, the row with the highest ingestion_time value.
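As a sanity check, the same keep-latest-per-key behavior can be reproduced in plain Python with the sample rows above (this is only an illustration of what the RANK()-based query returns, not part of the pipeline itself):

```python
from itertools import groupby

# Rows copied from the File1 and File2 samples above: (PK, name, ingestion_time).
rows = [
    ("1", "Karikala", "2023-05-01"),
    ("4", "Kalyani", "2023-05-01"),
    ("7", "Sindhu", "2023-05-01"),
    ("1", "Aadhi", "2023-06-01"),
]

# Sort by PK, then keep the row with the latest ingestion_time per PK.
# This mirrors RANK() OVER (PARTITION BY PK ORDER BY ingestion_time DESC) = 1
# (ISO date strings compare correctly as plain strings).
latest = [
    max(group, key=lambda r: r[2])
    for _, group in groupby(sorted(rows), key=lambda r: r[0])
]
```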

Aswin
  • Many thanks @Aswin for your detailed answer! I would have preferred avoiding this solution, because: 1/ I have 53 tables to track changes for (meaning 53 separate subfolders to be created in my ADLS) 2/ That means 53 different copy activities to create (no global wildcard path possible) 3/ The 53 external tables will need to be dropped and recreated at each run 4/ The 53 views will also need to be dropped and recreated at each run – Jean-Christophe Aug 28 '23 at 07:56
  • However, I did not think of adding an "Ingestion time" column to the external tables and then using a view to filter the most recent rows, which is very smart! I will see if I can solve 1/ and 2/ by using a specific naming convention for the parquet files and setting a wildcard path like this: LOCATION = '/*_my-table-name.parquet' Not sure that this will work for creating 53 separate external tables – Jean-Christophe Aug 28 '23 at 07:57
  • 1/ I have 53 tables to track changes for (meaning 53 separate subfolders to be created in my ADLS) 2/ That means 53 different copy activities to create (no global wildcard path possible) -- **You can use a ForEach activity to copy the 53 tables from source to ADLS in parquet format. Yes, you need to create the external table script manually for all 53 tables. I guess you are doing the same in your approach** – Aswin Aug 28 '23 at 08:07
  • 3/ The 53 external tables will need to be dropped and recreated at each run -- **No, the external tables do not need to be dropped and recreated at each run, because external tables do not store any data; they just read the data from the ADLS files. Thus, without dropping and recreating the table, it will be refreshed with new data.** – Aswin Aug 28 '23 at 08:09
  • 4/ The 53 views will also need to be dropped and recreated at each run -- **Again, it is not required to drop and recreate the views, because the views are just created on top of the external tables; whenever the external tables are refreshed, the views will return the latest data** – Aswin Aug 28 '23 at 08:11
  • However, I did not think of adding an "Ingestion time" column to the external tables and then using a view to filter the most recent rows, which is very smart! -- **I added ingestion time in my input considering that field would be used as a key for loading the incremental records. Since you are also loading data incrementally, perhaps the data will have some incremental column with which only the delta records are copied on each run. Use that field in place of ingestion_time.** – Aswin Aug 28 '23 at 08:13
  • Hi @Aswin, I think that I managed to do it! I used your method by adding a "LastUpdate" column to my external tables. It's been very helpful, thanks a lot! – Jean-Christophe Aug 28 '23 at 15:24
  • Hi @Aswin, me again. I tried using the query you proposed to rank and filter the full vs incremental rows in a view: CREATE VIEW myView AS WITH cte AS( SELECT *, RANK() OVER (PARTITION BY myPK ORDER BY LastUpdate DESC) AS RankFilter FROM myExternalTable ) SELECT * FROM cte WHERE RankFilter=1 but in the created view, the column order is completely random. I don't understand why. Any idea on how to preserve the column order using SELECT *? I cannot define each column because this query is included in a ForEach loop with variable table names. I need to remain generic – Jean-Christophe Sep 01 '23 at 08:05
  • the column order is completely random -- **Could you share the sample data and show the data of the rank column?** – Aswin Sep 01 '23 at 08:21
  • Hi @Aswin, not sure what sample data I can provide? Regarding the rank column, it currently always shows the value "1", which is expected. Most of the time, the incremental dataset only contains inserts and very rarely updates. In the end, the data in the view is correct, but the columns are disordered – Jean-Christophe Sep 01 '23 at 08:50
  • Okay. Why is the column order important? – Aswin Sep 01 '23 at 08:52
  • Is there any requirement? – Aswin Sep 01 '23 at 08:52
  • Actually, it's a "nice to have". The created view will be used by my teammates to run other ad-hoc queries, and they might be disappointed by the column order as they are already familiar with the source tables. Don't worry, I will continue searching, or leave it as it is :) thanks! – Jean-Christophe Sep 01 '23 at 09:08
  • here is what I found in Microsoft doc: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/develop-views#limitations :( – Jean-Christophe Sep 01 '23 at 09:13
  • Hi @Aswin, just to let you know that I did it by ranking/filtering the duplicate rows directly while creating the external table. This avoids creating the views, and the column order is preserved! Again, thank you for your great support! – Jean-Christophe Sep 01 '23 at 11:12