4

I am currently employed as a Junior Data Developer and recently saw a post saying that Azure Synapse can now create SQL tables from Delta tables. I tried creating an SQL table from a Delta table stored in Azure Data Lake Storage Gen2, but when I use 'PARQUET' as the file format with a wildcard to read the files, the table is populated with redundant data (all the data from every snapshot in the folder).

I tried creating an external file format for my table, but Synapse doesn't accept 'DELTA' as a FORMAT_TYPE. I used 'PARQUET' as the file format and ran VACUUM on my Delta table to keep only its latest snapshot. Whenever I pointed the path at a specific file, or once there was only a single snappy.parquet file in the Delta table, the data was returned correctly.

Basically, is there any way to create a Synapse table/external table that gets its data from a Delta table? If not, is there any way to stop Delta Lake from creating a new snapshot every time data is written/updated/deleted?

Script used:

IF NOT EXISTS (SELECT * FROM sys.external_file_formats WHERE name = 'SynapseParquetFormat') 
CREATE EXTERNAL FILE FORMAT [SynapseParquetFormat] 
WITH ( FORMAT_TYPE = PARQUET)
GO

IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = 'ExtSource') 
CREATE EXTERNAL DATA SOURCE [ExtSource] 
WITH (
    LOCATION   = '*'
)
GO

CREATE EXTERNAL TABLE dbo.ext_table (
    [CostCentre] varchar(8000),
    [CostCentre_MemberId] int
)
WITH (
    LOCATION = 'dimensions/Dim_Example/*.snappy.parquet',
    -- WILDCARD IF THERE IS ONLY ONE FILE OR LATEST FILE NEEDS TO BE SPECIFIED
    DATA_SOURCE = [ExtSource],
    FILE_FORMAT = [SynapseParquetFormat]
)
GO

/* '*' used because clients' data paths were used */

CHEEKATLAPRADEEP
Kieran Goodlip

5 Answers


Querying Delta Lake format in serverless Synapse SQL pool is currently in public preview.

Note: This preview version is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities.

This feature is in public preview and there are some known issues and limitations. Review the known issues on Synapse serverless SQL pool self-help page.

For more details, refer to Query Delta Lake files (preview) using serverless SQL pool in Azure Synapse Analytics.
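
As a sketch of what the preview enables in a serverless SQL pool: the Delta table folder can be queried directly with OPENROWSET and FORMAT = 'DELTA', so the Delta transaction log is honoured and old snapshots are skipped. The storage account, container, and path below are placeholders, not values from the question.

```sql
-- Serverless SQL pool only (preview): point OPENROWSET at the root of the
-- Delta table folder, not at individual snappy.parquet files.
-- The storage URL below is a placeholder for your own account/container/path.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://yourstorageaccount.dfs.core.windows.net/yourcontainer/dimensions/Dim_Example/',
    FORMAT = 'DELTA'
) AS [result];
```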

CHEEKATLAPRADEEP

Currently, there is no DELTA format in the Azure Synapse dedicated SQL pool for external tables. You cannot create a table within a SQL pool that can read the Delta format. Even though you can solve your problem with the PARQUET format and VACUUM, as you mentioned, it's not a recommended solution for everyday data operations. From the Databricks documentation:

"We do not recommend that you set a retention interval shorter than 7 days, because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table. If vacuum cleans up active files, concurrent readers can fail or, worse, tables can be corrupted when vacuum deletes files that have not yet been committed." https://docs.databricks.com/delta/delta-utility.html

And to answer your second question, I don't know any solution to stop creating snapshots. One of the key features of Delta Lake is to provide these snapshots.

A suggestion from my side is to use either Data Factory or a Spark application for data movement, reading from Delta-tables and writing to a table in Dedicated SQL Pool.

  1. With Data Factory you have a built-in connector for Delta tables, but you'll need a Databricks cluster to connect and read the data with Data Factory. Use either a Copy Activity or a Mapping Data Flow to read from Delta and write to a SQL pool. Alternatively, read from Delta, write to Parquet and create an external table in the SQL pool.

  2. The second option is to use a Spark application. Read data from the Delta table into a Spark DataFrame and write it to the SQL pool.

Hosting of the Spark application could be done in either Azure Databricks or Spark-pools for Azure Synapse Analytics. For Azure Databricks there's an Azure Synapse Connector, more info here: https://docs.databricks.com/data/data-sources/azure/synapse-analytics.html#language-python
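
As a minimal sketch of the Spark option, assuming a Databricks notebook (where `spark` is predefined) and placeholder storage paths and JDBC URL, the Azure Databricks Synapse connector can move the data like this:

```python
# Sketch only -- account names, paths and the JDBC URL are placeholders.
# Read the current snapshot of the Delta table into a DataFrame.
df = (spark.read
      .format("delta")
      .load("abfss://container@account.dfs.core.windows.net/dimensions/Dim_Example"))

# Write it to a table in the dedicated SQL pool via the Synapse connector,
# staging the data in a temporary ADLS folder on the way.
(df.write
   .format("com.databricks.spark.sqldw")
   .option("url", "jdbc:sqlserver://yourworkspace.sql.azuresynapse.net:1433;database=yourpool")
   .option("tempDir", "abfss://container@account.dfs.core.windows.net/tempdir")
   .option("forwardSparkAzureStorageCredentials", "true")
   .option("dbTable", "dbo.Dim_Example")
   .mode("overwrite")
   .save())
```

Because the connector stages through `tempDir` and then loads the pool table, the Delta snapshot history never reaches the SQL pool, which avoids the duplicate-rows problem from the question.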

Dholmgren

Delta is not supported in Synapse. You can vote for this item here: https://feedback.azure.com/forums/307516-azure-synapse-analytics/suggestions/41120794-delta-lake-support-to-sql-on-demand

In serverless SQL pool there are some workarounds:

You can use this PowerShell script to generate a Delta Lake view based on the current data: https://github.com/JocaPC/tsql-delta-lake-view/blob/main/GenDeltaLakeView.ps1

Note that you will need to re-generate the view if the data changes.

As an alternative, you can create a Delta view based on the manifest file: https://giangnt2.medium.com/directly-reading-delta-lake-data-from-azure-synapse-8b911dce2bcd

Again, you will need to re-generate it if something changes.
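
Not the script's literal output, but the views it generates boil down to this shape in serverless SQL pool: an OPENROWSET pinned to the Parquet files that belong to the current snapshot. The file names and storage URL below are invented placeholders.

```sql
-- Serverless SQL pool: a view over only the Parquet files of the current
-- snapshot, so stale snapshot files in the folder are ignored. File names
-- and storage URL are placeholders; the view must be re-generated whenever
-- a new snapshot is written.
CREATE VIEW dbo.vw_Dim_Example AS
SELECT *
FROM OPENROWSET(
    BULK (
        'https://yourstorageaccount.dfs.core.windows.net/yourcontainer/dimensions/Dim_Example/part-00000-aaaa.snappy.parquet',
        'https://yourstorageaccount.dfs.core.windows.net/yourcontainer/dimensions/Dim_Example/part-00001-bbbb.snappy.parquet'
    ),
    FORMAT = 'PARQUET'
) AS [result];
```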

Jovan MSFT
  • Can you update your answer and mention @cheekatlapradeep’s update. And can you clarify you are just discussing Synapse Serverless SQL? – GregGalloway Jul 14 '22 at 01:34

The latest version of dedicated SQL pool supports the DELTA format. Please refer to the links below for creating them:

CREATE EXTERNAL DATA SOURCE --> https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql?view=sql-server-ver16

CREATE EXTERNAL FILE FORMAT --> https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-file-format-transact-sql?view=sql-server-ver16

CREATE EXTERNAL TABLE --> https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver16

The links above will help you create everything required to work with Delta tables in Azure Synapse.

Hope this helps you:)
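
For reference, the external file format those links describe is a one-liner; note (per the comments below) that FORMAT_TYPE = DELTA is documented for serverless SQL pools and SQL Server 2022, so check the current docs before relying on it in a dedicated pool:

```sql
-- DELTA as a FORMAT_TYPE, per the CREATE EXTERNAL FILE FORMAT docs linked above.
IF NOT EXISTS (SELECT * FROM sys.external_file_formats WHERE name = 'DeltaFormat')
    CREATE EXTERNAL FILE FORMAT [DeltaFormat]
    WITH (FORMAT_TYPE = DELTA);
GO
```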

  • Delta is not supported by Dedicated SQL pool yet. "Applies only to serverless SQL pools in Azure Synapse Analytics and SQL Server 2022 (16.x) Preview." Please, refer https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-file-format-transact-sql?view=sql-server-ver16&tabs=delta – Everton Oliveira Oct 14 '22 at 14:55
  • @EvertonBarciela -- and so far, giving a table `delta` format is necessary for sql update commands to work. Unfortunately, you have to `drop table` each time: the "overwrite" doesn't "overwrite" .... yet. I'll bet they fix that. – Doug_Ivison Jan 03 '23 at 19:55
  • @EvertonBarciela -- Update: delta is now supported in dedicatedp1. In a sql create table statement, include `USING DELTA` or in a pyspark write method, include `.format("delta")`. Example: `%%pyspark` `import com.microsoft.spark.sqlanalytics` `datafrm1 = spark.read.table("default.myInputTable")` `datafrm1.write.mode("overwrite").format("delta").synapsesql("dedicatedp1.dbo.myOutputTable")` – Doug_Ivison Jan 03 '23 at 19:58
  • Hi @Doug_Ivison, thanks for the update. Can you please share the documentation where it shows it's now supported for SQL dedicated pool? According to the MS documentation, it's still not supported. https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-file-format-transact-sql?view=azure-sqldw-latest&tabs=delimited – Everton Oliveira Jan 04 '23 at 12:34
  • @EvertonBarciela -- doing a quick search, I found this example posted by Sidney Cirqueira ("Azure Synapse Support Engineer / Azure Synapse Influencer") -- https://techcommunity.microsoft.com/t5/azure-synapse-analytics-blog/writing-data-using-azure-synapse-dedicated-sql-pool-connector/ba-p/3535930 – Doug_Ivison Jan 05 '23 at 14:51
  • @EvertonBarciela -- especially note the explanation in these comment lines, from his first example: `# If 'Constants.SERVER' is not provided, the '' from the three-part table name argument` , `# to 'synapsesql' method is used to infer the Synapse Dedicated SQL End Point.` – Doug_Ivison Jan 05 '23 at 14:53
  • @Doug_Ivison, thanks for providing the link. I think perhaps you misunderstood the requirement. The link you posted is to load data into Synapse Dedicated SQL Pool, using a connector Synapse Spark Pools. What we're discussing here and that is not available yet, is the ability to query delta tables from Synapse SQL Dedicated Pool (directly), in the same way you can do using Synapse SQL Serverless. – Everton Oliveira Jan 06 '23 at 11:28

While it is still not supported by default, Mark Pryce-Maher created a T-SQL-based script which can read the Delta format and could thus be used for ingesting data from Delta tables into dedicated SQL pools:

https://techcommunity.microsoft.com/t5/azure-synapse-analytics-blog/reading-delta-lake-in-dedicated-sql-pool/ba-p/3571053

https://github.com/microsoft/Azure_Synapse_Toolbox/tree/master/TSQL_Queries/Delta%20Lake

jugi