0

How can we use Azure Synapse serverless SQL pool to query the latest version of Delta Lake table ??

Below link specifies it can be done under Delta Lake, but unable to find any examples:

https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/resources-self-help-sql-on-demand#delta-lake

"You can use serverless SQL pool to query the latest version of Delta Lake"

Thanks!

ManiK
  • 377
  • 1
  • 21

1 Answers1

1

The documentation for querying Delta tables in Synapse Serverless SQL is here. Basically the syntax looks like:

SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://sqlondemandstorage.blob.core.windows.net/delta-lake/covid/',
    FORMAT = 'delta') as rows;
Community
  • 1
  • 1
GregGalloway
  • 11,355
  • 3
  • 16
  • 47
  • I am specifically looking for getting the latest version of the delta table which have the last changed snapshot of the table. If the table is updated daily with inserts & updates...I want to view the last/latest snapshot of the delta. above query in serverless pool would give me the entire snapshot of all the changes since here we are pointing to the root delta folder. I am specifically looking for as of timestamp and/or as of version kind of statements like you could run on spark/databricks cluster. I could read the latest version from Azure Mapping Data Flow but need an example using SQL. – ManiK Oct 23 '21 at 06:29
  • in Synapse Serverless SQL pool. – ManiK Oct 23 '21 at 06:37
  • @ManiK the example query I provided will show the most up-to-date information from the delta table. If in Spark with 5 separate statements row A was inserted, row B inserted, row B updated, row A deleted, and then row C inserted, a query against that delta table would return 2 rows (B and C). So it’s the latest version meaning the most up-to-date information. It doesn’t support time travel queries currently (see the link you provided) or the ability to only pull the most recently changed rows (unless that’s indicated by a column you can filter on manually). – GregGalloway Oct 23 '21 at 09:41
  • 1
    @ManiK if you used `FORMAT = 'PARQUET’` against a Delta table folder then you would get a mess back with multiple copies of rows that have updates and you would see deleted rows. – GregGalloway Oct 23 '21 at 09:46
  • agree, we have to use FORMAT='DELTA' not 'PARQUET' as per documentation. However, lately - it turned out to be an issue within our dataflow logic wherein few records inserts/update were messed up and we were to blame DELTA showing incorrect latest version/snapshot. My bad :(. I think if the documentation would have been updated to make it very clear with a one-liner statement that this query always returns the latest version/snapshot of the Delta table, would have been helpful. Although, it does mention that time travel currently isn't supported. Anyways, thanks a lot. – ManiK Oct 25 '21 at 07:23
  • @GregGalloway, going by your example row A was inserted, row B inserted, row B updated, row A deleted and if NO changes done on row D, row E, row F etc.., then serverless SQL query should not return the the unchaged rows but only recently changes rows i.e., only B and C as per your example. Do we have such kind of possibilities? Requirement is to trigger the downstrean pipeline with incremental changes NOT to send entire dataset if only one/few rows get updated/inserted. – 191180rk Jun 02 '23 at 15:50
  • @191180rk I don’t believe that Synapse Serverless SQL supports that functionality currently unless you add a timestamp column or similar to track what has changed yourself. You could use a Spark notebook to use the time travel feature of Delta to get this. – GregGalloway Jun 03 '23 at 16:43