
I have a file with data integrated from two different sources using Azure Mapping Data Flow, loaded into an ADLS Gen2 data lake container/folder, for example: /staging/EDW/Current/products.parquet.

I now need to process this staging file using Azure Mapping Data Flow and load it into its corresponding dimension table using the SCD Type 2 method to maintain history.

However, I want to create and process this dimension table as a "Delta" table in Azure Data Lake using Azure Mapping Data Flow only. SCD Type 2 requires a lookup against the target to check whether any rows already exist: if none exist (say, during the first load), insert everything; if rows have changed, expire the old versions and insert the new ones.
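For clarity, this is the classic expire-and-insert pattern I am trying to reproduce in the data flow, written here as dedicated-pool-style T-SQL purely for illustration (stg.products_changed is a placeholder for the staged rows detected as new or changed):

-- illustrative SCD Type 2 pattern only; stg.products_changed is a placeholder
-- 1) expire the currently active version of every changed product
update dbo.dim_products
set valid_to = cast(getdate() as date),
    is_active = 'N'
where is_active = 'Y'
  and product_id in (select product_id from stg.products_changed);

-- 2) insert the new version of changed products plus any brand-new products,
--    continuing the surrogate key from the current maximum
insert into dbo.dim_products
    (product_skey, product_id, product_name, product_category, product_price,
     valid_from, valid_to, is_active)
select (select isnull(max(product_skey), 0) from dbo.dim_products)
           + row_number() over (order by s.product_id),
       s.product_id, s.product_name, s.product_category, s.product_price,
       cast(getdate() as date), '9999-12-31', 'Y'
from stg.products_changed as s;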

To support that lookup (even on the very first load), I need to first create a default/blank "Delta" table in an Azure Data Lake folder, for example /curated/Delta/Dimension/Products/, just as I would in an Azure Synapse Dedicated SQL Pool (formerly Azure SQL DW), where I could first create an empty dbo.dim_products table with just the schema/structure and no rows.
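For reference, this is the kind of empty structure I would have created in the dedicated pool (the heap and distribution options below are just placeholders for illustration):

-- dedicated SQL pool analogy only: an empty dimension table with schema but no rows
create table dbo.dim_products
(
    product_skey int not null,
    product_id int not null,
    product_name nvarchar(max) null,
    product_category nvarchar(max) null,
    product_price decimal(38,18) null,
    valid_from date null,
    valid_to date null,
    is_active char(1) null
)
with (heap, distribution = round_robin);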

I am trying to implement a lakehouse architecture by evaluating and combining the best features of both Delta Lake and the Azure Synapse Serverless SQL pool using Azure Mapping Data Flow, for performance, cost savings, ease of development (low code), and understanding. At the same time, I want to avoid a Logical Data Warehouse (LDW) style of implementation for now.

For this, I tried creating a new database under the built-in Azure Synapse Serverless SQL pool, defining a data source, a file format, and a blank Delta table/schema structure (without any rows), but with no luck.

create database delta_dwh;

create external data source deltalakestorage
with ( location = 'https://aaaaaaaa.dfs.core.windows.net/curated/Delta/' );

create external file format deltalakeformat 
with (format_type = delta);

drop external table dbo.products;
create external table dbo.products
(
    product_skey int,
    product_id int,
    product_name nvarchar(max),
    product_category nvarchar(max),
    product_price decimal(38,18),
    valid_from date,
    valid_to date,
    is_active char(1)
)
with
(
    location = 'Dimensions/Products',   -- relative to the data source location
    data_source = deltalakestorage,
    file_format = deltalakeformat
);

However, this fails, since a Delta table requires the _delta_log/*.json folder/files (the transaction log) to be present. That means I would first have to write a few (dummy) rows in Delta format to the target folder, and only then could I read it and run the queries used in the SCD Type 2 implementation, such as:

select isnull(max(product_skey), 0)
FROM OPENROWSET(
    BULK 'https://aaaaaaaa.dfs.core.windows.net/curated/Delta/Dimensions/Products/',
    FORMAT = 'DELTA') as rows
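Besides fetching the current maximum surrogate key, the other lookup I need against the same Delta folder is change detection, along these lines (the staging URL and the compared columns are taken from my example above, so treat them as illustrative):

-- illustrative change-detection lookup: staged products that are brand new,
-- or whose tracked attributes differ from the currently active dimension row
select s.product_id, s.product_name, s.product_category, s.product_price
FROM OPENROWSET(
    BULK 'https://aaaaaaaa.dfs.core.windows.net/staging/EDW/Current/products.parquet',
    FORMAT = 'PARQUET') as s
LEFT JOIN OPENROWSET(
    BULK 'https://aaaaaaaa.dfs.core.windows.net/curated/Delta/Dimensions/Products/',
    FORMAT = 'DELTA') as d
    on d.product_id = s.product_id
   and d.is_active = 'Y'
where d.product_id is null
   or s.product_name <> d.product_name
   or s.product_category <> d.product_category
   or s.product_price <> d.product_price;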

Any thoughts, inputs, or suggestions?

Thanks!

ManiK
  • Your question is quite interesting. But how do you want to save data in a Delta file from ADF? If I'm not mistaken, you need a Databricks cluster for this. – Arkadiusz Łukasiewicz Oct 06 '21 at 20:51
  • Data will be saved using the Azure Mapping Data Flow visual transformation interface, which behind the scenes runs on a Spark cluster and allows Delta as both source and sink type (inline). The data flow is then added as an activity in ADF. – ManiK Oct 07 '21 at 10:08

2 Answers


You may try creating an initial/dummy data flow + pipeline to create these empty Delta files.

It's only a simple workaround.

  1. Create a CSV file with your sample table data.
  2. Create a data flow named initDelta.
  3. Use this CSV as the source in the data flow.
  4. In the projection panel, set up the correct data types.
  5. Add a filter after the source and set up a dummy condition such as 1 == 2, so that no rows pass through.
  6. Add a sink with Delta output.
  7. Put your initDelta data flow into a dummy pipeline and run it.
  8. The Delta folder structure (including _delta_log) should be created.

You mentioned that your initial data is in a Parquet file. You can use this file instead of a CSV: the table schema (columns and data types) will be imported from the file. Filter out all rows and save the result as Delta; a quick sanity check is sketched below.
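After the pipeline runs, you should be able to confirm the empty table from the serverless pool with something like this (the path is taken from the question, and this assumes the serverless pool can read the schema of an empty Delta table from its transaction log):

-- quick check after the initDelta pipeline: the folder should be queryable and return 0 rows
SELECT COUNT(*) AS row_count
FROM OPENROWSET(
    BULK 'https://aaaaaaaa.dfs.core.windows.net/curated/Delta/Dimensions/Products/',
    FORMAT = 'DELTA') AS r;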

I think it should work, unless I missed something in your problem.

Arkadiusz Łukasiewicz
  • Yes, the workaround you've described is simple, but I am specifically looking to do it via the Synapse Serverless SQL pool, as shown above in my post. Although it's still in preview, the documentation only talks about querying from the Synapse serverless pool; there is no method for creating a Delta structure. So it seems that Delta tables can only be created through Spark, either Mapping Data Flow or Databricks, and the Azure Synapse Serverless SQL pool can only query a Delta table as of now, not create it. – ManiK Oct 08 '21 at 11:46

I don't think you can use the Serverless SQL pool to create a Delta table yet. I think it is coming soon though.
