Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
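
A minimal PySpark sketch of a few of the features listed above (an existing SparkSession with the delta-spark package configured is assumed; the path and column name are illustrative, not from any question below):

    from delta.tables import DeltaTable

    path = "/tmp/delta/events"  # hypothetical location

    # ACID writes + schema enforcement: appends must match the existing table schema
    spark.range(0, 5).toDF("id").write.format("delta").mode("overwrite").save(path)
    spark.range(5, 10).toDF("id").write.format("delta").mode("append").save(path)

    # Time travel: read the table as of an earlier version
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
    print(v0.count())  # 5 rows from the first commit

    # Audit history: every commit is recorded in the transaction log
    DeltaTable.forPath(spark, path).history().select("version", "operation").show()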
1226 questions
0
votes
1 answer

Filesystem does not exist when reading delta lake using Synapse pipeline

We are trying to implement a new solution where we use a Synapse pipeline to copy data from an Azure Databricks Delta Lake to a Synapse dedicated pool. I'm using a copy activity where the source is a dynamically constructed query to be executed…
0
votes
1 answer

Synapse Delta tables - reading the latest version of a record

In Synapse, Delta tables are great for ETL because they allow merge statements on Parquet files. I was wondering how to use Delta tables to our advantage for reading the data as well when we load Silver to Gold. Is there a way, in Synapse, to read the…
david
  • 7
  • 2
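
A sketch of one common pattern for the question above: keep only the latest record per key when loading Silver to Gold. The path and the id / last_updated columns are hypothetical:

    from pyspark.sql import functions as F, Window

    silver = spark.read.format("delta").load("/mnt/silver/customers")  # hypothetical path

    # Rank rows per business key by recency and keep only the newest one
    w = Window.partitionBy("id").orderBy(F.col("last_updated").desc())
    latest = (silver
              .withColumn("rn", F.row_number().over(w))
              .where("rn = 1")
              .drop("rn"))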
0
votes
0 answers

Error in querying history of delta lake table

I have created a Delta Lake table with the following code: %%pyspark df = spark.read.load('abfss://email address removed for privacy reasons/data/MoviesDB.csv', format='csv', header=True) delta_table_path =…
Ellen
  • 1
  • 3
0
votes
0 answers

Z-order on a non-partitioned Databricks Delta table and incremental data addition results in full table Z-order

I have a Databricks Delta table which is around 400 GB and is non-partitioned (Databricks recommends not partitioning if the size is < 1 TB); this table is the target of a streaming pipeline. I have Z-ordered this table based on occurence_dttm (data type is…
0
votes
3 answers

How to access the latest version of Delta table as Integer?

Can someone let me know how to convert the following from a DataFrame to an IntegerType()? df = DeltaTable.forPath(spark, '/mnt/lake/BASE/SQLClassification/cdcTest/dbo/cdcmergetest/1').history().select(max(col("version")).alias("version")) I have tried…
Patterson
  • 1,927
  • 1
  • 19
  • 56
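
A sketch of one way to get the latest version as a plain Python integer rather than a DataFrame (the path is taken from the question; collect() pulls the single aggregated row back to the driver):

    from delta.tables import DeltaTable
    from pyspark.sql.functions import col, max as spark_max

    history_df = DeltaTable.forPath(
        spark, "/mnt/lake/BASE/SQLClassification/cdcTest/dbo/cdcmergetest/1"
    ).history()

    # collect() returns Row objects; indexing the field gives an ordinary int
    latest_version = (history_df
                      .agg(spark_max(col("version")).alias("version"))
                      .collect()[0]["version"])
    print(type(latest_version), latest_version)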
0
votes
2 answers

Writing WHERE clause on Delta Lake History Tables

I am trying to query the Delta Lake table history as described at the following link: https://learn.microsoft.com/en-us/azure/databricks/delta/history When I describe the Delta table as follows: describe history…
Patterson
  • 1,927
  • 1
  • 19
  • 56
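
A sketch of filtering the history output, assuming a hypothetical table name: DESCRIBE HISTORY (or DeltaTable.history()) returns an ordinary DataFrame, so the usual where/filter API applies to its result:

    from delta.tables import DeltaTable

    history_df = DeltaTable.forName(spark, "my_schema.my_table").history()  # hypothetical name
    (history_df
     .where("operation = 'MERGE'")
     .select("version", "timestamp", "operation", "operationMetrics")
     .show(truncate=False))

    # Equivalent with SQL: wrap DESCRIBE HISTORY in spark.sql and filter the result
    spark.sql("DESCRIBE HISTORY my_schema.my_table").where("operation = 'MERGE'").show()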
0
votes
0 answers

Force Delta Lake SQL interpretation on plain insert

Background: I am converting a (postgre)SQL team over to running on Spark with Delta Lake by providing a SQL runner in Scala that picks up and runs their files in a conventional order of directories and files. My goal is to make it possible to convert…
combinatorist
  • 562
  • 1
  • 4
  • 17
0
votes
1 answer

How to Expose Delta Table Data or Synapse Table to a REST API Service using Java/Python

We would like to expose the data from a Delta table or Synapse table through a REST API service based on either Java or Python. Kindly provide a sample in Java or Python to kick-start my new implementation. Thanks
Developer KE
  • 71
  • 1
  • 2
  • 14
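
A rough Python sketch for the question above, not a production design: read a Delta table with PySpark and serve a bounded sample of rows over HTTP with Flask. The table path, endpoint, and row limit are all hypothetical:

    import json

    from flask import Flask, jsonify
    from pyspark.sql import SparkSession

    app = Flask(__name__)
    spark = SparkSession.builder.appName("delta-rest").getOrCreate()  # assumes Delta Lake is configured
    TABLE_PATH = "/mnt/gold/customers"  # hypothetical

    @app.route("/customers")
    def customers():
        # Pull a bounded sample back to the driver and return it as JSON
        rows = spark.read.format("delta").load(TABLE_PATH).limit(100).toJSON().collect()
        return jsonify([json.loads(r) for r in rows])

    if __name__ == "__main__":
        app.run(port=8080)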
0
votes
1 answer

Direct copying data from Azure Databricks Delta Lake is only supported when sink is a folder, please enable staging or fix File path

Currently, I am trying to copy data from Azure Databricks Delta Lake to my Azure Data Lake through Azure Data Factory. I want to copy to a dynamic directory and a dynamic file name, but I keep receiving this error "Direct copying data from Azure…
0
votes
2 answers

Read existing delta table with Spark SQL

I used a little PySpark code to create a Delta table in a Synapse notebook. Partial code: # Read file(s) into a Spark data frame sdf = spark.read.format('parquet').option("recursiveFileLookup", "true").load(source_path) # Create new delta table with…
Joost
  • 1,873
  • 2
  • 17
  • 18
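
A sketch of two common ways to query an existing Delta table with Spark SQL in a notebook, assuming a hypothetical storage path: query the path directly with the delta.`<path>` syntax, or register it once and query it by name:

    delta_path = "abfss://container@account.dfs.core.windows.net/silver/my_table"  # hypothetical

    # Query the Delta table directly by path
    spark.sql(f"SELECT * FROM delta.`{delta_path}` LIMIT 10").show()

    # Or register it as a table once, then query it by name
    spark.sql(f"CREATE TABLE IF NOT EXISTS my_table USING DELTA LOCATION '{delta_path}'")
    spark.sql("SELECT COUNT(*) FROM my_table").show()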
0
votes
0 answers

Azure Databricks restore data from specific dates

Azure Databricks option("mergeSchema", "true") can easily handle new changes coming from the source. I have one real-time issue: restoring data in real-time streaming from specific dates. At present we are managing new column additions and data is…
anuj
  • 124
  • 2
  • 13
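
A sketch of reading or restoring a Delta table as of a specific date (the table name, path, and timestamp are hypothetical):

    # Read the table as it looked at a point in time (timestamp travel)
    df = (spark.read.format("delta")
          .option("timestampAsOf", "2023-01-01")
          .load("/mnt/bronze/events"))

    # Or roll the table itself back to that point in time
    spark.sql("RESTORE TABLE my_table TO TIMESTAMP AS OF '2023-01-01'")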
0
votes
1 answer

Where clause in delta table time travel

I am trying to get a customer's first_name before it was updated in a Databricks Delta table. However, I am getting a ParseException while trying the SQL below: select c_first_name from customer_table where c_customer_sk=1 version as of 0 The workaround…
soumya-kole
  • 1,111
  • 7
  • 18
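
A sketch of the syntax that usually resolves this: VERSION AS OF belongs to the table reference, so it goes before the WHERE clause (table and column names are taken from the question):

    df = spark.sql("""
        SELECT c_first_name
        FROM customer_table VERSION AS OF 0
        WHERE c_customer_sk = 1
    """)
    df.show()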
0
votes
1 answer

How can we set table properties for a Delta table in PySpark using the DeltaTable API

Below is the code that I am trying in PySpark: from delta import DeltaTable delta_table = DeltaTable.forPath(spark, delta_table_path) delta_table.logRetentionDuration = "interval 1 days" After this, do we need to save this config, or will it be…
YOGESH
  • 43
  • 5
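
A sketch of one way to set this: logRetentionDuration is a table property, so assigning an attribute on the DeltaTable object is not persisted; issuing ALTER TABLE ... SET TBLPROPERTIES through spark.sql stores it in the table metadata (the path is hypothetical):

    spark.sql("""
        ALTER TABLE delta.`/mnt/lake/my_table`
        SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 1 days')
    """)

    # Verify the property was stored
    spark.sql("DESCRIBE DETAIL delta.`/mnt/lake/my_table`").select("properties").show(truncate=False)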
0
votes
0 answers

'_delta_log/*.*' cannot be listed for an external table created from Azure serverless

CREATE EXTERNAL TABLE test_table ( bd datetime2(7) NULL, name1 nvarchar(max) NULL, last nvarchar(max) NULL ) WITH ( LOCATION = 'abfss://hhrr@xxxxx.dfs.core.windows.net/bronze/ext/test_table', DATA_SOURCE =…
mxendire
  • 41
  • 1
  • 6
0
votes
1 answer

Unable to write data to delta table for the first time to specified location even though using .format("delta")

I am trying to write the data to a Delta table sink (ADLS Gen2 mounted to Databricks) and I am getting the below error. Please note that I am writing the data for the first time to the Delta table and no table/folder was created manually for that…
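
A minimal sketch of a first-time write to a Delta path on a mounted location (the mount path and the df DataFrame are hypothetical); the first write creates the _delta_log directory itself, so nothing needs to exist beforehand:

    target_path = "/mnt/adls/silver/my_table"  # hypothetical mount path

    (df.write
       .format("delta")
       .mode("overwrite")
       .save(target_path))

    # Optionally register the path as a table for SQL access
    spark.sql(f"CREATE TABLE IF NOT EXISTS my_table USING DELTA LOCATION '{target_path}'")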