Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark and big data workloads. It also provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and full compatibility with the Apache Spark APIs; a minimal usage sketch follows the feature list below.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
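A minimal PySpark sketch of the workflow these features describe, assuming a Spark session already configured with the Delta Lake package; the path, schema, and values are hypothetical:

    # Assumes an existing Spark session (`spark`) with the delta-spark package configured.
    from delta.tables import DeltaTable

    path = "/tmp/delta/events"   # hypothetical table location

    # Batch write: an ACID transaction producing Parquet files plus a transaction log.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.write.format("delta").mode("overwrite").save(path)

    # Updates and deletes through the DeltaTable API.
    dt = DeltaTable.forPath(spark, path)
    dt.update(condition="id = 1", set={"value": "'updated'"})
    dt.delete("id = 2")

    # Time travel: read an earlier version of the same table.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

    # The same table also works as a streaming source (and sink).
    stream_df = spark.readStream.format("delta").load(path)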
1226 questions
0
votes
1 answer

What is the difference between a data lakehouse and a delta lake?

I am new to Databricks. I am reading the Microsoft documentation on the data lakehouse. The documentation makes reference to Delta Lake without explaining what the difference is, or even whether there is one. Can someone please help explain this to me? Any…
Jay2454643
  • 15
  • 4
0
votes
0 answers

Performance optimization of PySpark code that queries a Hive external table multiple times

I have PySpark code that applies complex transformations. In this code, we use one particular Hive external table multiple times, to be precise, we use a subset of its data, filtered on the partition column, multiple times. Now if I save the…
Ananth
  • 41
  • 4
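A minimal sketch of one common approach to the question above: read the needed partition subset once, cache it, and reuse the cached DataFrame instead of hitting the Hive external table repeatedly (table, partition, and column names are hypothetical):

    # Assumes an existing Spark session with Hive support; all names are hypothetical.
    subset_df = (
        spark.table("mydb.my_external_table")
             .where("partition_col = '2024-01-01'")   # prune to the needed partition
             .cache()                                  # materialize once, reuse many times
    )
    subset_df.count()   # force the cache to populate before the heavy transformations

    agg_df = subset_df.groupBy("key_col").count()      # first reuse, served from the cache
    filtered_df = subset_df.where("value_col > 0")     # second reuse, no re-read of the table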
0
votes
0 answers

VACUUM operation in Synapse Notebook

I'm trying to execute VACUUM in Synapse notebook, and I'm using this code: spark.sql("SET spark.databricks.delta.retentionDurationCheck.enabled = false") #get list of all tables table_list = spark.sql("show tables from…
coding
  • 135
  • 2
  • 9
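A minimal sketch of one way to finish the loop asked about above, running VACUUM over every Delta table in a database from a Synapse notebook (the database name and retention are hypothetical; disabling the retention check permits retention below the default 7 days and should be used with care):

    # Assumes an existing Spark session in the notebook; the database name is hypothetical.
    spark.sql("SET spark.databricks.delta.retentionDurationCheck.enabled = false")

    for row in spark.sql("SHOW TABLES FROM my_db").collect():
        if row["isTemporary"]:
            continue                                   # skip temp views
        full_name = f"my_db.{row['tableName']}"
        try:
            spark.sql(f"VACUUM {full_name} RETAIN 168 HOURS")
        except Exception as err:                       # non-Delta tables do not support VACUUM
            print(f"Skipped {full_name}: {err}")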
0
votes
0 answers

Power BI - Direct Query. Comparing two tables in a delta table

I want to create a view in which 3 tables or matrices are shown. Tables 1 and 2 have slicers which allow the user to select a "Forecast Cycle" on each. I would like the 3rd table to show the deltas in values between the corresponding rows in table 1…
Lee S
  • 17
  • 6
0
votes
1 answer

How can we optimize the refresh of Power BI dashboards that use DirectQuery to read data from Delta tables built on ADLS

We are facing slow performance with Power BI DirectQuery reading data from Delta Lake in Azure Databricks. We are trying to refresh some Power BI dashboards which read data from Delta tables. Models are created in Power BI Desktop…
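One common mitigation for the slowness described above is compacting and Z-ordering the Delta tables the DirectQuery model reads; a minimal sketch (table and column names are hypothetical, and OPTIMIZE/ZORDER requires Databricks or a recent Delta release):

    # Assumes a Databricks notebook with an existing Spark session; names are hypothetical.
    # Compact small files and co-locate data on the columns the reports filter on most.
    spark.sql("OPTIMIZE sales.fact_orders ZORDER BY (order_date, region)")

    # Refresh table statistics so queries can prune data more effectively.
    spark.sql("ANALYZE TABLE sales.fact_orders COMPUTE STATISTICS")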
0
votes
0 answers

Verify if dependencies are added during spark session creation

I'm trying to run simple code in a Dataproc Jupyter notebook to write data to a Delta table. The code works fine in a Python notebook, but when I run the same code in the Jupyter notebook I run into issues while writing to Delta. Upon further debugging,…
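A minimal sketch of one way to check from the notebook what the running session was actually created with, and to build a fresh session with the Delta extensions if needed (the Delta package version is hypothetical and must match the Spark version):

    from pyspark.sql import SparkSession

    spark = SparkSession.getActiveSession()   # None if no session exists yet
    if spark is not None:
        # These only take effect at session creation time, so inspect what is really set.
        print(spark.conf.get("spark.jars.packages", "<not set>"))
        print(spark.conf.get("spark.sql.extensions", "<not set>"))
    else:
        spark = (
            SparkSession.builder
            .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")  # hypothetical version
            .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
            .config("spark.sql.catalog.spark_catalog",
                    "org.apache.spark.sql.delta.catalog.DeltaCatalog")
            .getOrCreate()
        )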
0
votes
1 answer

Delta table optimization manual VS. auto

I have some Delta format files that need to be optimized regularly. According to this doc, write-conflicts-on-databricks, running OPTIMIZE explicitly can cause conflicts in some cases, such as UPDATE. Meanwhile, with the latest Delta functionality, we can set the table…
QPeiran
  • 1,108
  • 1
  • 8
  • 18
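A minimal sketch of the automatic alternative mentioned above, enabling optimized writes and auto compaction through table properties instead of scheduling explicit OPTIMIZE runs (the table name is hypothetical; these properties are Databricks-specific):

    # Assumes a Databricks notebook with an existing Spark session; the table name is hypothetical.
    # With these properties set, small files are handled at write time, avoiding the
    # conflict-prone pattern of running OPTIMIZE concurrently with UPDATE/MERGE jobs.
    spark.sql("""
        ALTER TABLE my_db.my_table SET TBLPROPERTIES (
            'delta.autoOptimize.optimizeWrite' = 'true',
            'delta.autoOptimize.autoCompact' = 'true'
        )
    """)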
0
votes
1 answer

AWS Glue optimization DPU

I'm using AWS Glue job run auto scaling for the number of workers. After analysing a few metrics of the Glue job run, I've figured out that the job always uses MaxNumberOfWorkers, even with auto scaling active. Is there any way to optimize the job? I…
0
votes
1 answer

Loading Parquet and Delta files into Azure Synapse using ADB or Azure Synapse?

I have the below scenario. We are using Azure Databricks to pull data from several sources, generate Parquet and Delta files, and load them into our ADLS Gen2 containers. We are now planning to create our data warehouse inside Azure…
0
votes
0 answers

How to handle "Using an existing Spark session; only runtime SQL configurations will take effect." in PySpark

As part of pytest, I'm trying to load the Delta Lake extensions into the Spark session like below: @pytest.fixture(scope="function") def spark_session(request): # spark =…
Santosh Hegde
  • 3,420
  • 10
  • 35
  • 51
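A minimal sketch of a fixture that builds the session with the Delta extensions before any other session exists, which is the only point at which non-runtime configs take effect (the stop-and-rebuild step and the session scope are assumptions):

    import pytest
    from pyspark.sql import SparkSession
    from delta import configure_spark_with_delta_pip   # provided by the delta-spark package

    @pytest.fixture(scope="session")
    def spark_session():
        # Extensions are not runtime SQL configs, so they must be set before the first
        # session is created; stop any session left over from earlier imports or tests.
        active = SparkSession.getActiveSession()
        if active is not None:
            active.stop()

        builder = (
            SparkSession.builder
            .master("local[2]")
            .appName("delta-tests")
            .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
            .config("spark.sql.catalog.spark_catalog",
                    "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        )
        spark = configure_spark_with_delta_pip(builder).getOrCreate()
        yield spark
        spark.stop()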
0
votes
1 answer

How to update table location in Databricks on Azure from wasb to abfss

I created a table with a location such as: wasb://@.blob.core.windows.net/foldername. We have updated access to the storage accounts to use abfss. I am trying to execute the following command: alter table…
jalazbe
  • 1,801
  • 3
  • 19
  • 40
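A minimal sketch of the pattern asked about above, assuming an external table and hypothetical storage account, container, and folder names:

    # Assumes a Databricks notebook with an existing Spark session; all names are hypothetical.
    spark.sql("""
        ALTER TABLE my_db.my_table
        SET LOCATION 'abfss://mycontainer@mystorageaccount.dfs.core.windows.net/foldername'
    """)

    # Confirm the metastore now points at the abfss location.
    spark.sql("DESCRIBE DETAIL my_db.my_table").select("location").show(truncate=False)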
0
votes
0 answers

Files getting missed while Processing Incremental File using Autoloader during Initial Load

In my RAW layer I have around 33,000 files which hold the historical data that I need to process as part of the initial load. And daily there will be 10 new files coming into the landing zone. For this I have designed my code with Autoloader for initial…
sayan nandi
  • 83
  • 1
  • 6
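A minimal sketch of an Auto Loader stream that ingests the existing historical files and then the daily arrivals (paths, formats, and table names are hypothetical; cloudFiles is Databricks-specific):

    # Assumes a Databricks notebook with an existing Spark session; paths are hypothetical.
    stream = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.includeExistingFiles", "true")   # pick up the historical files too
        .load("abfss://landing@account.dfs.core.windows.net/raw/")
    )

    (
        stream.writeStream.format("delta")
        .option("checkpointLocation",
                "abfss://landing@account.dfs.core.windows.net/_checkpoints/raw")
        .trigger(availableNow=True)   # process everything currently available, then stop
        .toTable("bronze.raw_events")
    )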
0
votes
1 answer

Implementing SCD Type 2 with Delta Lake

I have a requirement to implement SCD Type 2 in my Delta tables. The scenario is as follows. The source table columns are: state, Code, Name, value, …
Pysparker
  • 93
  • 11
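A minimal sketch of the usual MERGE-based pattern for the question above, closing out the current row for a changed key and inserting rows for new keys (table, key, and column names are hypothetical; a full SCD Type 2 also needs a second step, often a staged union, to insert the new version of each changed row):

    # Assumes an existing Spark session with Delta configured; all names are hypothetical.
    from pyspark.sql import functions as F
    from delta.tables import DeltaTable

    target = DeltaTable.forName(spark, "dim_state")
    updates = spark.table("staging_state").withColumn("effective_from", F.current_timestamp())

    (
        target.alias("t")
        .merge(updates.alias("s"), "t.state = s.state AND t.is_current = true")
        .whenMatchedUpdate(
            condition="t.value <> s.value",   # only close out rows whose value actually changed
            set={"is_current": "false", "effective_to": "current_timestamp()"},
        )
        .whenNotMatchedInsert(
            values={
                "state": "s.state",
                "code": "s.code",
                "name": "s.name",
                "value": "s.value",
                "effective_from": "s.effective_from",
                "effective_to": "null",
                "is_current": "true",
            }
        )
        .execute()
    )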
0
votes
0 answers

Databricks merge with multiple source rows matched

On Databricks I have a target table which looks like below: data = [("0", "null", "female", "USA", True,"insert"), ("1", "1", "male", "USA", True,"insert"), ] tdf = spark.createDataFrame(data, ["Pk", "fk", "c1", "c2", "c3","changetype"])…
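MERGE fails when more than one source row matches the same target row, so one common fix is to deduplicate the source on the merge key first; a minimal sketch (the source table and the ordering column used to pick the latest row are hypothetical):

    # Assumes an existing Spark session with Delta configured; `updated_at` is hypothetical.
    from pyspark.sql import Window, functions as F
    from delta.tables import DeltaTable

    source = spark.table("staging_changes")

    # Keep only the latest row per primary key so each target row matches at most once.
    w = Window.partitionBy("Pk").orderBy(F.col("updated_at").desc())
    deduped = (
        source.withColumn("rn", F.row_number().over(w))
              .where("rn = 1")
              .drop("rn")
    )

    (
        DeltaTable.forName(spark, "target_table").alias("t")
        .merge(deduped.alias("s"), "t.Pk = s.Pk")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )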
0
votes
0 answers

How to use whenNotMatchedBySourceDelete with subquery?

I have a simple Delta Lake table (my_table) that has three columns: col1 (the "primary key"), col2, and col3. I'm attempting to construct a merge call that accomplishes the following: the Delta Lake transaction log does not get modified if there is no…
zzzz8888
  • 23
  • 3
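The whenNotMatchedBySource clauses can only reference target columns (there is no matching source row by definition), so the usual workaround for subquery-style logic is to fold it into the source DataFrame before the merge; a minimal sketch using the columns from the excerpt (the source table and filter are hypothetical, and whenNotMatchedBySourceDelete needs Delta Lake 2.3 or later):

    # Assumes an existing Spark session with Delta Lake >= 2.3 configured.
    from delta.tables import DeltaTable

    # Fold the "subquery" into the source DataFrame rather than the merge condition.
    source = spark.table("incoming_rows").where("col3 IS NOT NULL")   # hypothetical filter

    (
        DeltaTable.forName(spark, "my_table").alias("t")
        .merge(source.alias("s"), "t.col1 = s.col1")
        .whenMatchedUpdateAll(condition="t.col2 <> s.col2 OR t.col3 <> s.col3")  # skip no-op updates
        .whenNotMatchedInsertAll()
        .whenNotMatchedBySourceDelete()   # delete target rows with no matching source row
        .execute()
    )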