Questions tagged [delta-lake]

Delta Lake is an open source storage layer that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel, and a unified batch and streaming source and sink, and it is fully compatible with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: The Delta Lake transaction log records details about every change made to the data, providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
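A minimal PySpark sketch of the features listed above (ACID batch writes, time travel, updates and deletes, and the audit history); the path and column names are made up for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from delta.tables import DeltaTable

    spark = (SparkSession.builder
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    # ACID batch write: the familiar DataFrameWriter API, just with format("delta").
    events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "action"])
    events.write.format("delta").mode("overwrite").save("/tmp/delta/events")

    # Time travel: read an earlier snapshot by version (or with timestampAsOf).
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")

    # Updates and deletes through the DeltaTable API.
    tbl = DeltaTable.forPath(spark, "/tmp/delta/events")
    tbl.delete(F.col("action") == "view")

    # Audit history: the transaction log, exposed as a DataFrame.
    tbl.history().select("version", "operation", "timestamp").show()

    # The same path also works as a streaming source or sink, e.g.
    # spark.readStream.format("delta").load("/tmp/delta/events")
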
1226 questions
0
votes
1 answer

Can multiple data pipelines merge data on the same delta table simultaneously without causing inconsistency?

I know ACID transactions are one of the important features of Delta Lake for reads and writes. Is this also true for the merge operation? What if two pipelines try to perform update operations based on different conditions on the same…
Akhilesh Jaiswal
  • 227
  • 2
  • 14
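A hedged sketch of the usual pattern for this situation: Delta uses optimistic concurrency control, so concurrent merges are allowed, and a merge that conflicts with another commit fails and can simply be retried. The helper below is hypothetical; the table path, join key, and the way conflicts are detected are assumptions:

    import time
    from delta.tables import DeltaTable

    def merge_with_retry(spark, updates_df, path="/mnt/delta/target", retries=3):
        # Hypothetical helper: path and "t.id = s.id" join key are illustrative.
        target = DeltaTable.forPath(spark, path)
        for attempt in range(retries):
            try:
                (target.alias("t")
                       .merge(updates_df.alias("s"), "t.id = s.id")
                       .whenMatchedUpdateAll()
                       .whenNotMatchedInsertAll()
                       .execute())
                return
            except Exception as e:
                # Assumption: conflicting commits surface as Concurrent*Exception
                # errors (e.g. ConcurrentAppendException); anything else is re-raised.
                if "Concurrent" not in str(e):
                    raise
                time.sleep(2 ** attempt)  # back off, then retry the merge
        raise RuntimeError("merge still conflicting after retries")
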
0
votes
1 answer

Update a delta table in Databricks by adding values to existing columns

I have a piece of Scala code that counts signals at 3 different stages with respect to an id_no and an identifier. The output of the code will be as shown…
Antony
  • 970
  • 3
  • 20
  • 46
0
votes
1 answer

Spark Delta table: add new columns in the middle of the schema (schema evolution)

I have to ingest a file with a new column into an existing table structure. create table sch.test ( name string , address string ) USING DELTA --OPTIONS ('mergeSchema' 'true') PARTITIONED BY (name) LOCATION '/mnt/loc/fold' TBLPROPERTIES…
mehere
  • 1,487
  • 5
  • 28
  • 50
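A hedged sketch of automatic schema evolution on write, reusing the table location from the question; note that mergeSchema appends new columns at the end of the schema rather than in the middle. It assumes an active SparkSession with the Delta extensions configured:

    new_data = spark.createDataFrame(
        [("alice", "addr1", "extra")], ["name", "address", "new_col"])

    (new_data.write
             .format("delta")
             .mode("append")
             .option("mergeSchema", "true")   # evolve the table schema on this write
             .save("/mnt/loc/fold"))          # LOCATION from the question's CREATE TABLE
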
0
votes
1 answer

Conditional Upserting into a delta sink with Azure Data Flow in Azure Data Factory

I have a delta sink in an Azure Data Flow module, and the dataframe that I'm using to update it has a hash key for the business keys and a hash key for all column contents. I want to insert new business hash keys into the sink and only update already…
ARCrow
  • 1,360
  • 1
  • 10
  • 26
0
votes
2 answers

Vacuuming Delta tables in Databricks does not work

I'm trying to vacuum my Delta tables in Databricks. However, somehow it is not working, and I don't understand why. This is causing our storage to increase constantly. I have set the following table properties: %sql ALTER TABLE SET…
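A hedged sketch of the usual culprit here: VACUUM only removes files older than the table's retention window (7 days by default), so storage keeps growing until that window passes or is shortened. The table name below is illustrative:

    # Shorten (or confirm) the retention window, then vacuum.
    spark.sql("""
      ALTER TABLE my_db.my_table
      SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 7 days')
    """)
    spark.sql("VACUUM my_db.my_table")
    # Retentions below the safety threshold additionally require
    # spark.databricks.delta.retentionDurationCheck.enabled = false.
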
0
votes
1 answer

Broadcast Timeout on Azure Databricks Delta Delete

Hi, I am trying to delete records from a delta table. It is causing a broadcast timeout error from time to time. Can someone please help with this? spark.sql(s"""DELETE FROM stg.bl WHERE concat(key,':',revision) in (Select distinct…
mehere
  • 1,487
  • 5
  • 28
  • 50
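A hedged sketch of one common workaround, assuming the timeout comes from the IN-subquery being broadcast: raise spark.sql.broadcastTimeout or disable automatic broadcast joins for the session before rerunning the DELETE:

    spark.conf.set("spark.sql.broadcastTimeout", "1200")           # seconds; default is 300
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")   # disable auto broadcast joins
    # ...then rerun the DELETE FROM stg.bl ... statement from the question.
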
0
votes
1 answer

Spark SQL for Databricks Delta tables - Case insensitive string comparison

In Spark SQL, when querying Databricks Delta tables, is there any way to make string comparisons case-insensitive globally? I.e., when applying the WHERE clause to the columns, I would like to avoid the "lcase" or "lower" function…
Kirk Quinbar
  • 1
  • 1
  • 1
0
votes
1 answer

Cannot Allocate Memory in Delta Lake

Problem: The goal is to have a Spark Streaming application that reads data from Kafka and uses Delta Lake to store the data. The partitioning of the delta table is pretty fine-grained; the first partition is the organization_id (there are more than 5000…
0
votes
1 answer

How to dynamically pass a variable to delta table updateAll() in python?

We are using delta (.io) for our data lake. Every X hours we want to upsert all records that are new or changed. Our initial code looks like this: from delta.tables import * for table in output_tables.keys(): update_condition = "old." +…
54m
  • 719
  • 2
  • 7
  • 18
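A hedged sketch of building the merge condition per table inside the loop and letting whenMatchedUpdateAll / whenNotMatchedInsertAll handle the column mapping; the contents of output_tables and the dictionary of source DataFrames are assumptions:

    from delta.tables import DeltaTable

    output_tables = {"customers": "customer_id", "orders": "order_id"}  # table -> key column (assumed)

    for table, key_col in output_tables.items():
        update_condition = f"old.{key_col} = new.{key_col}"
        source_df = new_batches[table]          # dict of source DataFrames (assumed)
        (DeltaTable.forName(spark, table).alias("old")
            .merge(source_df.alias("new"), update_condition)
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute())
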
0
votes
1 answer

How to find out whether a Spark table is parquet or delta?

I have a database with some tables in parquet format and others in delta format. If I want to append data to a table, I need to specify whether it is in delta format (the default is parquet). How can I determine a table's format? I tried show…
M.S.Visser
  • 61
  • 1
  • 6
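A hedged sketch of reading the table's provider out of the catalog metadata; the table name is illustrative, and DeltaTable.isDeltaTable is an alternative when you have a path rather than a table name:

    from delta.tables import DeltaTable

    # The 'Provider' row of DESCRIBE EXTENDED reports 'delta', 'parquet', etc.
    provider_row = (spark.sql("DESCRIBE EXTENDED my_db.my_table")
                    .filter("col_name = 'Provider'")
                    .select("data_type")
                    .first())
    is_delta = provider_row is not None and provider_row[0].lower() == "delta"

    # Path-based alternative:
    # DeltaTable.isDeltaTable(spark, "/mnt/data/my_table")
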
0
votes
1 answer

Are Databricks SQL tables & views duplicates of the source data, or do you update the same data source?

Let's say you create a table in DBFS as follows. %sql DROP TABLE IF EXISTS silver_loan_stats; -- Explicitly define our table, providing schema for schema enforcement. CREATE TABLE silver_loan_stats ( loan_status STRING, int_rate FLOAT, …
beyondtdr
  • 411
  • 6
  • 17
0
votes
1 answer

What are the benefits of using Hyperspace indexes over Z-ordering in Delta Lake?

I have streaming data in Azure Databricks being stored in delta table format. For optimization, I am currently using Z-ordering. Are there any benefits to using the Hyperspace indexing subsystem over Z-ordering?
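For context on the Z-ordering half of the comparison, a hedged sketch (Databricks SQL; table and column names are illustrative): OPTIMIZE with ZORDER BY co-locates related rows so data skipping can prune more files, whereas Hyperspace maintains separate covering-index datasets.

    # Databricks: compact the table and Z-order by the columns used in filters.
    spark.sql("OPTIMIZE events ZORDER BY (device_id, event_time)")
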
0
votes
0 answers

How to specify the file name using DataFrameWriter when saving a parquet file in delta lake format

Having the following code: df.write.format('delta').save(some_path), how can I set the file name, e.g. foo.parquet? I tried: df.write .format('delta') .option("filename","foo") .save(some_path) but it didn't work. Is it possible?
igx
  • 4,101
  • 11
  • 43
  • 88
0
votes
1 answer

Can we delete the latest version of a delta table in the delta lake?

I have a delta table with 4 versions. DESCRIBE HISTORY cfm ---> has 4 versions: 0, 1, 2, 3. I want to delete version 3 or 2. How can I achieve this? I tried: from delta.tables import * from pyspark.sql.functions import * deltaTable =…
nl09
  • 93
  • 1
  • 9
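A hedged sketch of the closest supported operation: individual versions cannot be deleted, but the table can be rolled back with RESTORE (Databricks Runtime 7.4+ or recent OSS Delta), after which VACUUM eventually removes the files that only the abandoned versions reference. The table name cfm is taken from the question:

    spark.sql("RESTORE TABLE cfm TO VERSION AS OF 1")   # roll back past versions 2 and 3
    spark.sql("VACUUM cfm")   # later removes unreferenced files once retention has passed
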
0
votes
1 answer

Spark non-descriptive error in DELTA MERGE

I'm using Spark 3.1 in Databricks (Databricks Runtime 8) with a very large cluster (25 workers with 112 GB of memory and 16 cores each) to replicate several SAP tables in Azure Data Lake Storage (ADLS gen2). To do this, a tool is writing the…