Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It provides ACID transactions, scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture (see the sketch after this list).
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
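To make the "Updates and Deletes" and "Time Travel" bullets concrete, here is a minimal PySpark sketch using the open-source delta-spark package. The table path, schema, and sample data are hypothetical.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Spark session wired up for Delta (standard open-source configuration).
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical existing Delta table and a batch of updates.
target = DeltaTable.forPath(spark, "/tmp/delta/customers")
updates = spark.createDataFrame([(1, "alice@new.example")], ["id", "email"])

# Upsert: update matching rows, insert the rest (GDPR/CDC-style workflows).
(target.alias("t")
 .merge(updates.alias("u"), "t.id = u.id")
 .whenMatchedUpdate(set={"email": "u.email"})
 .whenNotMatchedInsertAll()
 .execute())

# Time travel: read the same table as it was at an earlier version.
old = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/customers")
```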
1226 questions
0
votes
1 answer

Error when trying to read a Delta Table [Py4JJavaError]

I have created my very first Delta table using a Notebook in Azure Synapse. I am now trying to read it but I am getting an error. Here is the code I have written (I have masked some of the information): df =…
HamidBee
  • 187
  • 1
  • 7
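For context, reading a Delta table from ADLS Gen2 in a Synapse (or any Spark) notebook usually boils down to the sketch below; the abfss path is a placeholder, since the asker's actual code is truncated above.

```python
# Minimal read of a Delta table stored in ADLS Gen2; the path is a placeholder.
df = (
    spark.read
    .format("delta")
    .load("abfss://container@storageaccount.dfs.core.windows.net/path/to/table")
)
df.show(5)
```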
0
votes
0 answers

Spark Structured Streaming Delta lake schema change

We are currently using Delta as our data lake, with Spark applications using its tables as sources and destinations in Spark Structured Streaming. All of this is deployed within a Kubernetes cluster, and we persist checkpoint data in Spark to handle…
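The excerpt is cut off, but the setup it describes, a Delta table as both streaming source and sink with a persisted checkpoint, typically looks like the sketch below (paths are placeholders). Note that a Delta streaming source typically stops when the source table's schema changes and the query has to be restarted.

```python
# Delta-to-Delta structured streaming with a persisted checkpoint (placeholder paths).
stream = (
    spark.readStream
    .format("delta")
    .load("/mnt/lake/source_table")
)

query = (
    stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/lake/_checkpoints/target_table")
    .outputMode("append")
    .start("/mnt/lake/target_table")
)
```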
0
votes
0 answers

Is it possible to specify a max file size when writing to a Delta Lake table

Is it possible to set a max file size when writing to a Delta Lake table without using Databricks functionality, i.e. using only what open-source delta.io provides?
Hamed Tamadon
  • 69
  • 1
  • 7
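As far as I know, the plain open-source write path has no byte-exact maximum-file-size option; the usual indirect levers are capping records per file and controlling the partition count, as in this sketch (the numbers and path are illustrative).

```python
# Cap the number of rows written to any single file (Spark config, not Delta-specific).
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1_000_000)

# Control how many files a write produces by controlling the partition count.
(df.repartition(8)
   .write
   .format("delta")
   .mode("append")
   .save("/tmp/delta/events"))  # illustrative path
```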
0
votes
1 answer

How to have both ZORDER and VORDER on a single table

I have a table called Human and I want to apply the Z-order optimization technique on Region and Area, with V-Order on the whole table. Can anyone suggest sample code or explain what Predator is as per the documentation OPTIMIZE…
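For the Z-order half of the question, open-source Delta 2.0+ supports OPTIMIZE ... ZORDER BY through Spark SQL, as sketched below (the Human table and columns come from the question). V-Order, as I understand it, is a separate writer-time Parquet optimization in Microsoft Fabric/Synapse that is enabled through its own Spark or table settings rather than through OPTIMIZE.

```python
# Z-order the Human table on the two columns mentioned in the question.
spark.sql("OPTIMIZE Human ZORDER BY (Region, Area)")
```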
0
votes
0 answers

Setting options for MERGE command in Delta Lake

In the DataFrameWriter API we can set options to enable certain Delta features. For example: spark.range(1).write.format("delta").option("mergeSchema", True).mode("append").save("/tmp/test_delta") However, when building a MERGE command, I couldn't…
Zohar Meir
  • 585
  • 1
  • 4
  • 16
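For reference, the Python merge builder has no .option() hook; merge behaviour is generally driven by session configs instead. A sketch, assuming the goal is schema evolution during MERGE (the analogue of mergeSchema) and using a hypothetical source_df:

```python
from delta.tables import DeltaTable

# Enable automatic schema evolution for MERGE at the session level.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

target = DeltaTable.forPath(spark, "/tmp/test_delta")
(target.alias("t")
 .merge(source_df.alias("s"), "t.id = s.id")  # source_df is a placeholder DataFrame
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```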
0
votes
1 answer

Why would someone use a Delta Lake over a dedicated SQL Pool?

From what I've read so far I have come to the following conclusion: a dedicated SQL pool can do everything a Delta Lake can, like ACID transactions, scaling capabilities, and handling batch and streaming data, so what are the differences between…
Jay2454643
  • 15
  • 4
0
votes
0 answers

External Delta table in Databricks

How does a Delta table store data in ADLS Gen2? Let's say we have 100 versions and each version updates only a few records; does that mean it will save the entire snapshot? Let's say the current/latest snapshot is 10 MB and 20 records get updated, so it…
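For what it's worth, a Delta version does not store a full snapshot: an update rewrites only the data files that contain the touched records and appends a JSON commit to _delta_log, while untouched files are shared across versions. The per-version write metrics can be inspected with DESCRIBE HISTORY, as in this sketch (the path is a placeholder).

```python
# operationMetrics shows, per version, how many files and bytes each operation wrote.
history = spark.sql(
    "DESCRIBE HISTORY delta.`abfss://container@account.dfs.core.windows.net/path/to/table`"
)
history.select("version", "operation", "operationMetrics").show(truncate=False)
```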
0
votes
0 answers

Delta table Write Error - cannot assign instance of java.lang.invoke.SerializedLambda

I have an existing Dataproc cluster with Spark version 3.3. As per the doc https://docs.delta.io/latest/releases.html, Delta Lake version 2.3 is compatible with Spark 3.3. Hence I followed the steps below to install Delta Lake: downloaded the Delta Lake jar to…
0
votes
0 answers

Deltalake - Setup Error - java.lang.NoClassDefFoundError: io/delta/storage/LogStore

I have an existing Dataproc cluster with Spark version 3.3. As per the doc https://docs.delta.io/latest/releases.html, Delta Lake version 2.3 is compatible with Spark 3.3. Hence I followed the steps below to install Delta Lake. Configuration on…
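Both Dataproc setup questions above tend to come down to the classpath: io.delta.storage.LogStore lives in the separate delta-storage artifact (not in delta-core), and the SerializedLambda error is often a Scala-build mismatch. Letting Spark resolve the packages, as in this sketch assuming PySpark with Delta 2.3 on Spark 3.3, pulls in both jars with the matching Scala 2.12 build.

```python
from pyspark.sql import SparkSession

# Let Spark resolve delta-core and its delta-storage dependency from Maven.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.3.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```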
0
votes
2 answers

Databricks save dataframe without creating subfolder

I'm trying to save a dataframe in Databricks using this code: df.write.format("delta").save("abfss://my_container@my_storage.dfs.core.windows.net/my_path/filename.snappy.parquet") But this creates an additional subfolder, filename.snappy.parquet, so…
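Delta (like any Spark file sink) always writes a directory of part files plus a _delta_log folder, so passing a path that ends in filename.snappy.parquet simply creates a folder with that name. If the goal is a Delta table, point the write at a directory, as sketched below with a hypothetical table name; if the goal is literally one named Parquet file, write plain Parquet with a single partition and rename the part file afterwards (e.g. with dbutils.fs.mv).

```python
# Point the Delta write at a directory, not at a file name.
(df.write
   .format("delta")
   .mode("overwrite")
   .save("abfss://my_container@my_storage.dfs.core.windows.net/my_path/my_table"))
```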
0
votes
1 answer

Synapse serverless pool: expose Delta table commit version

Is it possible to view the latest Delta table commit version via a Synapse serverless pool? I require this column downstream for incremental loads. This is easy to retrieve via Spark with the readChangeFeed option; however, I would like to expose…
0
votes
1 answer

Spark SQL throws an error while reading a CDF-enabled Delta table in Azure Databricks

I am trying to run the query below in a Python notebook inside Azure Databricks: tab ='db.t1' df =spark.sql(f"SELECT MAX(_commit_version) as max_version FROM table_changes({tab},0)") df.first()["max_version"] But it throws an error as…
Surender Raja
  • 3,553
  • 8
  • 44
  • 80
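One likely culprit in the query above: the f-string expands to table_changes(db.t1, 0) with no quotes, while table_changes expects the table name as a string literal. A sketch of the quoted form (untested against the asker's table):

```python
# Quote the interpolated table name so table_changes receives a string literal.
tab = 'db.t1'
df = spark.sql(
    f"SELECT MAX(_commit_version) AS max_version FROM table_changes('{tab}', 0)"
)
df.first()["max_version"]
```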
0
votes
0 answers

Concurrent updates to a Delta table

I'm trying to figure out how to create a good concurrency-proof Delta table design. To simulate that I've created the following code snippet: import tempfile import threading from pyspark.sql import SparkSession spark =…
Szymson
  • 63
  • 7
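The usual pattern for concurrency-proofing Delta writers is to partition the table so concurrent jobs touch disjoint data and to retry on Delta's optimistic-concurrency exceptions. A sketch, assuming a delta-spark release that exposes the delta.exceptions module and a hypothetical do_merge callable:

```python
import time
from delta.exceptions import ConcurrentAppendException

def merge_with_retry(do_merge, max_attempts=3):
    """Run a merge/update closure, retrying on write conflicts with backoff."""
    for attempt in range(max_attempts):
        try:
            do_merge()  # hypothetical callable performing the Delta merge/update
            return
        except ConcurrentAppendException:
            # Another writer committed a conflicting change; back off and retry.
            time.sleep(2 ** attempt)
    raise RuntimeError("merge failed after retries")
```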
0
votes
1 answer

Databricks error when trying to create Delta Table

I am running the following commands in a Databricks notebook and I get an error at COMMAND 6. The only thing I can think of is that somehow I have not set up the dataframe correctly at the start, but I have specified it as Delta. COMMAND 1: sales_df =…
Ram Nathan
  • 13
  • 4
0
votes
1 answer

Kafka stream in Databricks increases data size a lot

When I perform a Kafka write stream to a table in Databricks, the incoming data doesn't increase the table size significantly, but it results in a much larger increase in the data size on Blob storage. val kafkaBrokers="" val…
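A common reason blob storage grows much faster than the logical table is the combination of many small files written per micro-batch and older file versions being retained for time travel. A hedged sketch of the usual maintenance, with an illustrative path and the default 7-day retention spelled out:

```python
# Compact the small files produced by streaming micro-batches.
spark.sql("OPTIMIZE delta.`/mnt/lake/kafka_sink`")

# Remove data files no longer referenced by versions inside the retention window.
spark.sql("VACUUM delta.`/mnt/lake/kafka_sink` RETAIN 168 HOURS")
```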