Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It provides ACID transactions, scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture (see the sketch after this list).
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
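To make the "Updates and Deletes" and "Time Travel" bullets concrete, here is a minimal PySpark sketch using the open-source delta-spark package. The table path, schema, and sample data are hypothetical.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Spark session wired up for Delta (standard open-source configuration).
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical existing Delta table and a batch of updates.
target = DeltaTable.forPath(spark, "/tmp/delta/customers")
updates = spark.createDataFrame([(1, "alice@new.example")], ["id", "email"])

# Upsert: update matching rows, insert the rest (GDPR/CDC-style workflows).
(target.alias("t")
 .merge(updates.alias("u"), "t.id = u.id")
 .whenMatchedUpdate(set={"email": "u.email"})
 .whenNotMatchedInsertAll()
 .execute())

# Time travel: read the same table as it was at an earlier version.
old = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/customers")
```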
1226 questions
0
votes
1 answer

Error when trying to read a Delta Table [Py4JJavaError]

I have created my very first Delta table using a Notebook in Azure Synapse. I am now trying to read it but I am getting an error. Here is the code I have written (I have masked some of the information): df =…
HamidBee
  • 187
  • 1
  • 7
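For context, reading a Delta table from ADLS Gen2 in a Synapse (or any Spark) notebook usually boils down to the sketch below; the abfss path is a placeholder, since the asker's actual code is truncated above.

```python
# Minimal read of a Delta table stored in ADLS Gen2; the path is a placeholder.
df = (
    spark.read
    .format("delta")
    .load("abfss://container@storageaccount.dfs.core.windows.net/path/to/table")
)
df.show(5)
```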
0
votes
0 answers

Spark Structured Streaming Delta lake schema change

We are currently using Delta as our data lake, with Spark applications using its tables as sources and destinations in Spark Structured Streaming. All of this is deployed within a Kubernetes cluster, and we persist checkpoint data in Spark to handle…
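The excerpt is cut off, but the setup it describes, a Delta table as both streaming source and sink with a persisted checkpoint, typically looks like the sketch below (paths are placeholders). Note that a Delta streaming source typically stops when the source table's schema changes and the query has to be restarted.

```python
# Delta-to-Delta structured streaming with a persisted checkpoint (placeholder paths).
stream = (
    spark.readStream
    .format("delta")
    .load("/mnt/lake/source_table")
)

query = (
    stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/lake/_checkpoints/target_table")
    .outputMode("append")
    .start("/mnt/lake/target_table")
)
```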
0
votes
0 answers

Is it possible to specify a max file size when writing to a Delta Lake table

Is it possible to set a max file size when writing to a Delta Lake table without using Databricks functionality, i.e. using only what open-source delta.io provides?
Hamed Tamadon
  • 69
  • 1
  • 7
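As far as I know, the plain open-source write path has no byte-exact maximum-file-size option; the usual indirect levers are capping records per file and controlling the partition count, as in this sketch (the numbers and path are illustrative).

```python
# Cap the number of rows written to any single file (Spark config, not Delta-specific).
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1_000_000)

# Control how many files a write produces by controlling the partition count.
(df.repartition(8)
   .write
   .format("delta")
   .mode("append")
   .save("/tmp/delta/events"))  # illustrative path
```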
0
votes
1 answer

How to have both ZORDER and VORDER on a single table

I have a table called Human and I want to apply the Z-order optimization technique on Region and Area, with V-Order on the whole table. Can anyone suggest sample code or explain what Predator is as per the documentation OPTIMIZE…
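For the Z-order half of the question, open-source Delta 2.0+ supports OPTIMIZE ... ZORDER BY through Spark SQL, as sketched below (the Human table and columns come from the question). V-Order, as I understand it, is a separate writer-time Parquet optimization in Microsoft Fabric/Synapse that is enabled through its own Spark or table settings rather than through OPTIMIZE.

```python
# Z-order the Human table on the two columns mentioned in the question.
spark.sql("OPTIMIZE Human ZORDER BY (Region, Area)")
```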
0
votes
0 answers

Setting options for MERGE command in Delta Lake

In the DataFrameWriter API we can set options to enable certain Delta features. For example: spark.range(1).write.format("delta").option("mergeSchema", True).mode("append").save("/tmp/test_delta") However, when building a MERGE command, I couldn't…
Zohar Meir
  • 585
  • 1
  • 4
  • 16
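For reference, the Python merge builder has no .option() hook; merge behaviour is generally driven by session configs instead. A sketch, assuming the goal is schema evolution during MERGE (the analogue of mergeSchema) and using a hypothetical source_df:

```python
from delta.tables import DeltaTable

# Enable automatic schema evolution for MERGE at the session level.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

target = DeltaTable.forPath(spark, "/tmp/test_delta")
(target.alias("t")
 .merge(source_df.alias("s"), "t.id = s.id")  # source_df is a placeholder DataFrame
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```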
0
votes
1 answer

Why would someone use a Delta Lake over a dedicated SQL Pool?

From what I've read so far I have come to the following conclusion: a dedicated SQL pool can do everything a Delta Lake can, like ACID transactions, scaling capabilities, and handling batch and streaming data, so what are the differences between…
Jay2454643
  • 15
  • 4
0
votes
0 answers

External Delta table in Databricks

How does a Delta table store data in ADLS Gen2? Let's say we have 100 versions and each version updates only a few records; does that mean it will save the entire snapshot? Let's say the current/latest snapshot is 10 MB and 20 records get updated, so it…
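For what it's worth, a Delta version does not store a full snapshot: an update rewrites only the data files that contain the touched records and appends a JSON commit to _delta_log, while untouched files are shared across versions. The per-version write metrics can be inspected with DESCRIBE HISTORY, as in this sketch (the path is a placeholder).

```python
# operationMetrics shows, per version, how many files and bytes each operation wrote.
history = spark.sql(
    "DESCRIBE HISTORY delta.`abfss://container@account.dfs.core.windows.net/path/to/table`"
)
history.select("version", "operation", "operationMetrics").show(truncate=False)
```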
0
votes
0 answers

Delta table Write Error - cannot assign instance of java.lang.invoke.SerializedLambda

I have an existing Dataproc cluster with Spark version 3.3. As per the doc https://docs.delta.io/latest/releases.html, Delta Lake version 2.3 is compatible with Spark 3.3. Hence I followed the steps below to install Delta Lake: downloaded the Delta Lake jar to…
0
votes
0 answers

Deltalake - Setup Error - java.lang.NoClassDefFoundError: io/delta/storage/LogStore

I have an existing Dataproc cluster with Spark version 3.3. As per the doc https://docs.delta.io/latest/releases.html, Delta Lake version 2.3 is compatible with Spark 3.3. Hence I followed the steps below to install Delta Lake. Configuration on…
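Both Dataproc setup questions above tend to come down to the classpath: io.delta.storage.LogStore lives in the separate delta-storage artifact (not in delta-core), and the SerializedLambda error is often a Scala-build mismatch. Letting Spark resolve the packages, as in this sketch assuming PySpark with Delta 2.3 on Spark 3.3, pulls in both jars with the matching Scala 2.12 build.

```python
from pyspark.sql import SparkSession

# Let Spark resolve delta-core and its delta-storage dependency from Maven.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.3.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```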
0
votes
2 answers

Databricks save dataframe without creating subfolder

I'm trying to save a dataframe in Databricks using this code: df.write.format("delta").save("abfss://my_container@my_storage.dfs.core.windows.net/my_path/filename.snappy.parquet") But this creates an additional subfolder, filename.snappy.parquet, so…
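Delta (like any Spark file sink) always writes a directory of part files plus a _delta_log folder, so passing a path that ends in filename.snappy.parquet simply creates a folder with that name. If the goal is a Delta table, point the write at a directory, as sketched below with a hypothetical table name; if the goal is literally one named Parquet file, write plain Parquet with a single partition and rename the part file afterwards (e.g. with dbutils.fs.mv).

```python
# Point the Delta write at a directory, not at a file name.
(df.write
   .format("delta")
   .mode("overwrite")
   .save("abfss://my_container@my_storage.dfs.core.windows.net/my_path/my_table"))
```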
0
votes
1 answer

Synapse serverless pool: expose Delta table commit version

Is it possible to view the latest Delta table commit version via a Synapse serverless pool? I require this column downstream for incremental loads. This is easy to retrieve via Spark with the readChangeFeed option; however, I would like to expose…
0
votes
1 answer

Spark SQL throws an error while reading a CDF-enabled Delta table in Azure Databricks

I am trying to run the query below in a Python notebook inside Azure Databricks: tab ='db.t1' df =spark.sql(f"SELECT MAX(_commit_version) as max_version FROM table_changes({tab},0)") df.first()["max_version"] But it throws an error as…
Surender Raja
  • 3,553
  • 8
  • 44
  • 80
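One likely culprit in the query above: the f-string expands to table_changes(db.t1, 0) with no quotes, while table_changes expects the table name as a string literal. A sketch of the quoted form (untested against the asker's table):

```python
# Quote the interpolated table name so table_changes receives a string literal.
tab = 'db.t1'
df = spark.sql(
    f"SELECT MAX(_commit_version) AS max_version FROM table_changes('{tab}', 0)"
)
df.first()["max_version"]
```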
0
votes
0 answers

Concurrent updates to a Delta table

I'm trying to figure out how to create a good concurrency-proof Delta table design. To simulate that I've created the following code snippet: import tempfile import threading from pyspark.sql import SparkSession spark =…
Szymson
  • 63
  • 7
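The usual pattern for concurrency-proofing Delta writers is to partition the table so concurrent jobs touch disjoint data and to retry on Delta's optimistic-concurrency exceptions. A sketch, assuming a delta-spark release that exposes the delta.exceptions module and a hypothetical do_merge callable:

```python
import time
from delta.exceptions import ConcurrentAppendException

def merge_with_retry(do_merge, max_attempts=3):
    """Run a merge/update closure, retrying on write conflicts with backoff."""
    for attempt in range(max_attempts):
        try:
            do_merge()  # hypothetical callable performing the Delta merge/update
            return
        except ConcurrentAppendException:
            # Another writer committed a conflicting change; back off and retry.
            time.sleep(2 ** attempt)
    raise RuntimeError("merge failed after retries")
```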
0
votes
1 answer

Databricks error when trying to create Delta Table

I am running the following commands in a Databricks notebook and I get an error at COMMAND 6. The only thing I can think of is that somehow I have not set up the dataframe correctly at the start, but I have specified it as Delta. COMMAND 1: sales_df =…
Ram Nathan
  • 13
  • 4
0
votes
1 answer

Kafka stream in Databricks increases data size a lot

When I perform a Kafka write stream to a table in Databricks, the incoming data doesn't increase the table size significantly, but it results in a much larger increase in the data size on Blob storage. val kafkaBrokers="" val…
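A common reason blob storage grows much faster than the logical table is the combination of many small files written per micro-batch and older file versions being retained for time travel. A hedged sketch of the usual maintenance, with an illustrative path and the default 7-day retention spelled out:

```python
# Compact the small files produced by streaming micro-batches.
spark.sql("OPTIMIZE delta.`/mnt/lake/kafka_sink`")

# Remove data files no longer referenced by versions inside the retention window.
spark.sql("VACUUM delta.`/mnt/lake/kafka_sink` RETAIN 168 HOURS")
```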