Questions tagged [delta-lake]

Delta Lake is an open source storage layer that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel (data versioning), a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs. A minimal usage sketch appears after the feature list below.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
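To make the quoted features concrete, here is a minimal PySpark sketch (paths and table names are illustrative; it assumes a session where `spark` is available and the delta-spark package is configured):

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Write a DataFrame as a Delta table (ACID commits, Parquet files underneath).
df = spark.range(0, 5).withColumn("value", F.col("id") * 10)
df.write.format("delta").mode("overwrite").save("/tmp/delta/demo")

# Time travel: read an earlier snapshot by version number.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/demo")

# Updates, deletes and merge through the DeltaTable API.
tbl = DeltaTable.forPath(spark, "/tmp/delta/demo")
tbl.update(condition=F.col("id") == 1, set={"value": F.lit(999)})
tbl.delete(F.col("id") == 4)

updates = spark.range(3, 7).withColumn("value", F.lit(-1))
(tbl.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Audit history: every change recorded in the transaction log.
tbl.history().show()
```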
1226 questions
0
votes
1 answer

How to wait for df.write() to be complete

In my pyspark notebook, I (1) read from tables to create data frames, (2) aggregate the data frame, (3) write to a folder, and (4) create a SQL table from the output folder. For #1 I do `spark.read.format("delta").load(path)`; for #2 I do `df =…
n179911a
  • 125
  • 1
  • 8
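A note on this one: `DataFrameWriter.save()` is a blocking action, so the write has finished by the time the call returns. A minimal sketch of the read → aggregate → write → register flow (paths, columns and the table name are placeholders):

```python
from pyspark.sql import functions as F

# 1. Read the Delta source into a DataFrame.
df = spark.read.format("delta").load("/mnt/source/table")

# 2. Aggregate the DataFrame.
agg = df.groupBy("some_key").agg(F.count("*").alias("cnt"))

# 3. Write to a folder. save() blocks until the Delta commit completes,
#    so no extra wait step is needed before the next line runs.
agg.write.format("delta").mode("overwrite").save("/mnt/output/agg")

# 4. Register a SQL table over the output folder.
spark.sql("CREATE TABLE IF NOT EXISTS agg_table USING DELTA LOCATION '/mnt/output/agg'")
```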
0
votes
0 answers

Can Databricks Dolly be trained on Databricks tables for generating insights using prompts?

I'm exploring the capabilities of Databricks, specifically Databricks Dolly, and I'm wondering if it's possible to train Dolly on Databricks tables to generate insights by writing prompts. I have a Databricks environment set up and I'm working with…
0
votes
2 answers

JSON column in a Delta table in Databricks

I have a view in Databricks, and in one column the content is JSON,…
Amdbi
  • 11
  • 3
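A common pattern for a JSON column like this is `from_json` with an explicit schema; the view, column and schema below are placeholders:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical schema for the JSON payload; adjust to the real structure.
json_schema = StructType([
    StructField("id", LongType()),
    StructField("status", StringType()),
])

df = spark.table("my_view")  # placeholder view name
parsed = (df
    .withColumn("parsed", F.from_json(F.col("json_col"), json_schema))
    .select("*", "parsed.*")          # flatten the struct into top-level columns
    .drop("parsed", "json_col"))
```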
0
votes
1 answer

Medallion structure in MS Fabric

I have to implement a delta lake using MS Fabric, and I don't know how to implement the medallion structure in this new tool. Any ideas or help? This is the official documentation.
willy sepulveda
  • 159
  • 1
  • 14
0
votes
0 answers

How can I optimize writing BIG files?

I have a dataframe with 120 million records, and when I try to write it, it takes 45 minutes. The DF is partitioned by a field "id_date", which is in the format yyyymmdd. This DF is a delta table in Databricks. I have tried auto optimize, compaction, etc…
tempo
  • 23
  • 4
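For a write of this size, a typical approach is to control the number of files per partition and, on Databricks, enable the auto-optimize table properties; path and column names below are placeholders:

```python
# Co-locate rows by partition value so each "id_date" folder gets a few
# reasonably sized files instead of many tiny ones.
(df.repartition("id_date")
   .write.format("delta")
   .partitionBy("id_date")
   .mode("overwrite")
   .save("/mnt/delta/big_table"))

# Databricks table properties that coalesce small files at write time.
spark.sql("""
  ALTER TABLE delta.`/mnt/delta/big_table`
  SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )
""")
```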
0
votes
1 answer

An approach to handle multiple updates on a Delta table in PySpark by multiple users at the same time

I have a PySpark notebook that reads data from a SQL table and merges the changes into the final layer. It works fine when it is triggered by a single user, but throws the error "Concurrent Update on the STG table has failed" when multiple users…
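For concurrent merges like this, the usual mitigations are to include the partition column in the merge condition (so writers touch disjoint files) and to retry on conflict. A sketch with hypothetical table and column names:

```python
import time
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "final_layer")  # placeholder target table

def merge_with_retry(updates_df, max_retries=3):
    for attempt in range(max_retries):
        try:
            (target.alias("t")
                .merge(updates_df.alias("s"),
                       # partition column in the condition narrows the conflict scope
                       "t.id = s.id AND t.load_date = s.load_date")
                .whenMatchedUpdateAll()
                .whenNotMatchedInsertAll()
                .execute())
            return
        except Exception as exc:  # e.g. ConcurrentAppendException
            if "Concurrent" in str(exc) and attempt < max_retries - 1:
                time.sleep(5 * (attempt + 1))  # simple backoff, then retry
            else:
                raise
```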
0
votes
0 answers

assign `generated always as identity` values in Delta tables programmatically

Is there any way to programmatically access the mechanism used in Databricks to assign Delta table identity columns, so that you can assign the values to a Dataframe pre-insert? Is this available in an io.delta class anywhere or is it something…
wrschneider
  • 17,913
  • 16
  • 96
  • 176
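The identity generator itself is not exposed as a public API. A common workaround for pre-assigning values is to compute them from the current maximum, sketched below with hypothetical table, column and DataFrame names; note that a column declared GENERATED ALWAYS rejects explicit values, so this only applies to GENERATED BY DEFAULT or plain columns, and it is only safe with a single writer:

```python
from pyspark.sql import functions as F, Window

# Current maximum of the identity column in the target table.
max_id = spark.sql("SELECT COALESCE(MAX(id), 0) AS m FROM target_table").first()["m"]

# Assign consecutive values on top of it before the insert.
w = Window.orderBy(F.monotonically_increasing_id())
new_rows_with_ids = new_rows_df.withColumn("id", F.row_number().over(w) + F.lit(max_id))
```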
0
votes
0 answers

Delta lake Spark - whenNotMatchedInsert() do nothing

We have generic delta lake merge code that has both update and insert in a single execute statement. I have some situations where I need to update the values in the table (the insert should be blank), and I need to insert the values in the table ##only…
Joe
  • 47
  • 7
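Merge clauses are optional, so an update-only merge is simply a merge without any `whenNotMatched*` clause; a sketch with a placeholder path and join key:

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/mnt/delta/target")  # placeholder path

# Matching rows are updated; unmatched source rows are simply ignored
# because no whenNotMatchedInsert* clause is attached.
(target.alias("t")
    .merge(source_df.alias("s"), "t.key = s.key")   # source_df: the incoming data
    .whenMatchedUpdateAll()
    .execute())
```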
0
votes
1 answer

Got "java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.FileSourceOptions$" when spark-submit to Amazon EMR

I have a Spark application. My build.sbt looks like name := "IngestFromS3ToKafka" version := "1.0" scalaVersion := "2.12.17" resolvers += "confluent" at "https://packages.confluent.io/maven/" val sparkVersion = "3.3.1" libraryDependencies ++=…
Hongbo Miao
  • 45,290
  • 60
  • 174
  • 267
0
votes
0 answers

Running into delta.exceptions.ConcurrentAppendException even after setting up S3 Multi-Cluster Writes environment via S3 Dynamo DB LogStore

My use-case is to process a dataset spanning hundreds of partitions concurrently. The data is partitioned, and the partitions are disjoint. I was facing ConcurrentAppendException due to S3 not supporting the "put-if-absent" consistency guarantee. From Delta Lake…
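Worth noting for this one: the DynamoDB LogStore only makes commits atomic on S3, while ConcurrentAppendException comes from Delta's logical conflict detection. The documented mitigation is to make the disjoint partitions explicit in each operation's condition, roughly as below (table path, column and value are placeholders):

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "s3a://my-bucket/delta/events")  # placeholder

# Each concurrent job pins its own partition value in the condition, so Delta
# can prove the writers touch disjoint files and does not flag a conflict.
(target.alias("t")
    .merge(updates_df.alias("s"),
           "t.part_date = s.part_date AND t.part_date = '2023-06-01' AND t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```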
0
votes
0 answers

Removing old versions of a delta file with python on blob storage

I'm trying to use the vacuum command on a delta file located in Azure blob storage, which is accessed through Databricks. However, when I run the following code, the old versions of the file are not being removed: path =…
euh
  • 319
  • 2
  • 11
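The usual reason nothing gets deleted is that VACUUM only removes files older than the retention threshold (7 days by default). Forcing a shorter retention requires disabling the safety check, sketched below with a placeholder path; this also gives up time travel to the removed versions:

```python
from delta.tables import DeltaTable

# Allow a retention shorter than the default 7 days (use with care).
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

dt = DeltaTable.forPath(
    spark, "abfss://container@account.dfs.core.windows.net/path/to/delta")  # placeholder
dt.vacuum(retentionHours=0)  # keep only the files referenced by the current version
```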
0
votes
0 answers

Change column name in backend delta file not just delta table

I need to change a column name in ADLS data files in delta format. Online research refers to renaming columns in a delta table. I created a delta table on top of the existing delta file and renamed the column - it is renamed only in the Hive metastore,…
Mohan Rayapuvari
  • 289
  • 1
  • 4
  • 18
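Renaming a column without rewriting the Parquet files is what Delta column mapping is for: the rename is recorded in the Delta log, not in the data files themselves. A sketch with a placeholder table name (requires a Delta/DBR version with column mapping support):

```python
# Enable column mapping by name, then rename the column logically.
spark.sql("""
  ALTER TABLE my_delta_table SET TBLPROPERTIES (
    'delta.minReaderVersion' = '2',
    'delta.minWriterVersion' = '5',
    'delta.columnMapping.mode' = 'name'
  )
""")
spark.sql("ALTER TABLE my_delta_table RENAME COLUMN old_name TO new_name")
```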
0
votes
0 answers

Append only new data from Event Hub in Scala

New data is being pushed to Event Hub frequently, and I wish to read these updates, apply transformations (joins, select, ...) to them, and then update an already existing delta table. Currently I am only working with a non-streaming DataFrame with just…
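The usual pattern here is Structured Streaming with a `foreachBatch` merge into the existing Delta table; shown in PySpark for brevity (the question uses Scala, but the API shape is the same) with placeholder names:

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "events_delta")  # placeholder target table

def upsert_batch(batch_df, batch_id):
    # Apply the joins/selects to the micro-batch, then merge only new/changed rows.
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.event_id = s.event_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(stream_df.writeStream               # stream_df: streaming DataFrame read from Event Hubs
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/mnt/checkpoints/events")
    .start())
```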
0
votes
1 answer

Read just the latest version of delta file on blob storage in Azure Data Factory

I have a delta file (consisting of metadata and fragmented parquet files) that I save with Databricks to Azure Blob Storage. Later, I try to read that file with an Azure Data Factory pipeline, but when using a copy activity it reads all the data in…
euh
  • 319
  • 2
  • 11
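The Delta reader resolves the `_delta_log` and returns only the current snapshot, whereas reading the folder as raw Parquet (which is what a plain copy activity effectively does) picks up every data fragment, old versions included. In Spark the distinction looks like this (the path is a placeholder); in ADF itself a Delta-aware source such as a mapping data flow's inline Delta dataset is needed:

```python
path = "abfss://container@account.dfs.core.windows.net/path/to/delta"  # placeholder

# Delta reader: only the files referenced by the latest snapshot.
latest_df = spark.read.format("delta").load(path)

# Raw Parquet reader: every fragment in the folder, old versions included.
all_fragments_df = spark.read.parquet(path)

# Time travel to a specific older version if needed.
v3_df = spark.read.format("delta").option("versionAsOf", 3).load(path)
```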
0
votes
1 answer

No Data Returned From Delta Table Although Delta Files Exist

I created a delta table in Databricks using SQL as: %sql create table nx_bronze_raw ( `Device` string ) USING DELTA LOCATION '/mnt/Databricks/bronze/devices/'; Then I ingest data (the device column) into this table using: bronze_path =…