Questions tagged [delta-lake]

Delta Lake is an open source storage layer that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel (data versioning), a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs. A minimal usage sketch appears after the feature list below.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
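To make the quoted features concrete, here is a minimal PySpark sketch (paths and table names are illustrative; it assumes a session where `spark` is available and the delta-spark package is configured):

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Write a DataFrame as a Delta table (ACID commits, Parquet files underneath).
df = spark.range(0, 5).withColumn("value", F.col("id") * 10)
df.write.format("delta").mode("overwrite").save("/tmp/delta/demo")

# Time travel: read an earlier snapshot by version number.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/demo")

# Updates, deletes and merge through the DeltaTable API.
tbl = DeltaTable.forPath(spark, "/tmp/delta/demo")
tbl.update(condition=F.col("id") == 1, set={"value": F.lit(999)})
tbl.delete(F.col("id") == 4)

updates = spark.range(3, 7).withColumn("value", F.lit(-1))
(tbl.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Audit history: every change recorded in the transaction log.
tbl.history().show()
```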
1226 questions
0
votes
1 answer

How to wait for df.write() to be complete

In my pyspark notebook, I (1) read from tables to create data frames, (2) aggregate the data frame, (3) write to a folder, and (4) create a SQL table from the output folder. For #1 I do `spark.read.format("delta").load(path)`; for #2 I do `df =…
n179911a
  • 125
  • 1
  • 8
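A note on this one: `DataFrameWriter.save()` is a blocking action, so the write has finished by the time the call returns. A minimal sketch of the read → aggregate → write → register flow (paths, columns and the table name are placeholders):

```python
from pyspark.sql import functions as F

# 1. Read the Delta source into a DataFrame.
df = spark.read.format("delta").load("/mnt/source/table")

# 2. Aggregate the DataFrame.
agg = df.groupBy("some_key").agg(F.count("*").alias("cnt"))

# 3. Write to a folder. save() blocks until the Delta commit completes,
#    so no extra wait step is needed before the next line runs.
agg.write.format("delta").mode("overwrite").save("/mnt/output/agg")

# 4. Register a SQL table over the output folder.
spark.sql("CREATE TABLE IF NOT EXISTS agg_table USING DELTA LOCATION '/mnt/output/agg'")
```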
0
votes
0 answers

Can Databricks Dolly be trained on Databricks tables for generating insights using prompts?

I'm exploring the capabilities of Databricks, specifically Databricks Dolly, and I'm wondering if it's possible to train Dolly on Databricks tables to generate insights by writing prompts. I have a Databricks environment set up and I'm working with…
0
votes
2 answers

JSON column in a Delta table in Databricks

I have a view in Databricks, and in one column the content is JSON,…
Amdbi
  • 11
  • 3
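A common pattern for a JSON column like this is `from_json` with an explicit schema; the view, column and schema below are placeholders:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical schema for the JSON payload; adjust to the real structure.
json_schema = StructType([
    StructField("id", LongType()),
    StructField("status", StringType()),
])

df = spark.table("my_view")  # placeholder view name
parsed = (df
    .withColumn("parsed", F.from_json(F.col("json_col"), json_schema))
    .select("*", "parsed.*")          # flatten the struct into top-level columns
    .drop("parsed", "json_col"))
```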
0
votes
1 answer

Medallion structure in MS Fabric

I have to implement a delta lake using MS Fabric, and I don't know how to implement the medallion structure in this new tool. Any ideas or help? This is the official documentation.
willy sepulveda
  • 159
  • 1
  • 14
0
votes
0 answers

How can I optimize writing BIG files?

I have a dataframe with 120 million records, and when I try to write it, it takes 45 minutes. The DF is partitioned by a field "id_date", which is in the format yyyymmdd. This DF is a delta table in Databricks. I have tried auto optimize, compaction, etc…
tempo
  • 23
  • 4
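For a write of this size, a typical approach is to control the number of files per partition and, on Databricks, enable the auto-optimize table properties; path and column names below are placeholders:

```python
# Co-locate rows by partition value so each "id_date" folder gets a few
# reasonably sized files instead of many tiny ones.
(df.repartition("id_date")
   .write.format("delta")
   .partitionBy("id_date")
   .mode("overwrite")
   .save("/mnt/delta/big_table"))

# Databricks table properties that coalesce small files at write time.
spark.sql("""
  ALTER TABLE delta.`/mnt/delta/big_table`
  SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )
""")
```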
0
votes
1 answer

An approach to handle multiple updates on a Delta table in PySpark by multiple users at the same time

I have a PySpark notebook that reads data from a SQL table and merges the changes into the final layer. It works fine when it is triggered by a single user, but throws the error "Concurrent Update on the STG table has failed" when multiple users…
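For concurrent merges like this, the usual mitigations are to include the partition column in the merge condition (so writers touch disjoint files) and to retry on conflict. A sketch with hypothetical table and column names:

```python
import time
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "final_layer")  # placeholder target table

def merge_with_retry(updates_df, max_retries=3):
    for attempt in range(max_retries):
        try:
            (target.alias("t")
                .merge(updates_df.alias("s"),
                       # partition column in the condition narrows the conflict scope
                       "t.id = s.id AND t.load_date = s.load_date")
                .whenMatchedUpdateAll()
                .whenNotMatchedInsertAll()
                .execute())
            return
        except Exception as exc:  # e.g. ConcurrentAppendException
            if "Concurrent" in str(exc) and attempt < max_retries - 1:
                time.sleep(5 * (attempt + 1))  # simple backoff, then retry
            else:
                raise
```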
0
votes
0 answers

assign `generated always as identity` values in Delta tables programmatically

Is there any way to programmatically access the mechanism used in Databricks to assign Delta table identity columns, so that you can assign the values to a Dataframe pre-insert? Is this available in an io.delta class anywhere or is it something…
wrschneider
  • 17,913
  • 16
  • 96
  • 176
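The identity generator itself is not exposed as a public API. A common workaround for pre-assigning values is to compute them from the current maximum, sketched below with hypothetical table, column and DataFrame names; note that a column declared GENERATED ALWAYS rejects explicit values, so this only applies to GENERATED BY DEFAULT or plain columns, and it is only safe with a single writer:

```python
from pyspark.sql import functions as F, Window

# Current maximum of the identity column in the target table.
max_id = spark.sql("SELECT COALESCE(MAX(id), 0) AS m FROM target_table").first()["m"]

# Assign consecutive values on top of it before the insert.
w = Window.orderBy(F.monotonically_increasing_id())
new_rows_with_ids = new_rows_df.withColumn("id", F.row_number().over(w) + F.lit(max_id))
```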
0
votes
0 answers

Delta lake Spark - whenNotMatchedInsert() do nothing

We have generic delta lake merge code that has both update and insert in a single execute statement. I have some situations where I need to update the values in the table (the insert should be blank), and I need to insert the values in the table ##only…
Joe
  • 47
  • 7
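Merge clauses are optional, so an update-only merge is simply a merge without any `whenNotMatched*` clause; a sketch with a placeholder path and join key:

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/mnt/delta/target")  # placeholder path

# Matching rows are updated; unmatched source rows are simply ignored
# because no whenNotMatchedInsert* clause is attached.
(target.alias("t")
    .merge(source_df.alias("s"), "t.key = s.key")   # source_df: the incoming data
    .whenMatchedUpdateAll()
    .execute())
```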
0
votes
1 answer

Got "java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.FileSourceOptions$" when spark-submit to Amazon EMR

I have a Spark application. My build.sbt looks like name := "IngestFromS3ToKafka" version := "1.0" scalaVersion := "2.12.17" resolvers += "confluent" at "https://packages.confluent.io/maven/" val sparkVersion = "3.3.1" libraryDependencies ++=…
Hongbo Miao
  • 45,290
  • 60
  • 174
  • 267
0
votes
0 answers

Running into delta.exceptions.ConcurrentAppendException even after setting up S3 Multi-Cluster Writes environment via S3 Dynamo DB LogStore

My use-case is to process a dataset spanning hundreds of partitions concurrently. The data is partitioned, and the partitions are disjoint. I was facing ConcurrentAppendException due to S3 not supporting the "put-if-absent" consistency guarantee. From Delta Lake…
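Worth noting for this one: the DynamoDB LogStore only makes commits atomic on S3, while ConcurrentAppendException comes from Delta's logical conflict detection. The documented mitigation is to make the disjoint partitions explicit in each operation's condition, roughly as below (table path, column and value are placeholders):

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "s3a://my-bucket/delta/events")  # placeholder

# Each concurrent job pins its own partition value in the condition, so Delta
# can prove the writers touch disjoint files and does not flag a conflict.
(target.alias("t")
    .merge(updates_df.alias("s"),
           "t.part_date = s.part_date AND t.part_date = '2023-06-01' AND t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```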
0
votes
0 answers

Removing old versions of a delta file with python on blob storage

I'm trying to use the vacuum command on a delta file located in Azure blob storage, which is accessed through Databricks. However, when I run the following code, the old versions of the file are not being removed: path =…
euh
  • 319
  • 2
  • 11
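The usual reason nothing gets deleted is that VACUUM only removes files older than the retention threshold (7 days by default). Forcing a shorter retention requires disabling the safety check, sketched below with a placeholder path; this also gives up time travel to the removed versions:

```python
from delta.tables import DeltaTable

# Allow a retention shorter than the default 7 days (use with care).
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

dt = DeltaTable.forPath(
    spark, "abfss://container@account.dfs.core.windows.net/path/to/delta")  # placeholder
dt.vacuum(retentionHours=0)  # keep only the files referenced by the current version
```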
0
votes
0 answers

Change column name in backend delta file not just delta table

I need to change a column name in ADLS data files in delta format. Online research refers to renaming columns in a delta table. I created a delta table on top of the existing delta file and renamed the column - it is renamed only in the Hive metastore,…
Mohan Rayapuvari
  • 289
  • 1
  • 4
  • 18
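Renaming a column without rewriting the Parquet files is what Delta column mapping is for: the rename is recorded in the Delta log, not in the data files themselves. A sketch with a placeholder table name (requires a Delta/DBR version with column mapping support):

```python
# Enable column mapping by name, then rename the column logically.
spark.sql("""
  ALTER TABLE my_delta_table SET TBLPROPERTIES (
    'delta.minReaderVersion' = '2',
    'delta.minWriterVersion' = '5',
    'delta.columnMapping.mode' = 'name'
  )
""")
spark.sql("ALTER TABLE my_delta_table RENAME COLUMN old_name TO new_name")
```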
0
votes
0 answers

Append only new data from Event Hub in Scala

New data is being pushed to Event Hub frequently, and I wish to read these updates, apply transformations (joins, select, ...) to them, and then update an already existing delta table. Currently I am only working with a non-streaming DataFrame with just…
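The usual pattern here is Structured Streaming with a `foreachBatch` merge into the existing Delta table; shown in PySpark for brevity (the question uses Scala, but the API shape is the same) with placeholder names:

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "events_delta")  # placeholder target table

def upsert_batch(batch_df, batch_id):
    # Apply the joins/selects to the micro-batch, then merge only new/changed rows.
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.event_id = s.event_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(stream_df.writeStream               # stream_df: streaming DataFrame read from Event Hubs
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/mnt/checkpoints/events")
    .start())
```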
0
votes
1 answer

Read just the latest version of delta file on blob storage in Azure Data Factory

I have a delta file (consisting of metadata and fragmented parquet files) that I save with Databricks to Azure Blob Storage. Later, I try to read that file with an Azure Data Factory pipeline, but when using a copy activity it reads all the data in…
euh
  • 319
  • 2
  • 11
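The Delta reader resolves the `_delta_log` and returns only the current snapshot, whereas reading the folder as raw Parquet (which is what a plain copy activity effectively does) picks up every data fragment, old versions included. In Spark the distinction looks like this (the path is a placeholder); in ADF itself a Delta-aware source such as a mapping data flow's inline Delta dataset is needed:

```python
path = "abfss://container@account.dfs.core.windows.net/path/to/delta"  # placeholder

# Delta reader: only the files referenced by the latest snapshot.
latest_df = spark.read.format("delta").load(path)

# Raw Parquet reader: every fragment in the folder, old versions included.
all_fragments_df = spark.read.parquet(path)

# Time travel to a specific older version if needed.
v3_df = spark.read.format("delta").option("versionAsOf", 3).load(path)
```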
0
votes
1 answer

No Data Returned From Delta Table Although Delta Files Exist

I created a delta table in Databricks using SQL as: %sql create table nx_bronze_raw ( `Device` string ) USING DELTA LOCATION '/mnt/Databricks/bronze/devices/'; Then I ingest data (the device column) into this table using: bronze_path =…