Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and full compatibility with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
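
A minimal PySpark sketch of a few of these features follows, assuming the delta-spark package is installed and using a hypothetical local path /tmp/delta/events chosen only for illustration (on Databricks the session configuration below is not needed). It writes a Delta table, reads an earlier version with time travel, and upserts rows with MERGE:

    # A minimal sketch, assuming `pip install delta-spark`; the path and app
    # name are placeholders, not part of any particular question below.
    from delta import configure_spark_with_delta_pip
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    builder = (
        SparkSession.builder.appName("delta-lake-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    path = "/tmp/delta/events"  # hypothetical, writable path

    # Open format + ACID: data is stored as Parquet files plus a transaction log.
    spark.range(0, 5).toDF("id").write.format("delta").mode("overwrite").save(path)

    # Time travel: read the table as it was at an earlier version.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

    # Updates and deletes: upsert new rows with MERGE.
    updates = spark.range(3, 8).toDF("id")
    (
        DeltaTable.forPath(spark, path).alias("t")
        .merge(updates.alias("u"), "t.id = u.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )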
1226 questions
0 votes • 0 answers

Unpartitioned data has persisted in S3 after adding a partition to a Delta table

I have a delta table with a destination in S3 created as such: ( df .write .mode('append') .option('path', 's3://) ) The data was originally written unpartitioned and I sought to add a…
TomNash • 3,147 • 2 • 21 • 57

0 votes • 0 answers

Py4JJavaError: An error occurred while calling o490.execute

I am getting the below error when trying to run the merge command. The error does not give enough information about its cause. I don't see any failed tasks in the Spark UI. Any suggestions on how to debug this issue? Command I am…
VE88 • 125 • 1 • 5

0 votes • 1 answer

Getting Error: is not a delta table in Databricks

The Question is: Complete the writeToBronze function to perform the following tasks: Write the stream from gamingEventDF -- the stream defined above -- to a bronze Delta table in path defined by outputPathBronze. Convert the (nested) input column…
Sivaani N • 19 • 1 • 8

0 votes • 0 answers

Does Databricks Delta table maintain column addition or deletion versions?

I have a use case where the table columns will be changing [addition/deletion] at each refresh [currently it's a weekly refresh]. It is stored in Delta format. Is there any way we can track the version of these column additions/deletions, like a kind…
MJ029 • 151 • 1 • 12

0 votes • 0 answers

Spark - Not able to recognize delta table, but it's a delta table

I am trying to use Spark 3.2.0 and delta-Core-1.1.0.jar, and am getting the following error: org.apache.spark.sql.AnalysisException: default.testData is not a Delta table. This is how I am saving the dataset: tableDataset.write().option("path",…
Daksh • 148 • 11

0 votes • 1 answer

How to dynamically add new columns with their datatypes to an existing Delta table and update the new columns with values

Scenario: df1 ---> Col1,Col2,Col3 -- which are the columns in the delta table df2 ---> Col1,Col2,Col3,Col4,Col5 -- which are the columns in the latest refresh table How to get the new columns (in the above Col4,Col5) with datatypes…
skp • 314 • 1 • 14

0 votes • 1 answer

Can't create a Delta Lake table in Hive

I'm trying to create a Delta Lake table in Hive 3. I moved the delta-hive-assembly_2.11-0.3.0.jar to the Hive aux directory and ran, in the Hive CLI: SET hive.input.format=io.delta.hive.HiveInputFormat; SET…
M. Alexandru • 614 • 5 • 20

0 votes • 2 answers

Error writing a partitioned Delta Table from a multitasking job in azure databricks

I have a notebook that writes a delta table with a statement similar to the following: match = "current.country = updates.country and current.process_date = updates.process_date" deltaTable = DeltaTable.forPath(spark,…
0 votes • 2 answers

Different clusters: Spark Structured Streaming from a Delta file on cluster A to cluster B

I am trying to stream a delta table from cluster A to Cluster B, but I am not able to load or write data to a different cluster: streamingDf = spark.readStream.format("delta").option("ignoreChanges", "true") \ …
0 votes • 0 answers

Databricks Delta Tables Exception thrown in awaitResult when creating a table

We run our ETL jobs on Databricks Notebooks which we will execute via Azure Data Factory. We use the Delta table format, and register all tables in Databricks. We use databricks runtime 7.3 with scala 2.12 and spark 3.0.1. In our jobs we first DROP…
Aaron Brinker • 21 • 1 • 4

0 votes • 1 answer

Delta Table / Athena And Spark

I have my delta table, which can be read from Athena. When I try to get the data through a query from spark I get the following error: Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 80.0 failed 4…
0 votes • 0 answers

Delta write with Databricks job is not working as expected

I wrote the following code in python: val input = spark.read.format("csv").option("header",true).load("input_path") input.write.format("delta") .partitionBy("col1Name","col2Name") .mode("overwrite") .save("output_path") input…
scalacode • 1,096 • 1 • 16 • 38

0 votes • 1 answer

Delta lake write optimization

I am writing partitioned data to Delta Lake. The dataset is around 10 GB. Currently it is taking 30 minutes to write to the S3 bucket. df.write.partitionBy("dated").format("delta").mode("append").save("bucket_EU/temp") How can I better optimize…
code_bug • 355 • 1 • 12

0 votes • 1 answer

Merge with Multiple Conditions in DeltaTable using Pyspark

I built a process using a Delta table to upsert my data with the ID_CLIENT and ID_PRODUCT keys, but I am getting the error "Merge as multiple source rows matched". Is it possible to perform the merge with multiple…
Bruno • 17 • 3

0 votes • 1 answer

How to handle mergeschema option for differing datatypes in Databricks?

import spark.implicits._ val data = Seq(("James","Sales",34)) val df1 = data.toDF("name","dept","age") df1.printSchema() df1.write.option("mergeSchema", "true").format("delta").save("/location") val data2 = Seq(("Tiger","Sales","34") ) var df2 =…
boom_clap • 129 • 1 • 12