Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and full compatibility with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
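
A minimal PySpark sketch of a few of these features follows, assuming the delta-spark package is installed and using a hypothetical local path /tmp/delta/events chosen only for illustration (on Databricks the session configuration below is not needed). It writes a Delta table, reads an earlier version with time travel, and upserts rows with MERGE:

    # A minimal sketch, assuming `pip install delta-spark`; the path and app
    # name are placeholders, not part of any particular question below.
    from delta import configure_spark_with_delta_pip
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    builder = (
        SparkSession.builder.appName("delta-lake-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    path = "/tmp/delta/events"  # hypothetical, writable path

    # Open format + ACID: data is stored as Parquet files plus a transaction log.
    spark.range(0, 5).toDF("id").write.format("delta").mode("overwrite").save(path)

    # Time travel: read the table as it was at an earlier version.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

    # Updates and deletes: upsert new rows with MERGE.
    updates = spark.range(3, 8).toDF("id")
    (
        DeltaTable.forPath(spark, path).alias("t")
        .merge(updates.alias("u"), "t.id = u.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )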
1226 questions
0 votes • 0 answers

Unpartitioned data has persisted in S3 after adding a partition to a Delta table

I have a delta table with a destination in S3 created as such: ( df .write .mode('append') .option('path', 's3://) ) The data was originally written unpartitioned and I sought to add a…
TomNash • 3,147 • 2 • 21 • 57

0 votes • 0 answers

Py4JJavaError: An error occurred while calling o490.execute

I am getting the below error when trying to run the merge command. The error does not give enough information about its cause. I don't see any failed tasks in the Spark UI. Any suggestions on how to debug this issue? Command I am…
VE88 • 125 • 1 • 5

0 votes • 1 answer

Getting Error: is not a delta table in Databricks

The Question is: Complete the writeToBronze function to perform the following tasks: Write the stream from gamingEventDF -- the stream defined above -- to a bronze Delta table in path defined by outputPathBronze. Convert the (nested) input column…
Sivaani N • 19 • 1 • 8

0 votes • 0 answers

Does Databricks Delta table maintain column addition or deletion versions?

I have a use case where the table columns will be changing [addition/deletion] at each refresh [currently it's a weekly refresh]. It is stored in Delta format. Is there any way we can track the version of these column additions/deletions, like a kind…
MJ029 • 151 • 1 • 12

0 votes • 0 answers

Spark - Not able to recognize delta table, but it's a delta table

I am trying to use Spark 3.2.0 and delta-Core-1.1.0.jar, and am getting the following error: org.apache.spark.sql.AnalysisException: default.testData is not a Delta table. This is how I am saving the dataset: tableDataset.write().option("path",…
Daksh • 148 • 11

0 votes • 1 answer

How to dynamically add new columns with their datatypes to an existing Delta table and update the new columns with values

Scenario: df1 ---> Col1,Col2,Col3 -- which are the columns in the delta table df2 ---> Col1,Col2,Col3,Col4,Col5 -- which are the columns in the latest refresh table How to get the new columns (in the above Col4,Col5) with datatypes…
skp • 314 • 1 • 14

0 votes • 1 answer

Can't create a Delta Lake table in Hive

I'm trying to create a Delta Lake table in Hive 3. I moved the delta-hive-assembly_2.11-0.3.0.jar to the Hive aux directory and ran, in the Hive CLI: SET hive.input.format=io.delta.hive.HiveInputFormat; SET…
M. Alexandru • 614 • 5 • 20

0 votes • 2 answers

Error writing a partitioned Delta Table from a multitasking job in azure databricks

I have a notebook that writes a delta table with a statement similar to the following: match = "current.country = updates.country and current.process_date = updates.process_date" deltaTable = DeltaTable.forPath(spark,…
0 votes • 2 answers

Different clusters: Spark Structured Streaming from a Delta file on cluster A to cluster B

I am trying to stream a delta table from cluster A to Cluster B, but I am not able to load or write data to a different cluster: streamingDf = spark.readStream.format("delta").option("ignoreChanges", "true") \ …
0 votes • 0 answers

Databricks Delta Tables Exception thrown in awaitResult when creating a table

We run our ETL jobs on Databricks Notebooks which we will execute via Azure Data Factory. We use the Delta table format, and register all tables in Databricks. We use databricks runtime 7.3 with scala 2.12 and spark 3.0.1. In our jobs we first DROP…
Aaron Brinker • 21 • 1 • 4

0 votes • 1 answer

Delta Table / Athena And Spark

I have my delta table, which can be read from Athena. When I try to get the data through a query from spark I get the following error: Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 80.0 failed 4…
0 votes • 0 answers

Delta write with Databricks job is not working as expected

I wrote the following code in python: val input = spark.read.format("csv").option("header",true).load("input_path") input.write.format("delta") .partitionBy("col1Name","col2Name") .mode("overwrite") .save("output_path") input…
scalacode • 1,096 • 1 • 16 • 38

0 votes • 1 answer

Delta lake write optimization

I am writing partitioned data to Delta Lake. The dataset is around 10 GB. Currently it is taking 30 minutes to write to the S3 bucket. df.write.partitionBy("dated").format("delta").mode("append").save("bucket_EU/temp") How can I better optimize…
code_bug • 355 • 1 • 12

0 votes • 1 answer

Merge with Multiple Conditions in DeltaTable using Pyspark

I built a process using a Delta table to upsert my data with the ID_CLIENT and ID_PRODUCT keys, but I am getting the error "Merge as multiple source rows matched". Is it possible to perform the merge with multiple…
Bruno • 17 • 3

0 votes • 1 answer

How to handle mergeschema option for differing datatypes in Databricks?

import spark.implicits._ val data = Seq(("James","Sales",34)) val df1 = data.toDF("name","dept","age") df1.printSchema() df1.write.option("mergeSchema", "true").format("delta").save("/location") val data2 = Seq(("Tiger","Sales","34") ) var df2 =…
boom_clap • 129 • 1 • 12