Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It provides ACID transactions, scalable metadata handling, time travel (data versioning), a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs. A short usage sketch follows the feature list below.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
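
A minimal PySpark sketch of a few of these features (ACID writes, time travel, deletes, and the audit history); the table path and column names are made up for illustration:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession, functions as F

    # Assumes a session launched with the delta-core package on the classpath.
    spark = SparkSession.builder.getOrCreate()

    events = spark.createDataFrame([(1, "open"), (2, "close")], ["id", "action"])

    # ACID write in the open, Parquet-based Delta format.
    events.write.format("delta").mode("overwrite").save("/tmp/delta/events")

    # Time travel: read the table as of a specific version.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")

    # Updates and deletes through the DeltaTable API.
    table = DeltaTable.forPath(spark, "/tmp/delta/events")
    table.delete(F.col("id") == 2)

    # Audit history: every change is recorded in the transaction log.
    table.history().show()
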
1226 questions
0
votes
1 answer

Delta Lake MERGE INTO statement

I'm trying to run a Delta Lake MERGE INTO statement: MERGE INTO sessions USING updates ON sessions.sessionId = updates.sessionId WHEN MATCHED THEN UPDATE * WHEN NOT MATCHED THEN INSERT * and I'm getting an SQL error: ParseException: mismatched input 'MERGE'…
AlexV
  • 3,836
  • 7
  • 31
  • 37
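
For reference, a hedged sketch of the same merge expressed through the DeltaTable Python API rather than SQL; the table path is hypothetical and `updates` is assumed to be an ordinary DataFrame with a matching schema:

    from delta.tables import DeltaTable

    # `spark` is the active SparkSession; `updates` is the source DataFrame (assumed).
    sessions = DeltaTable.forPath(spark, "/delta/sessions")  # hypothetical path

    (sessions.alias("sessions")
        .merge(updates.alias("updates"),
               "sessions.sessionId = updates.sessionId")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

The SQL MERGE INTO form is typically only parsed when a Delta-aware SQL extension is enabled on a recent enough Spark and Delta version, which is a common source of this kind of ParseException.
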
0
votes
1 answer

Move delta lake files from one storage to another

I need to move my delta lake files to a new blobstore on a different subscription. Any ideas what's the best way to do this? I'm moving them to an ADLS Gen2 Storage; I think the previous storage was just blob storage. This delta lake is updated on an…
Jong
  • 1
  • 2
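
A hedged sketch of the simplest approach: read the table from the old account and rewrite it into the ADLS Gen2 path. The account names and paths are placeholders, and note that this produces a fresh table without the old transaction history:

    # `spark` is the active SparkSession; storage credentials are assumed to be
    # configured on the cluster (fs.azure.* settings). Paths are placeholders.
    old_path = "wasbs://data@oldaccount.blob.core.windows.net/delta/table"
    new_path = "abfss://data@newaccount.dfs.core.windows.net/delta/table"

    (spark.read.format("delta").load(old_path)
        .write.format("delta")
        .mode("overwrite")
        .save(new_path))

Copying the entire table directory, including _delta_log, with a storage tool is the usual way to keep the history, since Delta normally records its data files with relative paths.
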
0
votes
3 answers

How to make use of Delta Lake in a regular Scala project in an IDE

I've added the delta dependencies in my build.sbt libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % sparkVersion, "org.apache.spark" %% "spark-sql" % sparkVersion, "org.apache.spark" %% "spark-hive" % sparkVersion, //…
Animesh
  • 176
  • 1
  • 1
  • 10
0
votes
1 answer

Update array of structs - Spark

I have the following spark delta table structure, +---+------------------------------------------------------+ |id |addresses | +---+------------------------------------------------------+ |1 …
mani_nz
  • 4,522
  • 3
  • 28
  • 37
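
A hedged sketch of one way to rewrite a nested array column with Spark's higher-order transform function; the addresses element fields and the upper-casing are purely illustrative assumptions about the schema and the intended change:

    from pyspark.sql import functions as F

    path = "/delta/customers"  # hypothetical table path; `spark` is the active session
    df = spark.read.format("delta").load(path)

    # Rebuild each struct inside the array (assumed fields: city, zip).
    updated = df.withColumn(
        "addresses",
        F.expr("transform(addresses, a -> named_struct('city', upper(a.city), 'zip', a.zip))"),
    )

    updated.write.format("delta").mode("overwrite").save(path)
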
0
votes
1 answer

Delta Lake Spark compaction after merge operation gives 'DeltaTable' object has no attribute '_get_object_id' error

I am doing a Delta Lake merge operation using the Python API and PySpark. After doing the merge operation I call the compaction operation, but the compaction gives the following error: Error: File…
priyansh jain
  • 41
  • 2
  • 4
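
For reference, the compaction pattern from the Delta Lake docs operates on a DataFrame read from the table path rather than on the DeltaTable handle itself; a sketch with a hypothetical path and an arbitrary target file count:

    path = "/delta/events"   # hypothetical table path; `spark` is the active session
    num_files = 16           # target number of files, an arbitrary choice

    # Rewrite the same data into fewer files; dataChange=false signals to
    # downstream streaming readers that no new data was added.
    (spark.read.format("delta").load(path)
        .repartition(num_files)
        .write
        .option("dataChange", "false")
        .format("delta")
        .mode("overwrite")
        .save(path))
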
0
votes
2 answers

Delete a row from a target spark delta table when multiple columns in a row of the source table match the same columns of a single row in the target table

I want to update my target Delta table in Databricks when certain column values in a row match the same column values in the source table. The problem is when I have multiple rows in the source table that match one row in the target Delta table. This is a…
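
A hedged sketch of one way to express this: a MERGE needs each target row to match at most one source row, so the source is first de-duplicated on the matching columns before applying whenMatchedDelete (paths, DataFrame names, and column names are assumptions):

    from delta.tables import DeltaTable

    # `spark` is the active SparkSession; `source_df` is the source DataFrame (assumed).
    target = DeltaTable.forPath(spark, "/delta/target")   # hypothetical path
    match_cols = ["col1", "col2", "col3"]                 # hypothetical matching columns

    # Several identical source rows hitting one target row make the merge ambiguous,
    # so keep a single representative per key combination.
    dedup_source = source_df.dropDuplicates(match_cols)

    condition = " AND ".join(f"t.{c} = s.{c}" for c in match_cols)

    (target.alias("t")
        .merge(dedup_source.alias("s"), condition)
        .whenMatchedDelete()
        .execute())
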
0
votes
1 answer

Counting unique values on grouped data in a Spark Dataframe with Structured Streaming on Delta Lake

Hi everyone. I have a structured streaming pipeline on a Delta Lake. My last table is supposed to count how many unique IDs access a platform per week. I'm grouping the data by week in the stream; however, I cannot count the unique values of IDs on the other…
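
A hedged sketch of the usual workaround: exact distinct counts are not supported in streaming aggregations, so approx_count_distinct is typically used instead (column names, the watermark, and paths are assumptions):

    from pyspark.sql import functions as F

    # `events` is assumed to be a streaming DataFrame with event_time and user_id columns.
    weekly_unique = (events
        .withWatermark("event_time", "1 week")
        .groupBy(F.window("event_time", "1 week"))
        .agg(F.approx_count_distinct("user_id").alias("unique_users")))

    (weekly_unique.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", "/delta/weekly_unique/_checkpoint")
        .start("/delta/weekly_unique"))
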
0
votes
1 answer

Spark Update Multiple Columns in Delta from another table

I am trying to update multiple columns in one delta table based on values fetched from another delta table. The update SQL below works in Oracle but not in Spark Delta; can you please help? deptDf = sqlContext.createDataFrame( [(10, "IT",…
SWDeveloper
  • 319
  • 1
  • 4
  • 14
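
A hedged sketch of the usual substitute: Spark SQL has no Oracle-style correlated-subquery UPDATE, so a multi-column update from another table is expressed as a MERGE (the target path and column names are assumptions):

    from delta.tables import DeltaTable

    # `spark` is the active SparkSession; `deptDf` is the source DataFrame from the question.
    employees = DeltaTable.forPath(spark, "/delta/employees")   # hypothetical target table

    (employees.alias("e")
        .merge(deptDf.alias("d"), "e.dept_id = d.dept_id")
        .whenMatchedUpdate(set={
            # values are SQL expression strings referring to the source alias
            "dept_name": "d.dept_name",
            "dept_location": "d.dept_location",
        })
        .execute())
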
0
votes
1 answer

Spark merge (replace) on key containing multiple rows

I am using Apache Spark and I would like to merge two DataFrames, one containing existing data and the other one containing (potential) updates. The merge is supposed to happen on a given number of key attributes; however, for one set of key…
Rocreex
  • 160
  • 7
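
A hedged sketch of one plain-DataFrame way to do a replace-by-key merge: drop every existing row whose key appears in the updates, then union the updates back in (the DataFrame and key column names are assumptions):

    key_cols = ["entity_id"]   # hypothetical key attribute(s)

    # `existing` and `updates` are assumed to share the same schema.
    merged = (existing
        .join(updates.select(key_cols).distinct(), on=key_cols, how="left_anti")
        .unionByName(updates))
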
0
votes
1 answer

Can't insert string to Delta Table using Update in Pyspark

I have encountered an issue where it will not allow me to insert a string using update, and it returns an error. I'm running runtime 6.5 (includes Apache Spark 2.4.5, Scala 2.11), and it is not working on the 6.4 runtime either. I have a delta table with the following…
Jon
  • 4,593
  • 3
  • 13
  • 33
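
A hedged sketch of a common gotcha with DeltaTable.update in PySpark: the values in set are parsed as SQL expressions, so a bare string is resolved as a column name; wrap the literal in quotes or pass a Column built with lit() (the path, condition, and column names are assumptions):

    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    # `spark` is the active SparkSession.
    table = DeltaTable.forPath(spark, "/delta/items")   # hypothetical path

    # Both forms set `status` to the literal string 'active';
    # set={"status": "active"} would instead be treated as a column reference.
    table.update(condition="id = 1", set={"status": "'active'"})
    table.update(condition="id = 1", set={"status": F.lit("active")})
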
0
votes
2 answers

delta writestream .option("mergeSchema", "true") issue

I have a delta table of 3 columns with data. Now I have incoming data with 4 columns, so the DF.writeStream has to update the data location with at least 4 columns automatically, so we can recreate the table on top of the data location. Hence…
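
A hedged sketch of enabling schema merging on the Delta streaming sink so the new fourth column is added to the table schema automatically (the DataFrame name and paths are assumptions):

    # `incoming` is assumed to be the 4-column streaming DataFrame.
    (incoming.writeStream
        .format("delta")
        .outputMode("append")
        .option("mergeSchema", "true")                        # allow the extra column
        .option("checkpointLocation", "/delta/events/_checkpoint")
        .start("/delta/events"))
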
0
votes
1 answer

Delta Lake: File Not Found Exception

I am using Delta Lake to perform a merge operation, for which I am trying to convert my Parquet files, which are partitioned over time, to delta format: val source = spark.read.parquet("s3a://data-lake/source/") source .write …
Vishal
  • 1,442
  • 3
  • 29
  • 48
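
A hedged sketch of converting the partitioned Parquet directory in place instead of rewriting it, which leaves the existing files where they are and only writes a _delta_log next to them (the partition column name and type are assumptions):

    from delta.tables import DeltaTable

    # `spark` is the active SparkSession.
    DeltaTable.convertToDelta(
        spark,
        "parquet.`s3a://data-lake/source/`",
        "event_date date",   # hypothetical partition column and type
    )
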
0
votes
1 answer

Strange requests searching for _delta_log when using custom FileFormat on databricks

I observe very strange requests issued by Databricks when using a custom file format. I need to implement a custom FileFormat to read binary files in Spark SQL. I've implemented the FileFormat class (the implementation is mostly a copy/paste from…
0
votes
1 answer

what is spark.databricks.delta.snapshotPartitions configuration used for in delta lake?

I was going through Delta Lake and came across the configuration spark.databricks.delta.snapshotPartitions; however, I'm not quite sure what it is used for. I can't find it in the Delta Lake documentation either. In the Delta Lake GitHub repo I found the code below, but…
ravi malhotra
  • 703
  • 5
  • 14
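
For context, this setting controls the number of partitions Delta uses when it replays the transaction log to reconstruct the current table snapshot; a hedged one-liner for tuning it, with an arbitrary value:

    # `spark` is the active SparkSession; the default in open-source Delta is 50.
    spark.conf.set("spark.databricks.delta.snapshotPartitions", 10)
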
0
votes
2 answers

Problem combining delta.io and spark-bigquery 0.15.x-beta

I am trying to update my code to the new spark-bigquery connector 0.15.{0,1}-beta and found that the delta format is not working anymore. I cannot read or write using the delta format. Here you can find a minimal example of writing a dataframe using delta…
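
A hedged sketch of how the two connectors are usually put on the classpath together; the connector version comes from the question, the Delta and Scala versions are assumptions that must match the cluster, and the actual incompatibility may lie elsewhere:

    from pyspark.sql import SparkSession

    # Hypothetical session setup pulling both packages; versions are assumptions.
    spark = (SparkSession.builder
        .appName("delta-plus-bigquery")
        .config("spark.jars.packages",
                "io.delta:delta-core_2.12:0.7.0,"
                "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.15.1-beta")
        .getOrCreate())
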