Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It provides ACID transactions, scalable metadata handling, time travel (data versioning), a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs. A short usage sketch follows the feature list below.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
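
A minimal PySpark sketch of a few of these features (ACID writes, time travel, deletes, and the audit history); the table path and column names are made up for illustration:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession, functions as F

    # Assumes a session launched with the delta-core package on the classpath.
    spark = SparkSession.builder.getOrCreate()

    events = spark.createDataFrame([(1, "open"), (2, "close")], ["id", "action"])

    # ACID write in the open, Parquet-based Delta format.
    events.write.format("delta").mode("overwrite").save("/tmp/delta/events")

    # Time travel: read the table as of a specific version.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")

    # Updates and deletes through the DeltaTable API.
    table = DeltaTable.forPath(spark, "/tmp/delta/events")
    table.delete(F.col("id") == 2)

    # Audit history: every change is recorded in the transaction log.
    table.history().show()
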
1226 questions
0
votes
1 answer

Delta Lake MERGE INTO statement

I'm trying to run a Delta Lake MERGE INTO statement: MERGE INTO sessions USING updates ON sessions.sessionId = updates.sessionId WHEN MATCHED THEN UPDATE * WHEN NOT MATCHED THEN INSERT * and I'm getting an SQL error: ParseException: mismatched input 'MERGE'…
AlexV
  • 3,836
  • 7
  • 31
  • 37
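
For reference, a hedged sketch of the same merge expressed through the DeltaTable Python API rather than SQL; the table path is hypothetical and `updates` is assumed to be an ordinary DataFrame with a matching schema:

    from delta.tables import DeltaTable

    # `spark` is the active SparkSession; `updates` is the source DataFrame (assumed).
    sessions = DeltaTable.forPath(spark, "/delta/sessions")  # hypothetical path

    (sessions.alias("sessions")
        .merge(updates.alias("updates"),
               "sessions.sessionId = updates.sessionId")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

The SQL MERGE INTO form is typically only parsed when a Delta-aware SQL extension is enabled on a recent enough Spark and Delta version, which is a common source of this kind of ParseException.
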
0
votes
1 answer

Move delta lake files from one storage to another

I need to move my delta lake files to a new blobstore on a different subscription. Any ideas what's the best way to do this? I'm moving them to an ADLS Gen2 Storage; I think the previous storage was just blob storage. This delta lake is updated on an…
Jong
  • 1
  • 2
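
A hedged sketch of the simplest approach: read the table from the old account and rewrite it into the ADLS Gen2 path. The account names and paths are placeholders, and note that this produces a fresh table without the old transaction history:

    # `spark` is the active SparkSession; storage credentials are assumed to be
    # configured on the cluster (fs.azure.* settings). Paths are placeholders.
    old_path = "wasbs://data@oldaccount.blob.core.windows.net/delta/table"
    new_path = "abfss://data@newaccount.dfs.core.windows.net/delta/table"

    (spark.read.format("delta").load(old_path)
        .write.format("delta")
        .mode("overwrite")
        .save(new_path))

Copying the entire table directory, including _delta_log, with a storage tool is the usual way to keep the history, since Delta normally records its data files with relative paths.
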
0
votes
3 answers

How to make use of Delta Lake in a regular Scala project in an IDE

I've added the delta dependencies in my build.sbt libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % sparkVersion, "org.apache.spark" %% "spark-sql" % sparkVersion, "org.apache.spark" %% "spark-hive" % sparkVersion, //…
Animesh
  • 176
  • 1
  • 1
  • 10
0
votes
1 answer

Update array of structs - Spark

I have the following spark delta table structure, +---+------------------------------------------------------+ |id |addresses | +---+------------------------------------------------------+ |1 …
mani_nz
  • 4,522
  • 3
  • 28
  • 37
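
A hedged sketch of one way to rewrite a nested array column with Spark's higher-order transform function; the addresses element fields and the upper-casing are purely illustrative assumptions about the schema and the intended change:

    from pyspark.sql import functions as F

    path = "/delta/customers"  # hypothetical table path; `spark` is the active session
    df = spark.read.format("delta").load(path)

    # Rebuild each struct inside the array (assumed fields: city, zip).
    updated = df.withColumn(
        "addresses",
        F.expr("transform(addresses, a -> named_struct('city', upper(a.city), 'zip', a.zip))"),
    )

    updated.write.format("delta").mode("overwrite").save(path)
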
0
votes
1 answer

Delta Lake Spark compaction after merge operation gives 'DeltaTable' object has no attribute '_get_object_id' error

I am doing a Delta Lake merge operation using the Python API and PySpark. After doing the merge operation I call the compaction operation, but the compaction gives the following error: Error: File…
priyansh jain
  • 41
  • 2
  • 4
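
For reference, the compaction pattern from the Delta Lake docs operates on a DataFrame read from the table path rather than on the DeltaTable handle itself; a sketch with a hypothetical path and an arbitrary target file count:

    path = "/delta/events"   # hypothetical table path; `spark` is the active session
    num_files = 16           # target number of files, an arbitrary choice

    # Rewrite the same data into fewer files; dataChange=false signals to
    # downstream streaming readers that no new data was added.
    (spark.read.format("delta").load(path)
        .repartition(num_files)
        .write
        .option("dataChange", "false")
        .format("delta")
        .mode("overwrite")
        .save(path))
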
0
votes
2 answers

Delete a row from a target spark delta table when multiple columns in a row of the source table match the same columns of a single row in the target table

I want to update my target Delta table in Databricks when certain column values in a row match the same column values in the source table. The problem is when I have multiple rows in the source table that match one row in the target Delta table. This is a…
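
A hedged sketch of one way to express this: a MERGE needs each target row to match at most one source row, so the source is first de-duplicated on the matching columns before applying whenMatchedDelete (paths, DataFrame names, and column names are assumptions):

    from delta.tables import DeltaTable

    # `spark` is the active SparkSession; `source_df` is the source DataFrame (assumed).
    target = DeltaTable.forPath(spark, "/delta/target")   # hypothetical path
    match_cols = ["col1", "col2", "col3"]                 # hypothetical matching columns

    # Several identical source rows hitting one target row make the merge ambiguous,
    # so keep a single representative per key combination.
    dedup_source = source_df.dropDuplicates(match_cols)

    condition = " AND ".join(f"t.{c} = s.{c}" for c in match_cols)

    (target.alias("t")
        .merge(dedup_source.alias("s"), condition)
        .whenMatchedDelete()
        .execute())
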
0
votes
1 answer

Counting unique values on grouped data in a Spark Dataframe with Structured Streaming on Delta Lake

Hi everyone. I have a structured streaming pipeline on a Delta Lake. My last table is supposed to count how many unique IDs access a platform per week. I'm grouping the data by week in the stream; however, I cannot count the unique values of IDs on the other…
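
A hedged sketch of the usual workaround: exact distinct counts are not supported in streaming aggregations, so approx_count_distinct is typically used instead (column names, the watermark, and paths are assumptions):

    from pyspark.sql import functions as F

    # `events` is assumed to be a streaming DataFrame with event_time and user_id columns.
    weekly_unique = (events
        .withWatermark("event_time", "1 week")
        .groupBy(F.window("event_time", "1 week"))
        .agg(F.approx_count_distinct("user_id").alias("unique_users")))

    (weekly_unique.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", "/delta/weekly_unique/_checkpoint")
        .start("/delta/weekly_unique"))
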
0
votes
1 answer

Spark Update Multiple Columns in Delta from another table

I am trying to update multiple columns in one delta table based on values fetched from another delta table. The update SQL below works in Oracle but not in Spark Delta; can you please help? deptDf = sqlContext.createDataFrame( [(10, "IT",…
SWDeveloper
  • 319
  • 1
  • 4
  • 14
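
A hedged sketch of the usual substitute: Spark SQL has no Oracle-style correlated-subquery UPDATE, so a multi-column update from another table is expressed as a MERGE (the target path and column names are assumptions):

    from delta.tables import DeltaTable

    # `spark` is the active SparkSession; `deptDf` is the source DataFrame from the question.
    employees = DeltaTable.forPath(spark, "/delta/employees")   # hypothetical target table

    (employees.alias("e")
        .merge(deptDf.alias("d"), "e.dept_id = d.dept_id")
        .whenMatchedUpdate(set={
            # values are SQL expression strings referring to the source alias
            "dept_name": "d.dept_name",
            "dept_location": "d.dept_location",
        })
        .execute())
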
0
votes
1 answer

Spark merge (replace) on key containing multiple rows

I am using Apache Spark and I would like to merge two DataFrames, one containing existing data and the other one containing (potential) updates. The merge is supposed to happen on a given number of key attributes; however, for one set of key…
Rocreex
  • 160
  • 7
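
A hedged sketch of one plain-DataFrame way to do a replace-by-key merge: drop every existing row whose key appears in the updates, then union the updates back in (the DataFrame and key column names are assumptions):

    key_cols = ["entity_id"]   # hypothetical key attribute(s)

    # `existing` and `updates` are assumed to share the same schema.
    merged = (existing
        .join(updates.select(key_cols).distinct(), on=key_cols, how="left_anti")
        .unionByName(updates))
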
0
votes
1 answer

Can't insert string to Delta Table using Update in Pyspark

I have encountered an issue where it will not allow me to insert a string using update, and it returns an error. I'm running runtime 6.5 (includes Apache Spark 2.4.5, Scala 2.11), and it is not working on the 6.4 runtime either. I have a delta table with the following…
Jon
  • 4,593
  • 3
  • 13
  • 33
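
A hedged sketch of a common gotcha with DeltaTable.update in PySpark: the values in set are parsed as SQL expressions, so a bare string is resolved as a column name; wrap the literal in quotes or pass a Column built with lit() (the path, condition, and column names are assumptions):

    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    # `spark` is the active SparkSession.
    table = DeltaTable.forPath(spark, "/delta/items")   # hypothetical path

    # Both forms set `status` to the literal string 'active';
    # set={"status": "active"} would instead be treated as a column reference.
    table.update(condition="id = 1", set={"status": "'active'"})
    table.update(condition="id = 1", set={"status": F.lit("active")})
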
0
votes
2 answers

delta writestream .option("mergeSchema", "true") issue

I have a delta table of 3 columns with data. Now I have incoming data with 4 columns, so the DF.writeStream has to update the data location with at least 4 columns automatically, so we can recreate the table on top of the data location. Hence…
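
A hedged sketch of enabling schema merging on the Delta streaming sink so the new fourth column is added to the table schema automatically (the DataFrame name and paths are assumptions):

    # `incoming` is assumed to be the 4-column streaming DataFrame.
    (incoming.writeStream
        .format("delta")
        .outputMode("append")
        .option("mergeSchema", "true")                        # allow the extra column
        .option("checkpointLocation", "/delta/events/_checkpoint")
        .start("/delta/events"))
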
0
votes
1 answer

Delta Lake: File Not Found Exception

I am using Delta Lake to perform a merge operation, for which I am trying to convert my Parquet files, which are partitioned over time, to delta format: val source = spark.read.parquet("s3a://data-lake/source/") source .write …
Vishal
  • 1,442
  • 3
  • 29
  • 48
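
A hedged sketch of converting the partitioned Parquet directory in place instead of rewriting it, which leaves the existing files where they are and only writes a _delta_log next to them (the partition column name and type are assumptions):

    from delta.tables import DeltaTable

    # `spark` is the active SparkSession.
    DeltaTable.convertToDelta(
        spark,
        "parquet.`s3a://data-lake/source/`",
        "event_date date",   # hypothetical partition column and type
    )
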
0
votes
1 answer

Strange requests searching for _delta_log when using custom FileFormat on databricks

I observe very strange requests issued by Databricks when using a custom file format. I need to implement a custom FileFormat to read binary files in Spark SQL. I've implemented the FileFormat class (the implementation is mostly a copy/paste from…
0
votes
1 answer

what is spark.databricks.delta.snapshotPartitions configuration used for in delta lake?

I was going through Delta Lake and came across the configuration spark.databricks.delta.snapshotPartitions; however, I'm not quite sure what it is used for. I can't find it in the Delta Lake documentation either. In the Delta Lake GitHub repo I found the code below, but…
ravi malhotra
  • 703
  • 5
  • 14
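
For context, this setting controls the number of partitions Delta uses when it replays the transaction log to reconstruct the current table snapshot; a hedged one-liner for tuning it, with an arbitrary value:

    # `spark` is the active SparkSession; the default in open-source Delta is 50.
    spark.conf.set("spark.databricks.delta.snapshotPartitions", 10)
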
0
votes
2 answers

Problem combining delta.io and spark-bigquery 0.15.x-beta

I am trying to update my code to the new spark-bigquery connector 0.15.{0,1}-beta and found that the delta format is not working anymore. I cannot read or write using the delta format. Here you can find a minimal example of writing a dataframe using delta…
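
A hedged sketch of how the two connectors are usually put on the classpath together; the connector version comes from the question, the Delta and Scala versions are assumptions that must match the cluster, and the actual incompatibility may lie elsewhere:

    from pyspark.sql import SparkSession

    # Hypothetical session setup pulling both packages; versions are assumptions.
    spark = (SparkSession.builder
        .appName("delta-plus-bigquery")
        .config("spark.jars.packages",
                "io.delta:delta-core_2.12:0.7.0,"
                "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.15.1-beta")
        .getOrCreate())
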