Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It provides ACID transactions, scalable metadata handling, time travel, a unified batch and streaming source and sink, and full compatibility with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: The Delta Lake transaction log records details about every change made to the data, providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
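For orientation, here is a minimal sketch of the time-travel and update/delete features listed above, assuming a SparkSession (spark) already configured with the Delta Lake package and using a hypothetical /tmp/delta/events path:

    from delta.tables import DeltaTable

    # Write a small DataFrame as a Delta table (path is a hypothetical example).
    df = spark.range(0, 5).withColumnRenamed("id", "event_id")
    df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

    # Time travel: read the table as of an earlier version.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")

    # Updates and deletes through the DeltaTable API.
    events = DeltaTable.forPath(spark, "/tmp/delta/events")
    events.delete("event_id = 4")
    events.update(condition="event_id = 3", set={"event_id": "event_id + 100"})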
1226 questions
1
vote
2 answers

google dataproc - image version 2.0.x how to downgrade the pyspark version to 3.0.1

Using dataproc image version 2.0.x in Google Cloud since delta 0.7.0 is available in this dataproc image version. However, this dataproc instance comes with PySpark 3.1.1 by default, and Apache Spark 3.1.1 has not been officially released yet. So there is…
Rak
  • 196
  • 2
  • 9
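A sketch, not a definitive setup, of pairing Delta Lake 0.7.0 with Spark 3.0.x: assuming PySpark 3.0.1 has been installed on the cluster (for example via pip), the session just needs the matching Delta package and extensions.

    from pyspark.sql import SparkSession

    # Delta Lake 0.7.0 targets Spark 3.0.x, so the PySpark version must match it.
    spark = (
        SparkSession.builder
        .appName("delta-0.7.0-on-spark-3.0.1")
        .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )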
1
vote
0 answers

Databricks CDC (Change Data Capture) implementation

I am new to Databricks and want to implement incremental loading in Databricks, reading and writing data from Azure Blob Storage. I came across the CDC method in Databricks. I am saving the data in delta format and also creating tables while writing the…
Pardeep Naik
  • 99
  • 1
  • 12
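A minimal sketch of the usual incremental-load pattern, assuming hypothetical paths and an id key column: merge each new batch from Blob Storage into the Delta target.

    from delta.tables import DeltaTable

    # Hypothetical source path and key column, for illustration only.
    updates_df = spark.read.parquet("wasbs://landing@mystorage.blob.core.windows.net/batch")
    target = DeltaTable.forPath(spark, "/mnt/delta/target")

    (target.alias("t")
     .merge(updates_df.alias("s"), "t.id = s.id")
     .whenMatchedUpdateAll()      # update rows that already exist
     .whenNotMatchedInsertAll()   # insert new rows
     .execute())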
1
vote
1 answer

Azure Databricks : Mount delta table used in another workspace

Currently I have an Azure Databricks instance where I have the following: myDF.withColumn("created_on", current_timestamp())\ .writeStream\ .format("delta")\ .trigger(processingTime=…
FEST
  • 813
  • 2
  • 14
  • 37
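A sketch of the simplest way to reach the same data from a second workspace, assuming it can authenticate to the same storage account (the path below is hypothetical): read the Delta table directly by its storage path, as a batch or streaming source.

    # Hypothetical storage path shared by both workspaces.
    path = "wasbs://mycontainer@mystorage.blob.core.windows.net/delta/events"

    # Batch read of the Delta table written by the other workspace.
    df = spark.read.format("delta").load(path)

    # The same path also works as a streaming source.
    stream_df = spark.readStream.format("delta").load(path)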
1
vote
0 answers

Delta Lake: How to merge with both SCD type 2 and automatic schema evolution enabled?

The Delta Lake documentation states that to use automatic schema evolution, one has to stick with the updateAll() and insertAll() methods when using Delta merge, i.e. you can't use sub-expressions/conditions to change column values…
Jibby
  • 11
  • 2
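For context, a sketch of the schema-evolution side of the problem, assuming a hypothetical customer_id key: automatic schema evolution must be enabled via a Spark conf, and per the documentation it only applies to the *All clauses, which is exactly what conflicts with an SCD type 2 close-out that needs explicit column expressions.

    from delta.tables import DeltaTable

    # Automatic schema evolution for MERGE has to be enabled explicitly.
    spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

    # Hypothetical target table and incoming batch.
    target = DeltaTable.forPath(spark, "/mnt/delta/dim_customer")
    source_df = spark.read.format("delta").load("/mnt/delta/staging_customers")

    (target.alias("t")
     .merge(source_df.alias("s"), "t.customer_id = s.customer_id AND t.is_current = true")
     .whenMatchedUpdateAll()      # schema evolution only works with the *All variants
     .whenNotMatchedInsertAll()
     .execute())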
1
vote
1 answer

Read file path from Kafka topic and then read file and write to DeltaLake in Structured Streaming

I have a use case where the file paths of JSON records stored in S3 arrive as Kafka messages. I have to process the data using Spark Structured Streaming. The design I came up with is as follows: In kafka Spark structured…
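One common way to implement this, sketched here with hypothetical broker, topic, and sink names: treat the Kafka messages as a stream of paths and resolve them inside foreachBatch.

    def process_batch(batch_df, batch_id):
        # Each Kafka message value is assumed to carry one S3 path to a JSON file.
        paths = [row.path for row in
                 batch_df.selectExpr("CAST(value AS STRING) AS path").collect()]
        if paths:
            spark.read.json(paths) \
                .write.format("delta").mode("append").save("/mnt/delta/records")

    (spark.readStream
     .format("kafka")
     .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
     .option("subscribe", "file-paths")                   # hypothetical topic
     .load()
     .writeStream
     .foreachBatch(process_batch)
     .option("checkpointLocation", "/mnt/checkpoints/file-paths")
     .start())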
1
vote
0 answers

Databricks ConvertToDelta - Parquet table to Delta - "AssertionError: assertion failed: File name collisions found"

sampleDf = spark.createDataFrame([(1, 'A', 2021, 1, 5),(1, 'B', 2021, 1, 6),(1, 'C', 2021, 1, 7),],['msg_id', 'msg', 'year', 'month', 'day']) sampleDf.show() sampleDf.write.format("parquet").option("path",…
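For reference, the conversion itself is done with DeltaTable.convertToDelta (or the CONVERT TO DELTA SQL command), passing the partition schema of the Parquet directory; a sketch with a hypothetical path:

    from delta.tables import DeltaTable

    # Convert a partitioned Parquet directory in place; the partition schema must be given.
    DeltaTable.convertToDelta(
        spark,
        "parquet.`/mnt/data/sample`",      # hypothetical path
        "year INT, month INT, day INT",
    )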
1
vote
1 answer

Read the last delta partition without read all the delta

I need to automatically read a Delta table, and I only need to read the last partition that was created. The whole table is big. The table is partitioned by yyyy and mm. val df = spark.read.format("delta").load("url_delta").where(s"yyyy=${yyyy} and…
AFC
  • 29
  • 5
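One way to approach this, sketched under the assumption that the table is partitioned by yyyy and mm as in the question: first look up the latest (yyyy, mm) pair, then load with a partition filter so pruning limits the read to that partition's files.

    from pyspark.sql import functions as F

    path = "url_delta"  # placeholder path from the question

    # Find the most recent (yyyy, mm) pair by reading only the partition columns.
    latest = (spark.read.format("delta").load(path)
              .agg(F.max(F.struct("yyyy", "mm")).alias("p"))
              .collect()[0]["p"])

    # The partition filter is pushed down, so only the last partition's files are read.
    df = (spark.read.format("delta").load(path)
          .where((F.col("yyyy") == latest["yyyy"]) & (F.col("mm") == latest["mm"])))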
1
vote
1 answer

Am I creating a Bronze or a Silver table?

From what I understand, a Bronze table in the Delta Lake architecture represents the raw and (more or less) unmodified data in table format. Does this mean that I also shouldn't partition the data for the Bronze table? You could see partitioning as…
trallnag
  • 2,041
  • 1
  • 17
  • 33
1
vote
1 answer

databricks delta lake file extension is parquet

May I know whether it is correct that the Delta Lake file extension is *.snappy.parquet? I use the code df.write.format('delta').save(blobpath). Does anyone have an idea?
mytabi
  • 639
  • 2
  • 12
  • 28
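A quick way to check, assuming a Databricks notebook where dbutils is available and blobpath is the path used in the write above: list the table directory and look at the file names.

    # The directory contains Parquet data files (typically *.snappy.parquet, since
    # Parquet's default compression is Snappy) plus the _delta_log transaction log.
    for f in dbutils.fs.ls(blobpath):
        print(f.name)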
1
vote
2 answers

How can I read data from delta lib using SparkR?

I couldn't find any reference on how to access data from Delta using SparkR, so I tried it myself. First, I created a dummy dataset in Python: from pyspark.sql.types import StructType, StructField, StringType, IntegerType data2 =…
Alex
  • 73
  • 1
  • 8
1
vote
0 answers

How to use COPY INTO command?

I'm currently trying to implement a number of Delta Lake tables for storing our 'foundation' data. These tables will be built from delta data ingested into our 'raw' zone in our data lake in the following…
Glyngineer
  • 31
  • 2
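For reference, a sketch of the command with hypothetical table and path names, issued through spark.sql: COPY INTO loads files into an existing Delta table and skips files it has already ingested, which suits repeated loads from a raw zone.

    # Hypothetical target table and source path in the raw zone.
    spark.sql("""
        COPY INTO foundation.customers
        FROM 'abfss://raw@mydatalake.dfs.core.windows.net/customers/'
        FILEFORMAT = PARQUET
    """)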
1
vote
1 answer

Better Approach than If

I have some optional parameters in my application and I have to build the merge command with these parameters, so now I'm using if. class MergeBuilder(mergeBuilderConfig: MergeBuilderConfig) { def makeMergeCommand(): DeltaMergeBuilder = { val…
1
vote
1 answer

Registering a cloud data source as global table in Databricks without copying

Given that I have a Delta table in Azure storage: wasbs://mycontainer@myawesomestorage.blob.core.windows.net/mydata This is available from my Databricks environment. I now wish to have this data available through the global tables, automatically…
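One way to do this without copying the data, sketched with a hypothetical table name: register an external table whose location is the existing Delta path, so only the metastore entry is created.

    # An external table only records the location in the metastore; no data is copied.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS mydata_global
        USING DELTA
        LOCATION 'wasbs://mycontainer@myawesomestorage.blob.core.windows.net/mydata'
    """)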
1
vote
1 answer

External Table on DELTA format files in ADLS Gen 1

We have a number of Databricks DELTA tables created on ADLS Gen1, and there are also external tables built on top of each of those tables in one of the Databricks workspaces. Similarly, I am trying to create the same sort of external tables on the same…
Shankar
  • 571
  • 14
  • 26
1
vote
0 answers

merge into deltalake table updates all rows

I'm trying to update a Delta Lake table using a Spark DataFrame. What I want to do is update all rows that differ between the Spark DataFrame and the Delta Lake table, and insert all rows that are missing from the Delta Lake table. I tried…
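A sketch of one common fix, assuming hypothetical id and value columns and paths: add a condition to the matched clause so that only rows that actually differ are rewritten.

    from delta.tables import DeltaTable

    # Hypothetical target table and incoming DataFrame.
    target = DeltaTable.forPath(spark, "/mnt/delta/target")
    source_df = spark.read.format("delta").load("/mnt/delta/updates")

    (target.alias("t")
     .merge(source_df.alias("s"), "t.id = s.id")
     .whenMatchedUpdateAll(condition="t.value <> s.value")  # skip unchanged rows
     .whenNotMatchedInsertAll()
     .execute())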