Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It provides ACID transactions, scalable metadata handling, time travel, a unified batch and streaming source and sink, and full compatibility with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: The Delta Lake transaction log records details about every change made to the data, providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
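For orientation, here is a minimal sketch of the time-travel and update/delete features listed above, assuming a SparkSession (spark) already configured with the Delta Lake package and using a hypothetical /tmp/delta/events path:

    from delta.tables import DeltaTable

    # Write a small DataFrame as a Delta table (path is a hypothetical example).
    df = spark.range(0, 5).withColumnRenamed("id", "event_id")
    df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

    # Time travel: read the table as of an earlier version.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")

    # Updates and deletes through the DeltaTable API.
    events = DeltaTable.forPath(spark, "/tmp/delta/events")
    events.delete("event_id = 4")
    events.update(condition="event_id = 3", set={"event_id": "event_id + 100"})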
1226 questions
1
vote
2 answers

google dataproc - image version 2.0.x how to downgrade the pyspark version to 3.0.1

Using dataproc image version 2.0.x in Google Cloud since delta 0.7.0 is available in this dataproc image version. However, this dataproc instance comes with PySpark 3.1.1 by default, and Apache Spark 3.1.1 has not been officially released yet. So there is…
Rak
  • 196
  • 2
  • 9
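A sketch, not a definitive setup, of pairing Delta Lake 0.7.0 with Spark 3.0.x: assuming PySpark 3.0.1 has been installed on the cluster (for example via pip), the session just needs the matching Delta package and extensions.

    from pyspark.sql import SparkSession

    # Delta Lake 0.7.0 targets Spark 3.0.x, so the PySpark version must match it.
    spark = (
        SparkSession.builder
        .appName("delta-0.7.0-on-spark-3.0.1")
        .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )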
1
vote
0 answers

Databricks CDC (Change Data Capture) implementation

I am new to Databricks and want to implement incremental loading in Databricks, reading and writing data from Azure Blob Storage. I came across the CDC method in Databricks. I am saving the data in delta format and also creating tables while writing the…
Pardeep Naik
  • 99
  • 1
  • 12
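A minimal sketch of the usual incremental-load pattern, assuming hypothetical paths and an id key column: merge each new batch from Blob Storage into the Delta target.

    from delta.tables import DeltaTable

    # Hypothetical source path and key column, for illustration only.
    updates_df = spark.read.parquet("wasbs://landing@mystorage.blob.core.windows.net/batch")
    target = DeltaTable.forPath(spark, "/mnt/delta/target")

    (target.alias("t")
     .merge(updates_df.alias("s"), "t.id = s.id")
     .whenMatchedUpdateAll()      # update rows that already exist
     .whenNotMatchedInsertAll()   # insert new rows
     .execute())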
1
vote
1 answer

Azure Databricks : Mount delta table used in another workspace

Currently I have an Azure Databricks instance where I have the following: myDF.withColumn("created_on", current_timestamp())\ .writeStream\ .format("delta")\ .trigger(processingTime=…
FEST
  • 813
  • 2
  • 14
  • 37
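A sketch of the simplest way to reach the same data from a second workspace, assuming it can authenticate to the same storage account (the path below is hypothetical): read the Delta table directly by its storage path, as a batch or streaming source.

    # Hypothetical storage path shared by both workspaces.
    path = "wasbs://mycontainer@mystorage.blob.core.windows.net/delta/events"

    # Batch read of the Delta table written by the other workspace.
    df = spark.read.format("delta").load(path)

    # The same path also works as a streaming source.
    stream_df = spark.readStream.format("delta").load(path)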
1
vote
0 answers

Delta Lake: How to merge with both SCD type 2 and automatic schema evolution enabled?

The Delta Lake documentation states that to use automatic schema evolution, one has to stick with the updateAll() and insertAll() methods when using Delta merge, i.e. you can't use sub-expressions/conditions to change column values…
Jibby
  • 11
  • 2
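For context, a sketch of the schema-evolution side of the problem, assuming a hypothetical customer_id key: automatic schema evolution must be enabled via a Spark conf, and per the documentation it only applies to the *All clauses, which is exactly what conflicts with an SCD type 2 close-out that needs explicit column expressions.

    from delta.tables import DeltaTable

    # Automatic schema evolution for MERGE has to be enabled explicitly.
    spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

    # Hypothetical target table and incoming batch.
    target = DeltaTable.forPath(spark, "/mnt/delta/dim_customer")
    source_df = spark.read.format("delta").load("/mnt/delta/staging_customers")

    (target.alias("t")
     .merge(source_df.alias("s"), "t.customer_id = s.customer_id AND t.is_current = true")
     .whenMatchedUpdateAll()      # schema evolution only works with the *All variants
     .whenNotMatchedInsertAll()
     .execute())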
1
vote
1 answer

Read file path from Kafka topic and then read file and write to DeltaLake in Structured Streaming

I have a use case where the file paths of JSON records stored in S3 arrive as Kafka messages. I have to process the data using Spark Structured Streaming. The design I came up with is as follows: In kafka Spark structured…
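One common way to implement this, sketched here with hypothetical broker, topic, and sink names: treat the Kafka messages as a stream of paths and resolve them inside foreachBatch.

    def process_batch(batch_df, batch_id):
        # Each Kafka message value is assumed to carry one S3 path to a JSON file.
        paths = [row.path for row in
                 batch_df.selectExpr("CAST(value AS STRING) AS path").collect()]
        if paths:
            spark.read.json(paths) \
                .write.format("delta").mode("append").save("/mnt/delta/records")

    (spark.readStream
     .format("kafka")
     .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
     .option("subscribe", "file-paths")                   # hypothetical topic
     .load()
     .writeStream
     .foreachBatch(process_batch)
     .option("checkpointLocation", "/mnt/checkpoints/file-paths")
     .start())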
1
vote
0 answers

Databricks ConvertToDelta - Parquet table to Delta - "AssertionError: assertion failed: File name collisions found"

sampleDf = spark.createDataFrame([(1, 'A', 2021, 1, 5),(1, 'B', 2021, 1, 6),(1, 'C', 2021, 1, 7),],['msg_id', 'msg', 'year', 'month', 'day']) sampleDf.show() sampleDf.write.format("parquet").option("path",…
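For reference, the conversion itself is done with DeltaTable.convertToDelta (or the CONVERT TO DELTA SQL command), passing the partition schema of the Parquet directory; a sketch with a hypothetical path:

    from delta.tables import DeltaTable

    # Convert a partitioned Parquet directory in place; the partition schema must be given.
    DeltaTable.convertToDelta(
        spark,
        "parquet.`/mnt/data/sample`",      # hypothetical path
        "year INT, month INT, day INT",
    )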
1
vote
1 answer

Read the last delta partition without read all the delta

I need to automatically read a Delta table, and I only need to read the last partition that was created. The whole table is big. The table is partitioned by yyyy and mm. val df = spark.read.format("delta").load("url_delta").where(s"yyyy=${yyyy} and…
AFC
  • 29
  • 5
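One way to approach this, sketched under the assumption that the table is partitioned by yyyy and mm as in the question: first look up the latest (yyyy, mm) pair, then load with a partition filter so pruning limits the read to that partition's files.

    from pyspark.sql import functions as F

    path = "url_delta"  # placeholder path from the question

    # Find the most recent (yyyy, mm) pair by reading only the partition columns.
    latest = (spark.read.format("delta").load(path)
              .agg(F.max(F.struct("yyyy", "mm")).alias("p"))
              .collect()[0]["p"])

    # The partition filter is pushed down, so only the last partition's files are read.
    df = (spark.read.format("delta").load(path)
          .where((F.col("yyyy") == latest["yyyy"]) & (F.col("mm") == latest["mm"])))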
1
vote
1 answer

Am I creating a Bronze or a Silver table?

From what I understand, a Bronze table in the Delta Lake architecture represents the raw and (more or less) unmodified data in table format. Does this mean that I also shouldn't partition the data for the Bronze table? You could see partitioning as…
trallnag
  • 2,041
  • 1
  • 17
  • 33
1
vote
1 answer

databricks delta lake file extension is parquet

May I know whether it is correct that the Delta Lake file extension is *.snappy.parquet? I use the code df.write.format('delta').save(blobpath). Does anyone have an idea?
mytabi
  • 639
  • 2
  • 12
  • 28
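A quick way to check, assuming a Databricks notebook where dbutils is available and blobpath is the path used in the write above: list the table directory and look at the file names.

    # The directory contains Parquet data files (typically *.snappy.parquet, since
    # Parquet's default compression is Snappy) plus the _delta_log transaction log.
    for f in dbutils.fs.ls(blobpath):
        print(f.name)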
1
vote
2 answers

How can I read data from delta lib using SparkR?

I couldn't find any reference on how to access data from Delta using SparkR, so I tried it myself. First, I created a dummy dataset in Python: from pyspark.sql.types import StructType, StructField, StringType, IntegerType data2 =…
Alex
  • 73
  • 1
  • 8
1
vote
0 answers

How to use COPY INTO command?

I'm currently trying to implement a number of Delta Lake tables for storing our 'foundation' data. These tables will be built from delta data ingested into our 'raw' zone in our data lake in the following…
Glyngineer
  • 31
  • 2
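For reference, a sketch of the command with hypothetical table and path names, issued through spark.sql: COPY INTO loads files into an existing Delta table and skips files it has already ingested, which suits repeated loads from a raw zone.

    # Hypothetical target table and source path in the raw zone.
    spark.sql("""
        COPY INTO foundation.customers
        FROM 'abfss://raw@mydatalake.dfs.core.windows.net/customers/'
        FILEFORMAT = PARQUET
    """)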
1
vote
1 answer

Better Approach than If

I have some optional parameters in my application and I have to build the merge command with these parameters, so now I'm using if. class MergeBuilder(mergeBuilderConfig: MergeBuilderConfig) { def makeMergeCommand(): DeltaMergeBuilder = { val…
1
vote
1 answer

Registering a cloud data source as global table in Databricks without copying

Given that I have a Delta table in Azure storage: wasbs://mycontainer@myawesomestorage.blob.core.windows.net/mydata This is available from my Databricks environment. I now wish to have this data available through the global tables, automatically…
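One way to do this without copying the data, sketched with a hypothetical table name: register an external table whose location is the existing Delta path, so only the metastore entry is created.

    # An external table only records the location in the metastore; no data is copied.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS mydata_global
        USING DELTA
        LOCATION 'wasbs://mycontainer@myawesomestorage.blob.core.windows.net/mydata'
    """)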
1
vote
1 answer

External Table on DELTA format files in ADLS Gen 1

We have a number of Databricks DELTA tables created on ADLS Gen1, and there are also external tables built on top of each of those tables in one of the Databricks workspaces. Similarly, I am trying to create the same sort of external tables on the same…
Shankar
  • 571
  • 14
  • 26
1
vote
0 answers

merge into deltalake table updates all rows

I'm trying to update a Delta Lake table using a Spark DataFrame. What I want to do is update all rows that differ between the Spark DataFrame and the Delta Lake table, and insert all rows that are missing from the Delta Lake table. I tried…
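A sketch of one common fix, assuming hypothetical id and value columns and paths: add a condition to the matched clause so that only rows that actually differ are rewritten.

    from delta.tables import DeltaTable

    # Hypothetical target table and incoming DataFrame.
    target = DeltaTable.forPath(spark, "/mnt/delta/target")
    source_df = spark.read.format("delta").load("/mnt/delta/updates")

    (target.alias("t")
     .merge(source_df.alias("s"), "t.id = s.id")
     .whenMatchedUpdateAll(condition="t.value <> s.value")  # skip unchanged rows
     .whenNotMatchedInsertAll()
     .execute())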