Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with Apache Spark APIs. A minimal PySpark usage sketch follows the feature list below.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
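
A minimal PySpark sketch of the write / read / time-travel / update flow described above (the path and column names are illustrative, and the delta-spark package is assumed to already be configured on the session):

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()  # assumes the Delta Lake extensions are already configured

    # Write a DataFrame as a Delta table (Parquet data files plus a _delta_log transaction log).
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

    # Read the current version, or an earlier one via time travel.
    current = spark.read.format("delta").load("/tmp/delta/events")
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")

    # Update and delete in place through the DeltaTable API.
    tbl = DeltaTable.forPath(spark, "/tmp/delta/events")
    tbl.update(condition="id = 1", set={"value": "'updated'"})
    tbl.delete("id = 2")
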
1226 questions
0 votes · 0 answers

(PySpark) Update a delta table based on a conditional expression while iterating over a lookup df, extracting values to insert from a nested dict?

I have a mapping/lookup table/DF according to which I have to extract values from a highly nested json/dictionary. These values have to be inserted as column values to a delta table. How do I do this leveraging pyspark's parallelism? I know I can…
Takreem · 1 · 2
0 votes · 2 answers

How to aggregate over date including all prior dates

I am working with a table in Databricks Delta Lake. It gets new records appended every month. The field insert_dt indicates when the records are inserted.

| ID | Mrc | insert_dt  |
|----|-----|------------|
| 1  | 40  | 2022-01-01 |
| 2  | 30  | …
ddd · 4,665 · 14 · 69 · 125
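
For the question above, a running aggregate over a date and all prior dates is usually expressed with a window frame; a sketch assuming the columns Mrc and insert_dt from the excerpt (the table name is illustrative):

    from pyspark.sql import Window, functions as F

    # spark: the active SparkSession (predefined in Databricks notebooks)
    df = spark.read.table("my_delta_table")  # hypothetical table name

    # Total Mrc per insert_dt, then a running total over that date and all earlier dates.
    per_date = df.groupBy("insert_dt").agg(F.sum("Mrc").alias("mrc_total"))
    w = Window.orderBy("insert_dt").rowsBetween(Window.unboundedPreceding, Window.currentRow)
    result = per_date.withColumn("mrc_to_date", F.sum("mrc_total").over(w))
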
0 votes · 0 answers

Keep only latest records by unique column in table

I have a question about delta tables / delta change data feed / upsert. I have a table (main) that stores the full history of states; I use ZORDER BY a uniq column on it, and I retrieve results in about 2s with a WHERE clause on (uniq, date) from the table.…
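
A sketch of the usual pattern for this: deduplicate on the unique key and upsert into a separate "latest" table (paths are illustrative; uniq and date are the columns named in the question):

    from delta.tables import DeltaTable
    from pyspark.sql import Window, functions as F

    # spark: the active SparkSession (predefined in Databricks notebooks)
    history = spark.read.format("delta").load("/path/to/main")
    w = Window.partitionBy("uniq").orderBy(F.col("date").desc())
    latest = history.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")

    # Upsert into a smaller table that only keeps the newest record per uniq.
    target = DeltaTable.forPath(spark, "/path/to/latest")
    (target.alias("t")
           .merge(latest.alias("s"), "t.uniq = s.uniq")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())
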
0 votes · 1 answer

Encryption and decryption of a streaming delta table in Azure Databricks

My goal is to encrypt and decrypt a streaming delta table in Azure Databricks using Python. I am trying to find a solution for encrypting and decrypting a streaming delta table in Python; so far I am able to achieve column level encryption…
0 votes · 1 answer

How do you delete rows in a Delta table using SQL?

I am using Databricks. In my notebook, I have a table (Delta table) and I want to delete all rows where the topic is 'CICD' from my table. I want to use SQL to do it.
Climbs_lika_Spyder · 6,004 · 3 · 39 · 53
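
For a Delta table this is a single SQL DELETE statement; my_table is an illustrative table name and topic is the column from the question:

    # Python cell; the same statement also works directly in a %sql cell.
    spark.sql("DELETE FROM my_table WHERE topic = 'CICD'")
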
0 votes · 2 answers

How can we view the column names and other metadata for a [DataBricks] Delta-Table?

A Spark DataFrame has the .columns attribute: dataFrame.columns. A DeltaTable does not. Note that the latter is backed by a parquet file/directory, and parquet is self-describing, so the column info is available at least in the files…
WestCoastProjects · 58,982 · 91 · 316 · 560
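
A sketch of the usual workarounds (the path is illustrative): a DeltaTable can be turned back into a DataFrame for column names, and DESCRIBE DETAIL returns table-level metadata on Databricks:

    from delta.tables import DeltaTable

    # spark: the active SparkSession (predefined in Databricks notebooks)
    dt = DeltaTable.forPath(spark, "/path/to/table")

    # A DeltaTable wraps a DataFrame, so the column names come from toDF().
    print(dt.toDF().columns)

    # Table-level metadata: location, format, partition columns, size, and so on.
    spark.sql("DESCRIBE DETAIL delta.`/path/to/table`").show(truncate=False)
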
0 votes · 0 answers

Delta Tables do not retain column names?

I have started with a DataFrame. Strangely I need to extract the paths for the files to convert it to a DeltaTable. Even more strangely, the column names are lost on the DeltaTable. What is the thinking behind this? Do we need to always pair up…
WestCoastProjects · 58,982 · 91 · 316 · 560
0 votes · 0 answers

Apply ForEach (Structured Streaming Programming) to move data from one delta table to another

Let's say that I have a saved delta table which was processed using ForEachBatch to apply transformations, finally producing a final delta table (let's call this table Table1). However, for some requirements the data of this table needs to be…
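
A sketch of streaming from one Delta table into another with foreachBatch (paths, checkpoint location, and the merge key are illustrative):

    from delta.tables import DeltaTable

    # spark: the active SparkSession (predefined in Databricks notebooks)
    def upsert_into_table2(batch_df, batch_id):
        # Runs once per micro-batch; apply any transformation here, then merge into Table2.
        target = DeltaTable.forPath(spark, "/path/to/table2")
        (target.alias("t")
               .merge(batch_df.alias("s"), "t.id = s.id")
               .whenMatchedUpdateAll()
               .whenNotMatchedInsertAll()
               .execute())

    (spark.readStream.format("delta").load("/path/to/table1")
          .writeStream
          .foreachBatch(upsert_into_table2)
          .option("checkpointLocation", "/path/to/_checkpoints/table1_to_table2")
          .start())
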
0 votes · 1 answer

How to set up Kafka as a dependency when using Delta Lake in PySpark?

This is the code to set up Delta Lake as part of a regular Python script, according to their documentation:

    import pyspark
    from delta import *

    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.sql.extensions", …
Gustavo Puma · 993 · 12 · 27
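
If the session is built with configure_spark_with_delta_pip, newer delta-spark releases accept an extra_packages argument, which is the usual way to add the Kafka connector without overwriting the Delta jars; a sketch, where the Kafka artifact version is illustrative and must match your Spark/Scala build:

    import pyspark
    from delta import configure_spark_with_delta_pip

    builder = (pyspark.sql.SparkSession.builder.appName("MyApp")
               .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
               .config("spark.sql.catalog.spark_catalog",
                       "org.apache.spark.sql.delta.catalog.DeltaCatalog"))

    # extra_packages appends Maven coordinates alongside the Delta jars.
    spark = configure_spark_with_delta_pip(
        builder,
        extra_packages=["org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.2"],
    ).getOrCreate()
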
0 votes · 0 answers

No operation for SHOW PARTITIONS for a partitioned delta table with Delta version 0.5.0 and Spark 2.4.4

Apache Spark 2.4.4 + io.delta:delta-core_2.12:0.5.0. I have created a fully qualified Delta table using the command below:

    CREATE TABLE DB.table_name USING DELTA LOCATION…
Spandana · 13 · 1 · 1 · 5
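
SHOW PARTITIONS is not implemented for Delta tables in that release; a common workaround is to derive the partition values from the data itself (the path and partition column below are illustrative):

    # spark: the active SparkSession
    # List partition values by selecting the partition column(s) directly.
    df = spark.read.format("delta").load("/path/to/table_location")
    df.select("partition_col").distinct().orderBy("partition_col").show()
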
0 votes · 1 answer

Databricks - How to get the current version of delta table parquet files

Say I have a table called data and it's some time-series. It's stored like this:

    /data
      /date=2022-11-30
        /region=usa
          part-000001.parquet
          part-000002.parquet

Where I have two partition keys and two partitions for the…
VocoJax · 1,469 · 1 · 13 · 19
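
A sketch of two ways to inspect the current snapshot (the path is illustrative): inputFiles() lists only the Parquet files that back the latest version, while history() gives the version number itself:

    from delta.tables import DeltaTable

    # spark: the active SparkSession (predefined in Databricks notebooks)
    path = "/data"  # illustrative

    # Files backing the current snapshot; superseded files remain on storage until VACUUM.
    current_files = spark.read.format("delta").load(path).inputFiles()

    # The current version number comes from the most recent history entry.
    latest_version = DeltaTable.forPath(spark, path).history(1).collect()[0]["version"]
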
0 votes · 0 answers

Set version (VERSION AS OF) dynamically from return of a subquery

We have a business request to compare the evolution of a certain delta table. We would like to compare the latest version of the table with the previous one using Delta time travel. The main issue we are facing is to retrieve programmatically using…
nameziane · 1 · 1
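
VERSION AS OF expects a literal rather than a subquery, so the usual workaround is to look the version numbers up first and interpolate them (my_table is an illustrative table name):

    # spark: the active SparkSession (predefined in Databricks notebooks)
    # Fetch the two most recent versions from the table history, then time-travel to each.
    rows = spark.sql("DESCRIBE HISTORY my_table LIMIT 2").collect()
    latest, previous = rows[0]["version"], rows[1]["version"]

    current_df = spark.sql(f"SELECT * FROM my_table VERSION AS OF {latest}")
    previous_df = spark.sql(f"SELECT * FROM my_table VERSION AS OF {previous}")
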
0 votes · 0 answers

Databricks merge optimisation issue where keys are GUIDs

I am importing external data into a Delta lake through an entity that has approximately 20M rows. Not huge, but OK. The issue is that the key to this data set is a GUID, so the normal Delta Merge operation is notoriously inefficient, as the key is in…
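
One common mitigation is to add a selective, file-skipping-friendly predicate to the merge condition so Delta does not scan the whole table for effectively random GUID keys; a sketch assuming a hypothetical ingest_date column and illustrative paths:

    from delta.tables import DeltaTable

    # spark: the active SparkSession (predefined in Databricks notebooks)
    target = DeltaTable.forPath(spark, "/path/to/target")
    updates = spark.read.format("delta").load("/path/to/staging")

    # Restricting matches to a recent date range lets Delta prune unrelated files.
    (target.alias("t")
           .merge(updates.alias("s"),
                  "t.guid = s.guid AND t.ingest_date >= date_sub(current_date(), 30)")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())
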
0 votes · 0 answers

Azure Databricks Delta table Access through API

There is a requirement to expose data present in Azure Databricks delta tables through a REST API. These APIs will be used by both internal & external users. Query: what are the best possible options available to expose delta table data through a REST API…
0 votes · 0 answers

Reading Delta files from Azure Data Lake into a Jupyter Notebook

I am trying to read a Delta file into a Jupyter Notebook that is running locally. I have already gained access to Azure Data Lake (ADLS), but I struggle to access the Delta file. I read in this article that this method…
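
One lightweight option for a locally running notebook is the Spark-free deltalake package (delta-rs); the URL is illustrative and the exact storage_options key names depend on the package version, so treat this as a sketch:

    from deltalake import DeltaTable

    table = DeltaTable(
        "abfss://container@mystorageaccount.dfs.core.windows.net/path/to/delta",
        storage_options={
            "azure_storage_account_name": "mystorageaccount",  # assumed key names;
            "azure_storage_account_key": "<access-key>",       # check the deltalake docs
        },
    )
    df = table.to_pandas()  # or table.to_pyarrow_table()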