Questions tagged [delta-lake]

Delta Lake is an open source storage layer that brings ACID transactions to Apache Spark. It also provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and full compatibility with the Apache Spark APIs. A minimal usage sketch follows the feature list below.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and because transactions are lacking, data engineers have to go through a tedious process to ensure data integrity. Delta Lake brings ACID transactions to your data lakes and provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: The Delta Lake transaction log records details about every change made to the data, providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
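
A minimal PySpark sketch of how these features fit together, assuming the delta-spark pip package is installed; the table path is a hypothetical local directory:

    from delta import configure_spark_with_delta_pip
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    # Configure a local SparkSession with the Delta Lake extensions.
    builder = (
        SparkSession.builder.appName("delta-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    path = "/tmp/delta/events"  # hypothetical location

    # ACID write in the open, Parquet-based Delta format.
    spark.range(0, 5).toDF("key").write.format("delta").mode("overwrite").save(path)

    # Time travel: read the table as of an earlier version.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

    # Updates and deletes through the DeltaTable API.
    tbl = DeltaTable.forPath(spark, path)
    tbl.delete("key = 4")

    # Audit history: every change is recorded in the transaction log.
    tbl.history().select("version", "operation", "operationMetrics").show(truncate=False)

The history() call at the end is what backs the audit-history and time-travel features described above.
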
1226 questions
0 votes, 1 answer

Delta table merge operation logs: output is not the correct number of updated records?

I am performing a merge operation on my Delta table in Spark. I have an existing Delta table that already contains some records. I then created another dataframe from a CSV file, added one new record to it, and updated one record. Please check below…
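
For context on questions like this one, MERGE reports its row counts in the table history's operationMetrics map. A sketch, assuming an existing SparkSession spark, a target table path, and an incoming dataframe updates_df joined on id (all hypothetical):

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/tmp/delta/target")  # hypothetical path

    (target.alias("t")
        .merge(updates_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()       # note: with no condition, every matched row
        .whenNotMatchedInsertAll()    # counts as "updated", even if values are unchanged
        .execute())

    # numTargetRowsUpdated / numTargetRowsInserted for the last MERGE:
    target.history(1).select("operation", "operationMetrics").show(truncate=False)
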
0 votes, 1 answer

Copy a table in Databricks

How can one copy a Delta table in Databricks from environment1 to environment2? CREATE TABLE environment2table SHALLOW CLONE environment1table LOCATION 'dbfs:/environment2table/'
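
For reference, the shallow-clone statement from the excerpt, with the location quoted, wrapped in PySpark (the table names and dbfs path are the question's own placeholders):

    # Shallow clone copies only the Delta metadata; data files stay in the source
    # table until the clone is written to.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS environment2table
        SHALLOW CLONE environment1table
        LOCATION 'dbfs:/environment2table/'
    """)
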
0 votes, 0 answers

Python library deltalake "crashes" with no error message; how can I find out what is wrong?

I have a piece of code that does not run, but I do not know why. It fails somewhere in the library, yet no error seems to be thrown, so I cannot tell what goes wrong. The pseudo code: print('test') try: df =…
user180146 • 895 • 2 • 9 • 18
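
One low-effort way to get a signal out of a silent native failure like this is Python's built-in faulthandler module; a sketch, assuming the deltalake (delta-rs) package and a hypothetical table path:

    import faulthandler

    # Print a Python-level traceback even if the process dies in native code.
    faulthandler.enable()

    from deltalake import DeltaTable

    print("test")
    dt = DeltaTable("/tmp/delta/events")  # hypothetical path
    df = dt.to_pandas()
    print(df.head())
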
0 votes, 1 answer

Special characters missing in target while copying data with Azure Data Factory from SQL Server

In SQL Server we have special characters in the data, like ~$534, but after copying to Delta Lake with ADF some special characters are missing in the target.
bigdata techie • 147 • 1 • 11
0 votes, 1 answer

Hive TBLPROPERTIES equivalent for parquet files in pyspark

I'm converting HQL scripts to pyspark. HQL code: show tblproperties tblName ('transient_lastDdlTime'). I want the equivalent of the "transient_lastDdlTime" property for parquet files. I know there is a way for Delta tables using the Delta Lake APIs, but is there a…
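
Plain parquet directories have no table properties, so one rough stand-in for transient_lastDdlTime is the latest file modification time, read through the Hadoop FileSystem API (the path is hypothetical, and spark._jvm / spark._jsc are internal PySpark handles):

    # Approximate a "last modification time" for a parquet path by taking the
    # newest modification timestamp among its files (epoch milliseconds).
    hadoop = spark._jvm.org.apache.hadoop
    fs = hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

    statuses = fs.listStatus(hadoop.fs.Path("/data/my_parquet_table"))
    last_modified_ms = max(s.getModificationTime() for s in statuses)
    print(last_modified_ms)
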
0 votes, 1 answer

How to allow SQL queries from clients on Azure Data Lake

I use Azure Data Lake Gen2, I transform data with Databricks, and I have Delta tables that are consumed in Power BI. But the clients also need to be able to query my tables in SQL. What is the best practice? Is it possible with Databricks, or do I have to…
0 votes, 1 answer

Delta Lake deletes in AWS Glue

We are trying to delete data from a Delta lake using an AWS Glue job. Please suggest why the merge condition is not working for the delete. This works fine if my delete_condition is like changes.flag = True; however, it does not perform any deletes if…
Reema • 57 • 3
0 votes, 2 answers

Delta Lake - merge is creating too many files per partition

I have a table that ingests new data every day through a merge. I'm currently migrating from ORC to the Delta file format and I stumbled on a problem when processing the following simple merge operation: DeltaTable …
Ismail H • 4,226 • 2 • 38 • 61
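
A common mitigation for this is to compact after the merge or, if the setting is available in the Delta Lake version in use (an assumption), to let Delta repartition the merge output by the table's partition columns before writing:

    # Assumption: this configuration exists in your Delta Lake version; it
    # repartitions MERGE output by the table's partition columns before writing,
    # which reduces the number of small files produced per partition.
    spark.conf.set("spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "true")

    # Alternatively, compact the affected partitions after the merge (Delta 2.0+ / Databricks);
    # table name and partition predicate are hypothetical.
    spark.sql("OPTIMIZE my_table WHERE date = '2023-01-01'")
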
0 votes, 1 answer

How to add complex logic to updateExpr in a Delta table

I am updating a Delta table with some incremental records. Two of the fields require just a plain update, but there is another one, a collection of maps, for which I would like to concatenate all the existing values instead of doing a…
Ignacio Alorre • 7,307 • 8 • 57 • 94
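
In the Python merge builder the counterpart of updateExpr is the set argument of whenMatchedUpdate, which accepts SQL expression strings, so a map column can be combined with map_concat. A sketch with hypothetical column names (id, col_a, col_b, maps_col):

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/tmp/delta/target")  # hypothetical path

    (target.alias("t")
        .merge(increments_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdate(set={
            "col_a": "s.col_a",                               # plain updates
            "col_b": "s.col_b",
            # merge the existing map with the incoming one; duplicate keys raise an
            # error unless spark.sql.mapKeyDedupPolicy is set to LAST_WIN
            "maps_col": "map_concat(t.maps_col, s.maps_col)"
        })
        .whenNotMatchedInsertAll()
        .execute())
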
0 votes, 0 answers

Difference in counts on a Delta table immediately after a write operation

I have a Databricks job that writes to a certain Delta table. After the write has completed, the job calls another function that reads and calculates some metrics (counts, to be specific) on top of the same Delta table. But the counts/metrics…
0 votes, 2 answers

How to add a column with a batch_Id value to a delta table using a running pyspark streaming job?

I'm trying to add a batch Id to each row in the current batch run and then write it to a delta table. A batch in my case is one CSV file with multiple values. I generate my batch Id value with a function. I can successfully add the correct batch Id…
Johan • 1 • 1
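
The usual pattern for this is foreachBatch, whose callback receives the micro-batch dataframe together with its batch id, so the id can be stamped onto each row before the Delta write. A sketch with hypothetical paths and schema:

    from pyspark.sql.functions import lit

    def write_with_batch_id(batch_df, batch_id):
        # Stamp every row of this micro-batch with its batch id, then append to Delta.
        (batch_df.withColumn("batch_id", lit(batch_id))
            .write.format("delta").mode("append").save("/tmp/delta/target"))

    (spark.readStream.format("csv")
        .option("header", "true")
        .schema("id INT, value STRING")             # hypothetical schema
        .load("/tmp/incoming/")                     # hypothetical source directory
        .writeStream
        .foreachBatch(write_with_batch_id)
        .option("checkpointLocation", "/tmp/checkpoints/target")
        .start())
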
0 votes, 1 answer

Pyspark Delta Lake write performance (Spark driver stopped)

I need to create a Delta Lake file containing more than 150 KPIs. Since we have 150 calculations, we had to create roughly 60-odd data frames. Finally, the individual data frames are joined into one final data frame. This final data frame has…
Prakazz • 421 • 1 • 8 • 21
0 votes, 0 answers

Last few tasks of a Spark job are taking a long time to complete

I was doing a UNION ALL of two tables in Databricks Spark Scala. But when writing the resulting dataframe into a delta table, it takes a very long time to complete. Below is the code snippet I used: val df1 = spark.sql(s""" select nb.*, facl.id as…
venkat • 111 • 1 • 1 • 11
0 votes, 1 answer

Pyspark dataframe on Databricks - performance issues when writing to JDBC

I need to write a pyspark dataframe to an Azure SQL database. The dataframe has 300,000,000 records and the JDBC connector is not able to do it in a short time. The dataframe is a select from a delta table joined with SQL lookups. What I've done: partition delta…
inspiredd • 195 • 2 • 11
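
For what it's worth, JDBC write throughput is usually governed by the number of write partitions and the JDBC batch size; a sketch with hypothetical connection details and illustrative values:

    # Each output partition opens its own JDBC connection; batchsize controls how
    # many rows are sent per round trip.
    (df.repartition(32)
        .write
        .format("jdbc")
        .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
        .option("dbtable", "dbo.target_table")
        .option("user", "my_user")
        .option("password", "my_password")
        .option("batchsize", 100000)
        .mode("append")
        .save())
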
0 votes, 1 answer

Optimizing a Synapse delta lake table is not reducing the number of files

I created a simple Synapse delta lake table via: CREATE TABLE IF NOT EXISTS db1.tbl1 ( id INT NOT NULL, name STRING NOT NULL ) USING DELTA. I've merged rows of data into it multiple times, such that I now see a total of 15 parquet files in the…
siddarfer • 162 • 1 • 13
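
One thing worth noting for this kind of question: OPTIMIZE writes new compacted files but keeps the old ones in the table directory for time travel, so the file count on disk only drops after a VACUUM past the retention period. A sketch using the question's table name:

    # Compact small files into larger ones (old files remain on disk for time travel).
    spark.sql("OPTIMIZE db1.tbl1")

    # Physically remove files no longer referenced by the table and older than the
    # retention period (default 7 days / 168 hours).
    spark.sql("VACUUM db1.tbl1 RETAIN 168 HOURS")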