Questions tagged [delta-lake]

Delta Lake is an open source storage layer that brings ACID transactions to Apache Spark. It provides ACID transactions, scalable metadata handling, time travel, a unified batch and streaming source and sink, and full compatibility with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets (a short PySpark sketch of these operations follows this list). This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
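For illustration, a minimal PySpark sketch of the delete, update and merge operations described above; the table paths, column names and predicates are placeholders, not part of the delta.io text:

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    # Spark session with the Delta Lake extensions enabled
    spark = (SparkSession.builder
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    dt = DeltaTable.forPath(spark, "/tmp/delta/events")   # placeholder path

    # Delete rows matching a predicate
    dt.delete("event_date < '2020-01-01'")

    # Update rows in place
    dt.update(condition="status = 'pending'", set={"status": "'open'"})

    # Merge (upsert) a batch of changes by key
    updates = spark.read.format("delta").load("/tmp/delta/events_updates")
    (dt.alias("t")
       .merge(updates.alias("s"), "t.id = s.id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())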
1226 questions
0 votes, 1 answer

How to drop a column from a Delta Table stored in ADLS as Parquet files?

I have a ton of Parquet data stored in ADLS with Delta Lake as an abstraction over it. However, I've run into an issue where some columns have incorrect datatypes due to using Spark's inferSchema, and since Delta Lake will throw errors on mismatched…
ROODAY • 756 • 1 • 6 • 23
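One common workaround, sketched below in PySpark, is to rewrite the table without the offending column and allow the narrower schema to replace the old one; the ADLS path and column name here are placeholders:

    # Read the existing Delta table, drop the column, and overwrite the same path.
    # Delta overwrites are transactional, so reading and rewriting the same table is safe.
    path = "abfss://container@account.dfs.core.windows.net/delta/my_table"  # placeholder
    df = spark.read.format("delta").load(path)
    (df.drop("bad_column")                       # placeholder column name
       .write.format("delta")
       .mode("overwrite")
       .option("overwriteSchema", "true")        # let the new schema replace the old one
       .save(path))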
0 votes, 1 answer

How to append records to delta table in foreachBatch?

I am using foreachBatch to write streaming data into multiple targets, and it's working fine for the first micro-batch execution. When it tries to run the second micro-batch, it fails with the below error. "StreamingQueryException: Query [id =…
Nikesh • 47 • 6
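A minimal sketch of the usual pattern, assuming stream_df is an existing streaming DataFrame and the paths are placeholders: each micro-batch is appended to its targets inside the foreachBatch function, and the stream keeps its own checkpoint location:

    # Write each micro-batch to two Delta targets in append mode (placeholder paths)
    def write_targets(batch_df, batch_id):
        batch_df.persist()      # reuse the micro-batch across both writes
        batch_df.write.format("delta").mode("append").save("/tmp/delta/target_a")
        batch_df.write.format("delta").mode("append").save("/tmp/delta/target_b")
        batch_df.unpersist()

    query = (stream_df.writeStream
             .foreachBatch(write_targets)
             .option("checkpointLocation", "/tmp/checkpoints/multi_target")
             .start())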
0 votes, 1 answer

IDENTITY column duplication when using BY DEFAULT parameter for Azure Delta tables

Duplicate records are ingested into Azure Delta tables when using the BY DEFAULT parameter. I have followed the steps below. 1) Created a table with an identity column GENERATED BY DEFAULT 2) Inserted 2 records with ids 1 and 2 3) When inserting the third row without…
dileepVikram • 890 • 4 • 14 • 30
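For context, a hedged sketch of the scenario the question describes; the table and column names are illustrative. GENERATED BY DEFAULT allows callers to supply explicit ids, so values produced later by the identity sequence can collide with them:

    spark.sql("""
        CREATE TABLE demo_identity (
            id BIGINT GENERATED BY DEFAULT AS IDENTITY,
            name STRING
        ) USING DELTA
    """)

    # Explicit ids bypass the identity sequence ...
    spark.sql("INSERT INTO demo_identity (id, name) VALUES (1, 'a'), (2, 'b')")

    # ... so a later insert that omits id may generate a value
    # that duplicates one already present in the table.
    spark.sql("INSERT INTO demo_identity (name) VALUES ('c')")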
0 votes, 1 answer

How to delete a Partition in unmanaged delta lake table

How to delete a partition in an unmanaged/external Delta Lake table? val deltaTable = DeltaTable.forName("country_people") val partitionColumn = "country" val partitionValue = "Argentina" // Delete the partition data val deltaTable =…
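A hedged PySpark equivalent of the Scala snippet above, deleting one partition's rows from the external table; the table, column and value are taken from the excerpt:

    from delta.tables import DeltaTable

    delta_table = DeltaTable.forName(spark, "country_people")
    delta_table.delete("country = 'Argentina'")

    # The underlying data files are only removed physically by a later VACUUM,
    # once they fall outside the retention period.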
0 votes, 1 answer

How do I retrieve data from a table which is linked with a delta file?

I have a table created as a Delta table. I deleted the Delta files from the Azure container, but I still have the table in the Databricks database. How can I retrieve the data? I tried reading the table from the database but it gives me an error: spark.sql("select *…
Siddhu • 19 • 4
0 votes, 1 answer

Data Factory does not copy all rows to SINK destination in DELTA LAKE

I'm going through an unusual situation with Data Factory. I'm copying the fields using the Copy component, where the source is a SQL Server database and the destination is a table in Delta Lake format. From the image I put together here, the number…
user3661384 • 524 • 7 • 18
0 votes, 1 answer

In Data Lake architecture, is there a way to store or flag bad or unclean data?

We're building out a lakehouse on top of an Azure Data Lake Gen 2, and we're interested in an approach that either "pulls out" or "flags" bad data from our pipelines. Are there any industry best practices or case studies we can look at to replicate when we go to…
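One widely used pattern, sketched below in PySpark under assumed validation rules and paths, and assuming raw_df is an existing DataFrame, is to flag rows that fail validation and route them to a separate quarantine table:

    from pyspark.sql import functions as F

    # Flag rows that fail simple validation rules (rules, columns and paths are illustrative)
    validated = raw_df.withColumn(
        "is_valid",
        F.col("id").isNotNull() & (F.col("amount") >= 0))

    # Good rows continue to the curated table
    (validated.filter("is_valid").drop("is_valid")
        .write.format("delta").mode("append").save("/lake/silver/orders"))

    # Bad rows are kept aside for inspection and reprocessing
    (validated.filter("NOT is_valid")
        .write.format("delta").mode("append").save("/lake/quarantine/orders"))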
0 votes, 1 answer

What version of Delta Lake is supported by Synapse built-in SQL pool

I'm using Synapse Analytics to read from a Data lake containing Delta tables. The Delta tables were written using the latest Delta version. How can I verify what version of Delta is running in the Built-in serverless SQL pool? I tried searching…
Tacti • 15 • 4
0 votes, 0 answers

Writing to Delta in constant time - how?

We are seeing that our Delta writers that append data to a Delta lake are taking an increasingly long time to write their data. For relatively small sets of data (single megabytes), write times will eventually be in the range of minutes on an Azure Delta…
Krumelur • 31,081 • 7 • 77 • 119
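As an aside, a hedged sketch of the routine maintenance that often keeps append latency from growing as a Delta table accumulates small files and log entries; the table name and retention are illustrative, and the Python OPTIMIZE API assumes Delta Lake 2.0 or later:

    from delta.tables import DeltaTable

    dt = DeltaTable.forName(spark, "events")   # placeholder table name

    # Compact many small files into fewer large ones
    dt.optimize().executeCompaction()

    # Remove files no longer referenced by the transaction log (retention in hours)
    dt.vacuum(168)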
0 votes, 0 answers

DataBricks update_postimage and update_preimage position in Change Data Feed (CDF)

I have a question regarding the position of update_preimage and update_postimage in the table that contains the row-level changes in CDF. Does every update_preimage have its update_postimage right below it, i.e. are they in two adjacent rows? The reason…
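For reference, a hedged sketch of how the change feed can be inspected; the table name and starting version are illustrative, and the table is assumed to have Change Data Feed enabled. The _change_type column carries the update_preimage / update_postimage markers the question refers to:

    # Read the Change Data Feed from a given table version onwards
    cdf = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 1)
           .table("my_table"))

    # Inspect the change rows in commit order
    (cdf.select("_change_type", "_commit_version", "_commit_timestamp")
        .orderBy("_commit_version")
        .show())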
0 votes, 0 answers

sample data from columns in synapse delta table

I am trying to profile the data in an Azure Synapse Delta table. One of the things I want to collect is a small sample of field values for each column. I can do a column-by-column version and union the results, e.g. SELECT '_ModifiedDatetime' AS…
david • 7 • 2
0 votes, 0 answers

delta live tables aggregations in gold layer

In our DLT gold layer we have some aggregation queries that are live, so each run recomputes the whole thing. We would like to make this quicker and use CDF for business-level aggregates like below: https://www.databricks.com/notebooks/delta-lake-cdf.html We…
0 votes, 1 answer

Synapse views error: Duplicate column ordinal cannot be provided in WITH schema clause

I am new to Synapse and I am trying to create a Synapse view from a Delta table using OPENROWSET. Getting the error "Duplicate column ordinal cannot be provided in WITH schema clause". Not sure why I should see this error, I am not using any with…
Heether • 152 • 1 • 1 • 6
0 votes, 1 answer

How to create delta table using Delta lake standalone and write data

I am able to read a Delta table created in Amazon S3 using the Standalone API, but I am unable to create a Delta table and insert data into it. In the link below for Delta Lake, it is mentioned to use the Zappy reader and writer, which is fictitious and used as…
Venom • 1 • 1
0 votes, 1 answer

How to merge data into Delta Table without inserting certain columns (PySpark)

Context: Delta Lake allows developers to merge data into a table with something called a Merge Statement. I am using Delta Lake's Change Data Feed feature to determine whether I want to insert, update, or delete a certain row. This is determined by…
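A hedged sketch of a merge that updates and inserts only selected columns, rather than whenMatchedUpdateAll / whenNotMatchedInsertAll; the table, key and column names are illustrative, and changes is assumed to be an existing source DataFrame:

    from delta.tables import DeltaTable

    target = DeltaTable.forName(spark, "dim_customer")   # placeholder table name
    (target.alias("t")
       .merge(changes.alias("s"), "t.customer_id = s.customer_id")
       # Only the listed columns are touched on update ...
       .whenMatchedUpdate(set={"email": "s.email", "updated_at": "s.updated_at"})
       # ... and only the listed columns are populated on insert
       .whenNotMatchedInsert(values={
            "customer_id": "s.customer_id",
            "email": "s.email",
            "updated_at": "s.updated_at"})
       .execute())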