Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs; a minimal usage sketch follows the feature list below.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
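A minimal PySpark sketch of the features listed above (ACID writes in open Parquet-based storage, time travel, and MERGE-based updates), assuming the Delta Lake package is available on the Spark classpath; the path and column names are illustrative:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder
    .appName("delta-demo")
    # Required on open-source Spark so the delta format and SQL commands resolve.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"])

# ACID write in the open, Parquet-based Delta format.
events.write.format("delta").mode("overwrite").save("/tmp/events")

# Time travel: read an earlier snapshot by version number.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events")

# Updates and deletes via the MERGE API.
target = DeltaTable.forPath(spark, "/tmp/events")
updates = spark.createDataFrame([(1, "purchase")], ["id", "event"])
(
    target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```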
1226 questions
0
votes
0 answers

Where is the /tmp delta lake table when writing in Azure Synapse?

I was following a tutorial (such as this one) and others have suggested that it is a private/temporary delta table. I am running my code in a Synapse workspace, obviously connected to a spark session, and it runs fine, with the write command…
0
votes
0 answers

Reading a large number of XMLs using a UDF into a Delta table in Spark Streaming is very slow

We have a repository of input files like \2023\01\*\*Events*.xml => this represents the path to the input XML files that need to be read in Spark Structured Streaming, so that the events can then be parsed, converted to the relevant DataFrame and…
0
votes
1 answer

DeltaSharing with CDF complains: cdf is not enabled on table - although reading with Delta lake works

SETUP: Spark 3.2.3, DeltaSharing test server running locally. I am writing and reading data in a Delta lake with Spark. Now I would like to enable CDF so that I can continuously read only the changes using DeltaSharing. DeltaSharing without CDF…
thhappy
  • 13
  • 2
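For the "cdf is not enabled on table" complaint above, a hedged sketch of enabling Change Data Feed on an existing Delta table and reading the feed with Spark; the table path and starting version are illustrative, and CDF only records changes made after the property is set:

```python
# Enable CDF on an existing Delta table (path-based table shown here).
spark.sql("""
    ALTER TABLE delta.`/data/my_table`
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read only the changes, starting from the first version written after CDF was enabled.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 1)
    .load("/data/my_table")
)
```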
0
votes
1 answer

Delta Lake Replace Where SQL Clause Error

Please help, I am getting this error when using the Delta "replace where" SQL clause (not Python): ParseException: mismatched input 'replace' expecting {'(', 'DESC', 'DESCRIBE', 'FROM', 'MAP', 'REDUCE', 'SELECT', 'TABLE', 'VALUES', 'WITH'}(line 1, pos 72) == SQL…
bda
  • 372
  • 1
  • 7
  • 22
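The error above comes from SQL parsing; as a hedged alternative, the same selective overwrite can be expressed with the Delta DataFrame writer's replaceWhere option. The path, columns, and predicate below are illustrative:

```python
# Overwrite only the rows matching the predicate; new_data must satisfy it.
(
    new_data.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", "event_date >= '2023-01-01' AND event_date < '2023-02-01'")
    .save("/data/events")
)
```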
0
votes
0 answers

How to read a delta lake table from an s3 bucket

As seen in the documentation here, it is possible to use the deltalake::open_table method to open a delta lake table located on the file system. This I have been able to do. But I have not been able to figure out how to open a delta lake table on…
Finlay Weber
  • 2,989
  • 3
  • 17
  • 37
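The question refers to the Rust crate's deltalake::open_table; as a hedged sketch, the Python bindings to the same delta-rs library accept an s3:// URI plus storage_options. The bucket name and credential values below are placeholders:

```python
from deltalake import DeltaTable

dt = DeltaTable(
    "s3://my-bucket/path/to/table",
    storage_options={
        "AWS_ACCESS_KEY_ID": "...",
        "AWS_SECRET_ACCESS_KEY": "...",
        "AWS_REGION": "us-east-1",
    },
)
df = dt.to_pandas()  # materialize the current snapshot
```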
0
votes
1 answer

Saved delta file reads as a df - is it still part of delta lake?

I have problems understanding the concept of delta lake. Example: I read a parquet file: taxi_df = (spark.read.format("parquet").option("header", "true").load("dbfs:/mnt/randomcontainer/taxirides.parquet")) Then I save it using…
BigMadAndy
  • 153
  • 1
  • 9
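A hedged sketch of the point at issue: saving in Delta format and reading it back always yields a Spark DataFrame, but the underlying directory remains a Delta table (with its _delta_log) as long as it is read and written with format("delta"). The output path below is a hypothetical example:

```python
from delta.tables import DeltaTable

# Save the DataFrame from the question as a Delta table.
taxi_df.write.format("delta").mode("overwrite").save("dbfs:/mnt/randomcontainer/taxirides_delta")

# Reading it back gives an ordinary DataFrame...
df = spark.read.format("delta").load("dbfs:/mnt/randomcontainer/taxirides_delta")

# ...while the directory is still a Delta table with history, time travel, etc.
DeltaTable.forPath(spark, "dbfs:/mnt/randomcontainer/taxirides_delta").history().show()
```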
0
votes
0 answers

Duplicate data from streams on merge in Delta Tables

I have a source table with, say, the following data:
+----------------+---+--------+-----------------+---------+
|registrationDate| id|custName|            email|eventName|
+----------------+---+--------+-----------------+---------+
|      17-02-2023|  2|…
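A hedged sketch of the usual pattern for merging a stream into a Delta table without duplicates: deduplicate each micro-batch on the merge key inside foreachBatch before calling MERGE. The target path, key, and checkpoint location are illustrative:

```python
from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # Keep one row per key within the micro-batch, then upsert it.
    deduped = batch_df.dropDuplicates(["id"])
    target = DeltaTable.forPath(spark, "/data/target")
    (
        target.alias("t")
        .merge(deduped.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    source_stream.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/chk/target")
    .start()
)
```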
0
votes
0 answers

Are Glue Crawlers necessary to use Delta Tables in Athena

I'm doing some testing to integrate the Delta lake format into AWS Athena. Currently, I have some delta tables already in Athena, created manually via the symlink format manifest. I was reading a recent article describing how to create AWS Delta…
0
votes
2 answers

java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

I am new to Spark and Delta Lake and am trying to do a POC with PySpark, using MinIO as Delta Lake's storage backend. However, I am getting the error Class org.apache.hadoop.fs.s3a.S3AFileSystem not found. I have added the jar in the Python code and…
Aman
  • 193
  • 2
  • 15
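A hedged sketch of wiring PySpark and Delta Lake to a MinIO (S3A) backend; the ClassNotFoundException usually means hadoop-aws (and its matching aws-java-sdk-bundle) is not on the driver/executor classpath. The versions, endpoint, and credentials below are illustrative, and the hadoop-aws version must match the Hadoop version bundled with your Spark:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "io.delta:delta-core_2.12:2.2.0,org.apache.hadoop:hadoop-aws:3.3.2")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # MinIO endpoint and credentials for the S3A filesystem.
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio")
    .config("spark.hadoop.fs.s3a.secret.key", "minio123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)
```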
0
votes
0 answers

Spark 3.3.1 is throwing the error "could not find file"

I am getting the below error even though I am running as administrator C:\Spark\spark-3.3.1-bin-hadoop3\bin> C:\Spark\spark-3.3.1-bin-hadoop3\bin>spark-shell --packages io.delta:delta-core_2.12:1.2.1,org.apache.hadoop:hadoop-aws:3.3.1 --conf…
Aman
  • 193
  • 2
  • 15
0
votes
1 answer

Azure Synapse Delta Table Creation and Importing Data From an ADLS Delta Lake

We have a requirement to load the data from an ADLS delta lake into a Synapse table. Actually, we are writing the delta format data into ADLS Gen2 from Databricks. Now we want to load the data from ADLS Gen2 (delta table) into a Synapse delta table.…
Developer KE
  • 71
  • 1
  • 2
  • 14
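A hedged sketch of one way to do this from a Synapse Spark pool: read the Delta data written to ADLS Gen2 and register it as a lake database table so it can be queried in Synapse. The storage account, container, database, and table names are placeholders:

```python
path = "abfss://container@storageaccount.dfs.core.windows.net/delta/mytable"

# Read the Delta table written from Databricks.
df = spark.read.format("delta").load(path)

# Expose it as a table in the Synapse lake database, pointing at the same location.
spark.sql(f"CREATE TABLE IF NOT EXISTS mydb.mytable USING DELTA LOCATION '{path}'")
```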
0
votes
0 answers

Pre-populate a silver delta table from a bronze table using a batch job, then stream to it from the same bronze table

I have a pipeline like this: kafka->bronze->silver The bronze and silver tables are Delta Tables. I'm streaming from bronze to silver using regular spark structured-streaming. I changed the silver schema, so I want to reload from the bronze into…
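A hedged sketch of one common approach: backfill silver from bronze with a batch job, then start the stream from the bronze version after the one the backfill read, so records are neither skipped nor reprocessed. The paths are illustrative and to_silver is a hypothetical transformation function:

```python
from delta.tables import DeltaTable

bronze_path, silver_path = "/data/bronze", "/data/silver"

# 1. Note the current bronze version, then batch-backfill everything up to it.
backfill_version = DeltaTable.forPath(spark, bronze_path).history(1).collect()[0]["version"]
(
    spark.read.format("delta").option("versionAsOf", backfill_version).load(bronze_path)
    .transform(to_silver)  # hypothetical bronze -> silver transformation
    .write.format("delta").mode("overwrite").save(silver_path)
)

# 2. Stream new bronze changes, starting after the backfilled version.
(
    spark.readStream.format("delta")
    .option("startingVersion", backfill_version + 1)
    .load(bronze_path)
    .transform(to_silver)
    .writeStream.format("delta")
    .option("checkpointLocation", "/chk/silver")
    .start(silver_path)
)
```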
0
votes
1 answer

How would you Z-Order and Bloom filter this Delta Lake table?

How should I implement an indexing strategy for this fact table? It contains about 5 million rows. Is it worth adding a Bloom filter index here as well? If so, in which way?
Bartosz
  • 35
  • 4
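A hedged sketch of Z-ordering the fact table on the columns most often used in point-lookup filters, using the Python optimize API available in recent Delta Lake releases; the path and column names are illustrative. Bloom filter indexes are a separate, Databricks-specific feature and are not shown here:

```python
from delta.tables import DeltaTable

fact = DeltaTable.forPath(spark, "/data/fact_table")
# Compact files and co-locate rows by the chosen columns for data skipping.
fact.optimize().executeZOrderBy("customer_id", "event_date")
```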
0
votes
0 answers

Delta: Insert based on condition (WhenMatchedInsert)

I am looking for a smarter way to perform an insert into a delta table based on a condition (InsertWhenMatched) where I don't need to fake skipping the update part of the merge with update_condition = "true = false". I wasn't able to…
Nikolaos
  • 21
  • 2
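A hedged sketch of an insert-only MERGE: omit the whenMatched clause entirely instead of faking it with an always-false update condition, and put the insert condition on whenNotMatchedInsert. The path, key, and columns are illustrative:

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/data/target")
(
    target.alias("t")
    .merge(source_df.alias("s"), "t.id = s.id")
    # No whenMatched clause: existing rows are left untouched.
    .whenNotMatchedInsert(
        condition="s.status = 'active'",  # optional extra insert condition
        values={"id": "s.id", "status": "s.status"},
    )
    .execute()
)
```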
0
votes
0 answers

How can I improve the performance of a CDC merge operation between two tables in Spark DeltaLake?

I have implemented a merge operation for change data capture (CDC) between two tables, bronze and silver, using Spark on Databricks with Delta Lake. The bronze table has a total record count of 500 million to 1 billion and the silver table has 200-300…
icyanide
  • 31
  • 5
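A hedged sketch of one common way to speed up a large CDC MERGE: constrain the match condition with a partition-pruning predicate derived from the incoming batch, so only the affected silver partitions are scanned and rewritten. The paths, key, and partition column are illustrative:

```python
from delta.tables import DeltaTable

# Collect the partition values touched by this CDC batch (assumes event_date partitioning).
dates = [r["event_date"] for r in cdc_batch.select("event_date").distinct().collect()]
date_list = ", ".join(f"'{d}'" for d in dates)

silver = DeltaTable.forPath(spark, "/data/silver")
(
    silver.alias("t")
    .merge(
        cdc_batch.alias("s"),
        f"t.id = s.id AND t.event_date IN ({date_list})",  # enables partition pruning
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```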