Questions tagged [delta-lake]

Delta Lake is an open source storage layer that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture (see the sketches after this list).
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
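
As a quick orientation for the tag, here is a minimal sketch of an ACID write plus time travel in PySpark. It assumes the open source delta-spark package (pip install delta-spark) and a local Spark session; the /tmp/delta/events path is a hypothetical example chosen for illustration, not anything prescribed by the project.

    # Sketch: create a Delta table, overwrite it, then read an earlier version.
    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    builder = (
        SparkSession.builder.appName("delta-tag-sketch")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    path = "/tmp/delta/events"  # hypothetical local path

    # Version 0: data is stored as Parquet files plus a _delta_log transaction log.
    spark.range(0, 5).write.format("delta").save(path)

    # Version 1: the overwrite is committed atomically as a new table version.
    spark.range(5, 10).write.format("delta").mode("overwrite").save(path)

    # Latest version vs. time travel back to version 0.
    spark.read.format("delta").load(path).show()
    spark.read.format("delta").option("versionAsOf", 0).load(path).show()

The same path also works as a streaming source or sink (spark.readStream.format("delta").load(path)), which is what the unified batch and streaming bullet above refers to.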
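
Schema enforcement and schema evolution, continuing the sketch above (the extra country column is made up): appending a DataFrame whose schema does not match the table fails, unless the write opts in with the mergeSchema option.

    from pyspark.sql import Row

    extra = spark.createDataFrame([Row(id=10, country="DE")])

    # Schema enforcement: the unexpected column is rejected with an AnalysisException.
    try:
        extra.write.format("delta").mode("append").save(path)
    except Exception as err:
        print("rejected:", type(err).__name__)

    # Schema evolution: mergeSchema adds the new column to the table schema.
    (extra.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .save(path))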
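
Updates, deletes, merges and the audit history, again continuing the same sketch, using the DeltaTable API from the delta-spark package:

    from delta.tables import DeltaTable
    from pyspark.sql.functions import expr

    target = DeltaTable.forPath(spark, path)

    # Each of these runs as a transaction and produces a new table version.
    target.delete("id < 7")
    target.update(condition=expr("id = 8"), set={"id": expr("id * 10")})

    # Upsert (MERGE): update rows that match on id, insert the ones that do not.
    updates = spark.range(7, 12)
    (target.alias("t")
        .merge(updates.alias("s"), "t.id = s.id")
        .whenMatchedUpdate(set={"id": "s.id"})
        .whenNotMatchedInsert(values={"id": "s.id"})
        .execute())

    # The transaction log doubles as an audit trail.
    target.history().select("version", "operation").show()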
1226 questions
0 votes, 1 answer

How to set up Spark SQL to work with Delta Lake tables with Glue metastore?

I followed these instructions to set up a Delta Lake table and I can query it with Athena but not with Spark SQL. It is a Delta Lake table that has a metastore defined in Glue. If I execute the following query spark.sql("SELECT * FROM…
Jacek Laskowski • 72,696 • 27 • 242 • 420
0 votes, 0 answers

Azure Databricks community edition COPY INTO command error

I have been getting the following error upon running the COPY INTO command in Databricks Community edition: Error in SQL statement: UnsupportedOperationException: com.databricks.backend.daemon.data.client.DBFSV1.createAtomicIfAbsent(path:…
SAR182 • 11 • 3
0 votes, 1 answer

How can I transition from Azure Data Lake, with data partitioned by date folders, into Delta Lake?

I own an Azure Data Lake Gen2 with data partitioned by datetime nested folders. I want to provide the Delta Lake format to my team but I am not sure if I should create a new storage account and copy the data into Delta format or if it would be best…
0 votes, 2 answers

Azure Purview scan on Databricks Delta Lake shows "Error: (3913) JavaException: Must have Java 8 or newer installed"

We have our Delta Lake from which we can query data. We also connected it with Power BI to make interactive dashboards. It's in production. Now we want to use Azure Purview to get all the data governance and data catalog capabilities on top of this…
pd farhad • 6,352 • 2 • 28 • 47
0 votes, 2 answers

Unable to read Databricks Delta / Parquet File with Delta Format

I am trying to read a Delta / Parquet file in Databricks using the following code: df3 = spark.read.format("delta").load('/mnt/lake/CUR/CURATED/origination/company/opportunities_final/curorigination.presentation.parquet') However, I'm getting…
Patterson • 1,927 • 1 • 19 • 56
0 votes, 0 answers

How can I connect a local Delta Lake with Talend for data profiling purposes?

As I am new to Talend, I am trying to connect my local Delta Lake with Talend to do some data profiling on it.
khÜs h • 51 • 6
0 votes, 1 answer

How to determine the number of executors to read a Delta table?

I have a Delta table which is partitioned by multiple keys, one of which is a date truncated to the hour, excluding minute details (for example, Fri, 15 Jul 2022 07). Now, with data continuously ingested via batch and streaming ingestion workflows, what would…
0 votes, 1 answer

Delta Table Access Restriction by Process

Is there a way to restrict access to a Delta table based on a process or client id? Here is my scenario: I have a streaming job that writes to a Delta table, and sometimes the job fails due to concurrency issues or merge collisions that are triggered…
Up_One • 5,213 • 3 • 33 • 65
0 votes, 1 answer

Overwrite specific partitions in Spark DataFrame write method with Delta format

I am able to overwrite a specific partition with the setting below when using the Parquet format, without affecting data in other partitions…
0 votes, 2 answers

Delete from ADLS Gen2 Delta files fails with error

I have a requirement where I am deleting duplicate records from a Delta file using Databricks SQL. Below is my query: %sql delete from delta.`adls_delta_file_path` where code = 'XYZ ' but it gives the below…
0 votes, 1 answer

How to add an auto-increment column to an existing Delta table in Databricks

In Databricks I have an existing Delta table to which I want to add one more column, an Id, so that each row has a unique, consecutive id number (the way a primary key works in SQL). So far I have tried converting the Delta table to a PySpark DataFrame…
0 votes, 2 answers

Getting AnalysisException on Databricks while creating Delta format

I am trying to write a DataFrame in the .delta format but am getting an 'AnalysisException'. Code: df = spark.read.format("csv").option("header",…
Sreedhar • 29,307 • 34 • 118 • 188
0 votes, 2 answers

Delta table configuration in open source and Databricks versions is different

I'm trying to set the isolation level to Serializable on open source Delta using an Azure Synapse notebook. Command: ALTER TABLE schema.table SET TBLPROPERTIES ('delta.isolationLevel' = 'Serializable') It seems like Delta is not able to identify…
Kiran Gali • 101 • 6
0 votes, 1 answer

Combine batch data into Delta format in a data lake using Synapse and PySpark?

I currently have a data lake with several daily interval tables of data in the bronze layer. They are in CSV format, and new daily CSV tables are regularly ingested into the bronze folder. I would like to transform them, e.g. editing some…
AzUser1 • 183 • 1 • 14
0 votes, 1 answer

Spark stuck on SynapseLoggingShim.scala while writing into Delta table

I'm streaming data from Kafka and trying to merge ~30 million records into a Delta Lake table. def do_the_merge(microBatchDF, partition): deltaTable.alias("target")\ .merge(microBatchDF.alias("source"), "source.id1 = target.id2 and…
Kiran Gali • 101 • 6