Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It provides ACID transactions, scalable metadata handling, time travel, a unified batch and streaming source and sink, and full compatibility with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
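
To make the quoted feature list concrete, here is a minimal PySpark sketch covering batch writes, time travel, merge (upsert), and the audit history. It is a sketch only: the path, table contents, and column names are placeholders, and it assumes the delta-spark pip package is installed locally.

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable

# Local SparkSession with Delta Lake enabled (assumes the delta-spark package).
builder = (SparkSession.builder.appName("delta-demo")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/delta/events"  # placeholder path

# Batch write: an ACID, Parquet-backed Delta table.
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"]) \
     .write.format("delta").mode("overwrite").save(path)

# Time travel: read the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Merge (upsert): the API behind updates/deletes and CDC-style pipelines.
updates = spark.createDataFrame([(2, "b2"), (3, "c")], ["id", "value"])
target = DeltaTable.forPath(spark, path)
(target.alias("t")
       .merge(updates.alias("u"), "t.id = u.id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())

# Audit history: the transaction log records every change to the table.
target.history().show(truncate=False)
```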
1226 questions
0
votes
1 answer

Schema Evolution with overwrite in Delta tables

I am trying to include schema changes (a new column type and a dropped column) in Delta tables. As per the documentation: df3.write.format("delta").mode("overwrite").option("mergeSchema", "true").save(deltapath) This way I…
Ritz
  • 3
  • 1
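
For reference, a hedged sketch of the two overwrite options involved here. It assumes a Delta-enabled SparkSession `spark`, a placeholder DataFrame `df3`, and a placeholder path; per the Delta docs, mergeSchema only adds new columns, while dropping or retyping columns on overwrite needs overwriteSchema.

```python
# Assumes a Delta-enabled SparkSession and a DataFrame `df3` (placeholders).
delta_path = "/tmp/delta/table"

# mergeSchema: adds new columns on write, but cannot drop or retype existing ones.
(df3.write.format("delta")
     .mode("overwrite")
     .option("mergeSchema", "true")
     .save(delta_path))

# overwriteSchema: replaces the table schema entirely, allowing dropped columns
# and changed column types (only meaningful together with mode("overwrite")).
(df3.write.format("delta")
     .mode("overwrite")
     .option("overwriteSchema", "true")
     .save(delta_path))
```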
0
votes
0 answers

Multi-cluster writes to Delta Lake Storage in S3

Trying out https://delta.io/blog/2022-05-18-multi-cluster-writes-to-delta-lake-storage-in-s3/ and https://docs.google.com/document/d/1Gs4ZsTH19lMxth4BSdwlWjUNR-XhKHicDvBjd2RqNd8/edit#. DynamoDB tables, jar files, and jar file path provided. When I run my…
Soumil Nitin Shah
  • 634
  • 2
  • 7
  • 18
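
For context, the setup described in that blog post is usually configured roughly as below. This is a sketch based on the Delta Lake docs, not the asker's job: the package versions, DynamoDB table name, region, bucket, and path are all placeholders.

```python
from pyspark.sql import SparkSession

# Sketch of S3 multi-cluster writes (Delta >= 1.2): commits are coordinated
# through a DynamoDB table via S3DynamoDBLogStore. Versions and names below
# are placeholders.
spark = (SparkSession.builder.appName("delta-s3-multicluster")
    .config("spark.jars.packages",
            "io.delta:delta-core_2.12:2.1.0,io.delta:delta-storage-s3-dynamodb:2.1.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Route S3 writes through the DynamoDB-coordinated LogStore.
    .config("spark.delta.logStore.s3a.impl", "io.delta.storage.S3DynamoDBLogStore")
    .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName", "delta_log")
    .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region", "us-east-1")
    .getOrCreate())

spark.range(10).write.format("delta").mode("append").save("s3a://my-bucket/delta/table")
```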
0
votes
1 answer

Databricks SQL AddColumn While Creating Delta Table

I am trying to create a Delta table with an added column in the DBSQL metastore from a Delta bucket. I do not want to pass in the schema, as this may change in the bucket over time, but I do want to add a column to the metastore only that is a…
Prof. Falken
  • 499
  • 6
  • 21
0
votes
0 answers

Writing to Delta Lake in Databricks error: HttpRequest 409 err PathAlreadyExist

Sometimes I get this error when a job in Databricks is writing to Azure Data Lake: HttpRequest:…
0
votes
0 answers

How to set delta.logRetentionDuration locally?

I can easily set this option on Databricks, but how do I do the same with local PySpark and the delta-spark library? I've tried to set this option in sparkConf, but it doesn't work: ("spark.databricks.delta.logRetentionDuration", "interval 52 weeks")
Col1ns
  • 25
  • 4
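
A hedged sketch of how this is typically handled with local PySpark: delta.logRetentionDuration is a table property rather than a plain session conf, so it is usually set via TBLPROPERTIES on the table, or given a session-wide default for new tables via the spark.databricks.delta.properties.defaults.* prefix described in the Delta docs. The table path is a placeholder.

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (SparkSession.builder.appName("delta-local")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Default log retention for tables created by this session (optional).
    .config("spark.databricks.delta.properties.defaults.logRetentionDuration",
            "interval 52 weeks"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# For an existing table, set it as a table property rather than a Spark conf.
spark.sql("""
  ALTER TABLE delta.`/tmp/delta/events`
  SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 52 weeks')
""")
```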
0
votes
0 answers

Update metadata to add existing file to Delta Lake

I have many small JSON files and am hoping to aggregate them into larger Parquet files using AWS Athena, write the output to the S3 folder where the Delta Table is stored, and then append the data to Delta Lake. I am hoping to update the metadata…
jaymett
  • 1
  • 1
0
votes
1 answer

Unable to create file using Spark on Client Mode

I have Spark 3.1.2 running in client mode on K8s (I have 8 workers). I set up NFS storage to update a Delta file stored on it. Spark is throwing the following error: java.io.IOException: Cannot create…
OdiumPura
  • 444
  • 5
  • 25
0
votes
1 answer

Does Azure Databricks use Query Acceleration in Azure Data Lake Storage?

Does Azure Databricks use the query acceleration functions in Azure Data Lake Storage gen2? In documentation we can see that spark can benefit from this functionality. I'm wondering if, in the case where I only use the delta format, I'm profiting…
0
votes
3 answers

Azure Data Factory DataFlow Sink to Delta fails when schema changes (column data types and similar)

We have an Azure Data Factory dataflow that sinks into Delta. We have the Overwrite and Allow insert options set and Vacuum = 1. When we run the pipeline over and over with no change in the table structure, the pipeline is successful. But when the table…
GVFLUSA
  • 25
  • 4
0
votes
1 answer

File is not created when running CREATE TABLE in open-source Delta Lake

I am using AWS EMR with open-source Delta Lake. In Python, dataframe.write.format('delta').save() works fine, but I want to use it in SQL. I tried to create a Delta table in SQL as below: spark.sql(''' CREATE OR REPLACE TABLE test.foo …
dahuin
  • 67
  • 7
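
For reference, a minimal sketch of creating a Delta table from SQL with open-source Delta. The database, table, columns, and LOCATION are placeholders, and it assumes a SparkSession `spark` that already has the Delta SQL extension and catalog configured (which is what normally makes USING delta work outside Databricks).

```python
# Assumes a Delta-enabled SparkSession with a Hive/Glue metastore configured.
spark.sql("CREATE DATABASE IF NOT EXISTS test")

spark.sql("""
  CREATE OR REPLACE TABLE test.foo (
    id BIGINT,
    name STRING
  )
  USING delta
  LOCATION 's3://my-bucket/delta/foo'   -- placeholder; omit for a managed table
""")

# Inspect where the table lives and its Delta properties.
spark.sql("DESCRIBE DETAIL test.foo").show(truncate=False)
```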
0
votes
0 answers

PySpark pandas converting Excel to Delta Table Failed

I am using the pyspark.pandas read_excel function to import data and saving the result in the metastore using to_table. It works fine if format='parquet'. However, the job hangs if format='delta'. The cluster idles after creating the Parquet files and does not…
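
A minimal sketch of the flow the excerpt describes, for orientation only: the file path and table name are placeholders, and read_excel additionally assumes an Excel engine such as openpyxl is installed on the cluster.

```python
import pyspark.pandas as ps

# Read the workbook into a pandas-on-Spark DataFrame (placeholder path).
pdf = ps.read_excel("data/input.xlsx", sheet_name=0)

# Save to the metastore; format="delta" writes a Delta table instead of Parquet.
pdf.to_table("analytics.excel_import", format="delta", mode="overwrite")
```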
0
votes
0 answers

Data published to Pulsar by Apache NiFi (pulsar-nifi-bundle) is not persisted in the lakehouse using the Delta Lake sink connector

Does anyone know if this Pulsar NiFi connector (https://github.com/david-streamlio/pulsar-nifi-bundle) can publish to a Delta sink (the Delta Lake sink connector)? I am getting data into Pulsar with NiFi and I would like to move this data from Pulsar to…
sejuba
  • 63
  • 5
0
votes
1 answer

Converting JSON .gz files into Delta Tables

I have Datadog log data archives streaming to an Azure Blob, stored as a single 150 MB JSON file compressed into a 15 MB .gz file. These are being generated every 5 minutes. I need to do some analytics on this data. What is the most efficient and cost…
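
One common approach (a sketch, not necessarily the cheapest option for this workload): Spark decompresses .gz JSON transparently, so the archives can be read in batch, or incrementally with Structured Streaming, and appended to a Delta table. The storage account, container, and paths below are placeholders, and a Delta-enabled SparkSession `spark` is assumed.

```python
# Placeholders: source blob path and Delta table path.
src = "wasbs://archives@myaccount.blob.core.windows.net/datadog/*.json.gz"
dst = "/mnt/delta/datadog_logs"

# Batch: Spark handles gzip compression transparently when reading JSON.
logs = spark.read.json(src)
logs.write.format("delta").mode("append").save(dst)

# Incremental alternative: pick up new 5-minute archives as they land.
stream = (spark.readStream
          .schema(logs.schema)          # streaming JSON needs an explicit schema
          .json(src))
(stream.writeStream
       .format("delta")
       .option("checkpointLocation", dst + "/_checkpoint")
       .outputMode("append")
       .start(dst))
```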
0
votes
0 answers

Conflict Between Compaction and Insert in Delta Lake using Spark

I am getting an issue while using Delta Lake in the AWS cloud. We are using EMR on EKS to run Spark jobs and saving data on S3. The project in which I am using Delta Lake has late-arriving records coming in every hour. The data is partitioned (time-sorted)…
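
For context, the compaction recipe from the Delta docs rewrites a partition with dataChange=false so the commit is marked as a rearrangement rather than new data; whether it still conflicts with concurrent inserts depends on the isolation level and on whether both writers touch the same partitions. The path, partition predicate, and file count below are placeholders.

```python
# Sketch of the Delta docs' compaction recipe; path, predicate, and numFiles
# are placeholders. dataChange=false marks the commit as a file rearrangement.
path = "s3a://my-bucket/delta/events"
partition = "date = '2023-01-01'"
num_files = 16

(spark.read.format("delta")
      .load(path)
      .where(partition)
      .repartition(num_files)
      .write
      .option("dataChange", "false")
      .format("delta")
      .mode("overwrite")
      .option("replaceWhere", partition)
      .save(path))
```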
0
votes
0 answers

Conflict (in metadata) Between Compaction and Insert in Delta Lake using Spark

I am getting an issue while using Delta Lake in the AWS cloud. We are using EMR on EKS to run Spark jobs and saving data on S3. The project in which I am using Delta Lake has late-arriving records coming in every hour. The data is partitioned (time-sorted)…