Questions tagged [delta-lake]

Delta Lake is an open source storage layer that brings ACID transactions to Apache Spark and big data workloads. It provides scalable metadata handling, time travel (data versioning), a unified batch and streaming source and sink, and full compatibility with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: The Delta Lake transaction log records details about every change made to the data, providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update, and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture (see the sketch after this list).
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
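
For illustration, a minimal PySpark sketch of the time travel, delete, and audit-history APIs described above; the table path is hypothetical, and the session is assumed to already be configured for Delta (delta-core on the classpath):

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    # Assumes a session already configured for Delta (e.g. started with --packages io.delta:delta-core)
    spark = SparkSession.builder.getOrCreate()
    path = "/tmp/delta/events"  # hypothetical location

    # ACID write: store a small DataFrame as a Delta table (Parquet under the hood)
    spark.range(0, 5).write.format("delta").mode("overwrite").save(path)

    # Time travel: read the table as of an earlier version
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

    # Updates and deletes via the DeltaTable API
    DeltaTable.forPath(spark, path).delete("id < 2")

    # Audit history: every change is recorded in the transaction log
    DeltaTable.forPath(spark, path).history().show(truncate=False)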
1226 questions
1
vote
1 answer

Delta Lake: Performance Challenges

Approach 1: My input data is a bunch of JSON files. After preprocessing, the output is a pandas dataframe, which will be written to an Azure SQL Database table. Approach 2: I implemented the delta lake, where the output pandas dataframe is…
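
A sketch of what the second approach typically looks like, writing the preprocessed pandas dataframe out as a Delta table through Spark; the column names and output path are hypothetical:

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # Delta packages/configs assumed to be set

    # Hypothetical preprocessed output of the JSON pipeline
    pdf = pd.DataFrame({"id": [1, 2, 3], "payload": ["a", "b", "c"]})

    # Convert to a Spark DataFrame and append it to a Delta table
    spark.createDataFrame(pdf).write.format("delta").mode("append").save("/mnt/curated/output")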
1
vote
1 answer

Deleting from a DeltaTable using a dataframe of keys

I want to perform a delete operation on a DeltaTable, where the keys to be deleted are already present in a DataFrame. Currently I am collecting the DataFrame on the driver and then running the delete operation. However, it seems very inefficient to…
ksceriath
  • 184
  • 1
  • 12
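
A common way to express this without collecting the keys to the driver is a Delta merge whose only clause deletes matched rows; a minimal sketch, assuming a hypothetical table path and key column:

    from delta.tables import DeltaTable

    # `spark` is the active SparkSession and `keys_df` holds the keys to delete
    target = DeltaTable.forPath(spark, "/mnt/delta/target")

    (target.alias("t")
           .merge(keys_df.alias("k"), "t.id = k.id")
           .whenMatchedDelete()
           .execute())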
1
vote
1 answer

Snowflake interprets boolean values in parquet as NULL?

Parquet entry example (all entries have is_active_entity as true): { "is_active_entity": true, "is_removed": false }. Query that demonstrates all values are taken as NULL: select $1:IS_ACTIVE_ENTITY::boolean, count(*) from…
1
vote
1 answer

Optimize blob storage Deltalake using local scope table on Azure Databricks

How can you optimize an Azure blob storage delta table on Azure Databricks without putting the table into global scope? Optimizing and z-ordering a delta table on Azure blob storage can be done via (cf. docs): spark.sql('DROP TABLE IF EXISTS…
0vbb
  • 839
  • 11
  • 27
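
On Databricks, OPTIMIZE and ZORDER can also be run against the storage path directly, so nothing has to be registered in the global metastore; a sketch with a hypothetical path and Z-order column:

    # `spark` is the notebook's SparkSession; the path and Z-order column are assumptions
    path = "abfss://container@account.dfs.core.windows.net/delta/events"
    spark.sql(f"OPTIMIZE delta.`{path}` ZORDER BY (event_date)")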
1
vote
2 answers

DeltaLake: How to Time Travel infinitely across Datasets?

The use case: store versions of large datasets (CSV/Snowflake tables) and query across versions. DeltaLake says that unless we run the vacuum command, we retain historical information in a DeltaTable. And log files are deleted every 30 days. Here And…
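
The 30-day window mentioned above is governed by table properties and can be raised, at the cost of keeping more metadata and removed data files around; a hedged sketch against a hypothetical path-based table:

    # Keep log entries and removed data files for ~10 years instead of the defaults
    spark.sql("""
        ALTER TABLE delta.`/mnt/delta/snapshots`
        SET TBLPROPERTIES (
            'delta.logRetentionDuration' = 'interval 3650 days',
            'delta.deletedFileRetentionDuration' = 'interval 3650 days'
        )
    """)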
1
vote
0 answers

How to use delta lake with Spark 2.4.4

I'm using Spark 2.4.4. When I enter the pyspark shell, I specify the delta lake and jackson packages as below: pyspark --packages io.delta:delta-core_2.11:0.6.1,com.fasterxml.jackson.module:jackson-module-scala_2.11:2.6.7.1 --conf…
Alex Liu
  • 11
  • 2
1
vote
1 answer

How to insert into Delta table in parallel

I have a process which in short runs 100+ of the same databricks notebook in parallel on a pretty powerful cluster. Each notebook at the end of its process writes roughly 100 rows of data to the same Delta Lake table stored in an Azure Gen1…
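
Blind appends from many writers rarely conflict, but a small retry wrapper around each notebook's write is a common safeguard; a sketch in which the function and backoff policy are illustrative, not a Delta API:

    import time

    def append_with_retry(df, path, max_attempts=5):
        """Append to a Delta table, retrying transient transaction conflicts."""
        for attempt in range(1, max_attempts + 1):
            try:
                df.write.format("delta").mode("append").save(path)
                return
            except Exception as err:  # Py4J surfaces Delta's ConcurrentAppendException here
                if attempt == max_attempts or "Concurrent" not in str(err):
                    raise
                time.sleep(2 ** attempt)  # simple exponential backoff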
1
vote
1 answer

Can underlying parquet files be deleted without negatively impacting DeltaLake _delta_log

Using .vacuum() on a DeltaLake table is very slow (see Delta Lake (OSS) Table on EMR and S3 - Vacuum takes a long time with no jobs). If I manually deleted the underlying parquet files and did not add a new json log file or add a new…
Clay
  • 2,584
  • 1
  • 28
  • 63
1
vote
2 answers

Delta lake transaction log - remove properties

I am trying to convert csv files to delta format. The conversion occurs successfully, but I can see the remove property in the second json transaction file, with details of the first csv file in parquet, as below: For the first json transaction file there is…
Vishnu.K
  • 41
  • 2
  • 6
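
One way to see exactly which add and remove actions each commit recorded is to read the _delta_log JSON files directly; a sketch with a hypothetical table path (the selected columns only exist if those action types appear in the log):

    # Each line of a _delta_log JSON file is one action: add, remove, metaData, commitInfo, ...
    log = spark.read.json("/mnt/delta/converted/_delta_log/*.json")
    log.select("add.path", "remove.path", "commitInfo.operation").show(truncate=False)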
1
vote
1 answer

Spark Maven dependency incompatibility between delta-core and spark-avro

I'm trying to add delta-core to my Scala Spark project, running Spark 2.4.4. A weird behaviour I'm seeing is that it seems to conflict with spark-avro. The Maven build succeeds, but at runtime I'm getting errors. If the delta table dependency is…
Assaf Neufeld
  • 703
  • 5
  • 10
1
vote
1 answer

Configuring TTL on a deltaLake table

I'm looking for a way to add a TTL (time-to-live) to my deltaLake table so that any record in it goes away automatically after a fixed span. I haven't found anything concrete yet; does anyone know if there's a workaround for this?
Manish Karki
  • 473
  • 2
  • 11
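
There is no record-level TTL feature in these Delta releases; a common workaround is a scheduled job that deletes expired rows and then vacuums the old files. A sketch, assuming a hypothetical path, timestamp column, and retention span:

    from delta.tables import DeltaTable

    # `spark` is the active SparkSession; path, column and intervals are assumptions
    dt = DeltaTable.forPath(spark, "/mnt/delta/events")

    # Drop rows older than the chosen span, then clean up the files they lived in
    dt.delete("ingest_ts < current_timestamp() - INTERVAL 30 DAYS")
    dt.vacuum(168)  # retain 7 days (in hours) of removed files for time travel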
1
vote
2 answers

Delta Table Insert not Working Correctly, Read Errors out with - org.apache.spark.sql.AnalysisException: Table does not support reads

I am using Spark version 3.0.0, delta version io.delta:delta-core_2.12:0.7.0, on an Apache Zeppelin notebook. In the scenario below I tried to insert data into a delta table; PFB the Apache Zeppelin screenshot. STEP 1: spark.sql("drop table if exists…
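
With delta-core 0.7.0 on Spark 3.0, SQL reads and writes on Delta tables need both the session extension and the Delta catalog configured; a sketch of the usual session setup:

    from pyspark.sql import SparkSession

    # Both settings come from the Delta Lake 0.7.0 documentation for Spark 3.0
    spark = (SparkSession.builder
             .config("spark.sql.extensions",
                     "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())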
1
vote
3 answers

Read /Write delta lake tables on S3 using AWS Glue jobs

I am trying to access Delta Lake tables stored on S3 using AWS Glue jobs, but I am getting the error "Module Delta not defined". from pyspark.sql import SparkSession from pyspark.conf import SparkConf spark =…
Vidya821
  • 77
  • 2
  • 11
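
A Glue job needs the delta-core jar on its classpath (for example via the --extra-jars job parameter) in addition to the Delta session settings; a sketch of the session setup with a hypothetical S3 path:

    from pyspark.conf import SparkConf
    from pyspark.sql import SparkSession

    conf = (SparkConf()
            .set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
            .set("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog"))

    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    df = spark.read.format("delta").load("s3://my-bucket/delta/my_table")  # hypothetical path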
1
vote
1 answer

Fail to rename json files in the "_delta_log" directory when using Delta Lake on Azure Blob Storage

I'm facing an issue while renaming a _delta_log json file in the case of parallel append operations on a single table. Attempt recovered after RM restart. User class threw exception: java.io.IOException: rename from…
Shalaj
  • 579
  • 8
  • 19