Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It also provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture (see the PySpark sketch after this list).
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
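
A minimal PySpark sketch of the features above (an ACID write, time travel, and a merge). It assumes an existing SparkSession named spark that is already configured with the Delta Lake extensions, and the path /tmp/delta/events is hypothetical:

    from delta.tables import DeltaTable

    path = "/tmp/delta/events"  # hypothetical location

    # ACID write: create a Delta table from a small DataFrame.
    spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"]) \
        .write.format("delta").mode("overwrite").save(path)

    # Time travel: read the table as of an earlier version.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

    # Updates and deletes: merge new rows into the existing table.
    updates = spark.createDataFrame([(2, "b2"), (3, "c")], ["id", "value"])
    (DeltaTable.forPath(spark, path).alias("t")
        .merge(updates.alias("u"), "t.id = u.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
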
1226 questions
0
votes
0 answers

Stored procedure to create an external table for delta format fails with error "Incorrect syntax near ')'"

This stored procedure creates delta external tables in an Azure Synapse serverless database. The environment is an Azure Synapse OnDemand server (serverless database). CREATE OR ALTER PROCEDURE sp_copy_delta_to_table (@table_name NVARCHAR(100), …
mxendire
  • 41
  • 1
  • 6
0
votes
1 answer

Modify Code to support Delete in Delta Lake Tables with Databricks

In Databricks SQL and Databricks Runtime 12.1 and above, you can use the WHEN NOT MATCHED BY SOURCE clause to UPDATE or DELETE records in the target table that do not have corresponding records in the source table. Databricks recommends adding an…
Patterson
  • 1,927
  • 1
  • 19
  • 56
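
A minimal sketch of the WHEN NOT MATCHED BY SOURCE pattern the excerpt above describes, run through spark.sql; it assumes Databricks Runtime 12.1 or later, and the table names target/source and the key column id are hypothetical:

    # Hypothetical MERGE that also deletes target rows missing from the source;
    # WHEN NOT MATCHED BY SOURCE requires Databricks Runtime 12.1+.
    spark.sql("""
        MERGE INTO target AS t
        USING source AS s
        ON t.id = s.id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
        WHEN NOT MATCHED BY SOURCE THEN DELETE
    """)
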
0
votes
0 answers

Create Native table - AWS Glue Crawler - Pulumi Python AWS

I'm trying to deploy a glue crawler using the pulumi_aws.aws.glue.Crawler. In this specific case I was trying to deploy a delta lake crawler which should Create Native Tables instead of Create Symlink Tables. When checking the pulumi documentation,…
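
A minimal Pulumi Python sketch of a Glue crawler with a Delta Lake target, assuming a recent pulumi_aws version that exposes delta_targets; there, create_native_delta_table=True is what asks the crawler to create native tables rather than symlink tables. The database name, IAM role ARN and S3 path below are hypothetical:

    import pulumi_aws as aws

    # Hypothetical crawler registering Delta tables as native Glue tables.
    crawler = aws.glue.Crawler(
        "delta-crawler",
        database_name="my_database",                                  # assumed Glue database
        role="arn:aws:iam::123456789012:role/glue-crawler-role",      # assumed IAM role
        delta_targets=[
            aws.glue.CrawlerDeltaTargetArgs(
                delta_tables=["s3://my-bucket/delta/my_table/"],      # assumed table path
                write_manifest=False,
                create_native_delta_table=True,
            )
        ],
    )
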
0
votes
0 answers

Polars - Failed to read delta config: Validation failed - Unknown unit 'days'

I have some delta tables created in my Azure data lake using Azure Databricks. I am trying to follow the instruction below and to read them using…
Grevioos
  • 355
  • 5
  • 30
0
votes
1 answer

How to reset (clean up) delta table log, but keep the data

Let's say I'm at the point where the delta log of a delta table has become too big, and I'm 100% confident that it's OK to treat the current version of the table as version 0 and discard the delta log for good. What's the best way to clean up and reset the delta log but…
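
One common approach, sketched below under hypothetical paths: read the latest snapshot of the table and write it out to a new location, which starts a brand-new delta log at version 0 (readers then have to be pointed at the new path):

    # Read the latest snapshot of the existing table...
    current = spark.read.format("delta").load("/data/old_table")   # hypothetical path

    # ...and write it as a new Delta table, whose log starts at version 0.
    current.write.format("delta").mode("overwrite").save("/data/new_table")  # hypothetical path
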
0
votes
1 answer

Is there a way to have a delta table that constantly deletes data that exceeds a five-minute lifespan, based upon a createTime column inside the data?

I have a writeStream that writes data into a delta table. It filters out all data older than five minutes and only keeps what is under. But as time goes on, the rows that were written age past five minutes, so I need to have a way to delete…
Trodenn
  • 37
  • 5
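
There is no built-in row TTL in Delta, so a typical pattern is a periodic DELETE on the createTime column; a minimal sketch, assuming an existing spark session and a table at a hypothetical path:

    from delta.tables import DeltaTable

    # Delete every row whose createTime is more than five minutes in the past.
    table = DeltaTable.forPath(spark, "/data/events")   # hypothetical path
    table.delete("createTime < current_timestamp() - INTERVAL 5 MINUTES")
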
0
votes
1 answer

Delta Lake Table: How to restore to previous version only for a specific partition

I have a use case to process data into a Delta Lake table by partition. All the partitions in the table are disjoint, meaning they don't interact with each other. When I process data into a specific partition, it includes various operations like inserts,…
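
RESTORE operates on the whole table, so a common workaround for a single partition is to read the older version with time travel and overwrite just that partition using replaceWhere; a sketch with a hypothetical path, partition column and version number:

    # Read the desired historical version of only the affected partition...
    old = (
        spark.read.format("delta")
        .option("versionAsOf", 5)            # hypothetical version
        .load("/data/table")                 # hypothetical path
        .where("partition_col = 'A'")        # hypothetical partition filter
    )

    # ...then overwrite only that partition in the current table.
    (old.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", "partition_col = 'A'")
        .save("/data/table"))
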
0
votes
1 answer

How to connect to Databricks delta tables from a Django application

I am new to the Django framework. In my project I have a requirement to connect to Databricks delta tables and perform CRUD operations on them. If it is possible to connect, please list the steps. Thanks.
0
votes
0 answers

Unable to perform a join on delta table with Hash value

Delta Table A (Member): columns MemberID, FirstName, HashFN; row 1, ABC, bzh7IbKC0CEqmrPLMY/ExQ==. Delta Table B (MemberHashh): columns CustomerID, FirstName, HashFN; row 1, EFG, bzh7IbKC0CEqmrPLMY/ExQ==. I have to perform a join on the delta table…
0
votes
2 answers

Spark and Delta Lake [resolving dependencies (Maven and Spark repositories)]

Good evening, I will have to use Spark over S3, using Parquet as the file format and Delta Lake for "Data Management". The link between Spark and S3 has been solved, but when I try to use Delta Lake with Spark (using Python) I get this error…
BFR92
  • 23
  • 6
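
A minimal sketch of letting Spark resolve the Delta Lake artifact from Maven at session startup; the artifact coordinates must match your Spark/Scala build, and io.delta:delta-core_2.12:2.4.0 below is only an assumed example (Delta 3.x renamed the artifact to delta-spark):

    from pyspark.sql import SparkSession

    # spark.jars.packages pulls the Delta Lake jar from Maven at startup.
    spark = (
        SparkSession.builder.appName("s3-delta")
        .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")  # assumed version
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )
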
0
votes
1 answer

Table not updating using MERGE INTO

I have a dataframe which is converted to a Spark df in Azure Databricks, then a temporary view: spark_df = spark.createDataFrame(df) spark_df.createOrReplaceTempView("myTemp") I use the following to insert columns to an existing table: spark.sql( …
pymat
  • 1,090
  • 1
  • 23
  • 45
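
A minimal sketch of merging a temporary view into an existing Delta table with spark.sql; the target table my_table and the key column id are hypothetical, and note that MERGE only touches rows matched by the ON condition plus a WHEN clause:

    # Hypothetical source data registered as a temporary view.
    spark_df = spark.createDataFrame([(1, "new"), (2, "rows")], ["id", "value"])
    spark_df.createOrReplaceTempView("myTemp")

    # Merge the view into the existing Delta table by key.
    spark.sql("""
        MERGE INTO my_table AS t           -- hypothetical target table
        USING myTemp AS s
        ON t.id = s.id                     -- hypothetical join key
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)
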
0
votes
0 answers

What is the difference between running pyspark code directly with python and with spark-shell?

I have code that uses data from a postgres database and saves it in a delta lake: import pyspark from delta import * import time start_time = time.time() builder = (pyspark.sql.SparkSession.builder.appName("MyApp") …
Tavakoli
  • 1,303
  • 3
  • 18
  • 36
0
votes
0 answers

How to get rid of 'Unable to load native-hadoop library for your platform' with local Spark and Delta Lake

I checked this question and tried every single answer except rebuilding Hadoop (which is failing with endless errors so far). My hope is that the binaries from the official Hadoop distribution will do, but I can't make it work. Dockerfile: FROM…
greatvovan
  • 2,439
  • 23
  • 43
0
votes
1 answer

Does Polars support predicate pushdown with Delta Lake?

I'm trying to load a Delta Lake table in Python in a memory-bound environment (a k8s Pod with a memory limit) with Polars, and I am getting an OOM exception when trying to do a scan_delta(...).head().collect(). I am unable to determine if predicate pushdown…
martin8768
  • 565
  • 5
  • 20
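
A minimal sketch of keeping the query lazy so Polars can push the filter and row limit down into the Delta scan instead of materializing the whole table first; the path and column name are hypothetical, and how much memory this saves still depends on the table's file layout:

    import polars as pl

    # Lazily scan the Delta table; nothing is read until .collect().
    lf = pl.scan_delta("/data/delta/my_table")           # hypothetical path

    # The filter and head() stay in the lazy plan, so they can be pushed
    # down into the scan rather than loading the full table.
    result = (
        lf.filter(pl.col("event_date") == "2023-01-01")  # hypothetical column
          .head(10)
          .collect()
    )
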
0
votes
0 answers

Delta table not updating the data in the delta logs, only the partition files

We are using a delta table for state tracking and we have around 56 partitions. The issue is that when we insert/update data in the delta table, it is updated in the partition files but perhaps not in the delta logs, because when we try to read the…