Questions tagged [delta-lake]

Delta Lake is an open source storage layer that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel (data versioning), a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and, due to the lack of transactions, data engineers have to go through a tedious process to ensure data integrity. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
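
The bullets above correspond to a handful of DataFrame calls. As a rough, unofficial sketch of ACID writes, versioned commits, and time travel in PySpark (assuming the delta-spark pip package is installed and using a hypothetical /tmp/delta/events path):

    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    # Local session with the Delta Lake SQL extension and catalog enabled.
    builder = (
        SparkSession.builder.appName("delta-quickstart")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    path = "/tmp/delta/events"  # hypothetical location

    # Each write is an ACID transaction that produces a new table version.
    spark.range(0, 5).write.format("delta").mode("overwrite").save(path)
    spark.range(5, 10).write.format("delta").mode("append").save(path)

    # Current snapshot.
    spark.read.format("delta").load(path).show()

    # Time travel: read the table as of the first version.
    spark.read.format("delta").option("versionAsOf", 0).load(path).show()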
1226 questions
0 votes, 0 answers

How to create Athena Tables of Delta files from AWS S3 using Terraform

I need to create Athena tables over Delta-format files in an AWS S3 data source using Terraform. The Terraform creates the Athena tables and I can see the column names, but queries return 0 records even though there are Delta files in my location. Anything am…
0 votes, 0 answers

How to manually check versions to be processed in source table?

I use structured streaming and Delta to keep two tables (A -> B) in sync. Rather than a continuous streaming job, I use the trigger AvailableNow to run the update once a day only. B has a checkpoint tracking the progress from A. When starting a…
pgrandjean • 676 • 1 • 9 • 19
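
Not an answer to the question above, but a sketch of one way to inspect both sides manually: Delta's table history lists the committed versions of A, and the streaming checkpoint of B records how far the query has read. Paths are hypothetical, and the layout of the checkpoint offset files is an internal detail that can change between Delta releases.

    import json
    import os

    from delta.tables import DeltaTable

    # Committed versions of the source table A (assumes an existing SparkSession `spark`).
    DeltaTable.forPath(spark, "/data/A").history() \
        .select("version", "timestamp", "operation").show(truncate=False)

    # Last offset recorded by B's streaming checkpoint (assumes a locally
    # accessible checkpoint; field names such as reservoirVersion belong to an
    # internal, undocumented format).
    offsets_dir = "/chk/B/offsets"
    latest_batch = max(int(f) for f in os.listdir(offsets_dir) if f.isdigit())
    with open(os.path.join(offsets_dir, str(latest_batch))) as fh:
        offset_lines = [line for line in fh if line.lstrip().startswith("{")]
    print(json.loads(offset_lines[-1]))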
0 votes, 0 answers

Create a delta table in DBFS

There are two sources of data and there is no way to connect them: an AWS account under a different subscription (one bucket with two different folders, X and Y), and Databricks under a different subscription ID (with one table there). I…
slinger • 1 • 2
0 votes, 0 answers

Delta Lake error SparkFileNotFoundException when looking for _delta_logs in local docker container

I'm running Delta Lake on my local host. I set up a spark-master, two spark-workers, and one spark-driver Docker container. Inside the spark-driver container is where I run spark-submit, pointing to spark://spark-master:7077. The spark-driver…
prime90 • 889 • 2 • 14 • 26
0 votes, 0 answers

Spark Delta schema mismatch after save

The following code is failing due to a schema mismatch, specifically a nullability mismatch. Is this expected behavior in Spark 3.2.2 with Delta 1.2.1? Can this code be updated so that the schema can be enforced? import…
Aravind Yarram • 78,777 • 46 • 231 • 327
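
Without the full code from the question it is hard to say, but nullability mismatches usually come from DataFrames whose columns are inferred as nullable while the target Delta table declares them non-nullable. A small sketch of one common workaround, rebuilding the DataFrame against the exact target schema (schema and path are hypothetical):

    from pyspark.sql.types import StructType, StructField, LongType, StringType

    # Target Delta schema with a non-nullable column (hypothetical).
    target_schema = StructType([
        StructField("id", LongType(), nullable=False),
        StructField("name", StringType(), nullable=True),
    ])

    # Columns created this way are typically inferred as nullable.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])

    # Rebuild the DataFrame with the exact schema before writing.
    aligned = spark.createDataFrame(df.rdd, target_schema)
    aligned.write.format("delta").mode("append").save("/tmp/delta/nullability_demo")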
0 votes, 1 answer

Delta Lake - Building Data Catalog

I'm new to Delta Lake and am considering using it for a project with S3 or GCS as file storage. I would like to understand how data cataloging works. Does open source Delta Lake automatically create and maintain a data catalog…
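
For context on the question above: open source Delta Lake does not build a catalog on its own; tables are either queried directly by path or registered in an external metastore (Hive metastore, AWS Glue, and so on). A minimal sketch of registering a path-based table, with a hypothetical bucket and table name:

    # Register an existing Delta path in the session's metastore catalog.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS events
        USING DELTA
        LOCATION 's3a://my-bucket/delta/events'
    """)

    # Once registered, the table is queryable by name.
    spark.sql("SELECT COUNT(*) FROM events").show()

    # Querying by path needs no catalog at all.
    spark.read.format("delta").load("s3a://my-bucket/delta/events").show()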
0 votes, 0 answers

delta lake partitioning generates wrong values for folder names

I am trying to partition the dataframe using a column GROUP. I have code like this: df = DataFrame( data={ 'NAME': ['John', 'Max', 'Henry'], 'GROUP': ['A', 'B', 'A'] }, schema={ 'NAME': pl.Utf8(), …
lapots • 12,553 • 32 • 121 • 242
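
The excerpt appears to use polars. As an unofficial sketch, writing the frame through polars' write_delta with a partition_by option (forwarded to delta-rs) would normally produce GROUP=A/ and GROUP=B/ folders; the target path is hypothetical:

    import polars as pl

    df = pl.DataFrame(
        {
            "NAME": ["John", "Max", "Henry"],
            "GROUP": ["A", "B", "A"],
        }
    )

    # Options in delta_write_options are passed to delta-rs' write_deltalake;
    # partitioning by GROUP should create folders such as GROUP=A and GROUP=B.
    df.write_delta(
        "/tmp/delta/people",
        mode="overwrite",
        delta_write_options={"partition_by": ["GROUP"]},
    )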
0 votes, 0 answers

Error when trying to generate a Delta table from a Parquet file using delta-rs library

I'm attempting to write a Delta table without employing Spark, and I've chosen to use the delta-rs library. I've encountered an issue when trying to generate a Delta table using a Parquet file. Here is the error message I get: thread 'main' panicked…
Evandro Lippert • 336 • 2 • 11
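
The error itself is truncated above, so this is not a diagnosis, but for reference, a minimal sketch of converting a Parquet file into a Delta table with the Python bindings of delta-rs (the question appears to use the Rust crate); paths are hypothetical:

    import pyarrow.parquet as pq
    from deltalake import write_deltalake

    # Read the source Parquet file into an Arrow table.
    table = pq.read_table("/data/input.parquet")

    # Write it as a new Delta table; delta-rs creates the _delta_log directory.
    write_deltalake("/data/delta/output", table, mode="error")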
0 votes, 1 answer

failed - pip install delta-lake-reader[azure]

We are building an API with the FastAPI framework and need to access Delta Lake storage from Python. I am following https://pypi.org/project/delta-lake-reader/ to install the package, but it is failing with the error below. I am using…
0 votes, 1 answer

schema mismatch error in databricks while reading file from storage account

I have the script below, which I run in my Unity Catalog-enabled Databricks workspace, and I get the error below. The schema and code worked for my other tenant in a different workspace, and I was hoping it would be the same for this tenant. Now I don't have time to…
0 votes, 0 answers

Create an empty delta lake table in databricks with schema available

I have the schema defined. How can I create an empty Delta Lake table in Azure Databricks? I see this command with no schema defined, but I get an error even though I have all the necessary permissions. I just created the schema. Can someone get me the…
ZZZSharePoint • 1,163 • 1 • 19 • 54
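
As a sketch of one way to do this (not necessarily what the asker's command looked like), the DeltaTable builder API creates an empty Delta table from an explicit schema; the schema and DBFS location below are hypothetical:

    from delta.tables import DeltaTable
    from pyspark.sql.types import StructType, StructField, LongType, StringType

    schema = StructType([
        StructField("id", LongType(), True),
        StructField("name", StringType(), True),
    ])

    # Creates an empty Delta table at the given location with the given columns.
    (
        DeltaTable.createIfNotExists(spark)
        .location("dbfs:/tmp/delta/empty_table")
        .addColumns(schema)
        .execute()
    )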
0 votes, 1 answer

Change Data Feed not working in delta live tables

I'm trying to see Change Data Feed changes after applying ALTER TABLE ... SET delta.enableChangeDataFeed = true. The property was enabled, but I couldn't see any changes; I'm getting an error that there are no changes. I have set the property to true but I…
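
One detail that often explains this: the change data feed only captures commits made after the property is enabled, so reading from an earlier version (or before any new writes) reports no changes. A sketch against a plain Delta table (table name, columns, and versions are hypothetical; Delta Live Tables adds its own layer on top):

    # Enable the change data feed on an existing table.
    spark.sql(
        "ALTER TABLE my_table SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
    )

    # Only commits made after enablement are captured, so write something new.
    spark.sql("INSERT INTO my_table VALUES (1, 'a')")

    # Read the captured changes from a version at or after enablement.
    (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", 1)
        .table("my_table")
        .show()
    )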
0 votes, 0 answers

expire S3 objects after deletion from Delta Lake without breaking metadata

We collect raw data from various data delivery streams in S3, in Delta format. We chose Delta mainly because we want an easy way to compact the many small objects into bigger S3 objects that can later be processed more (cost-)efficiently. We want…
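
The usual tool for this is Delta's own VACUUM rather than an S3 lifecycle rule, since lifecycle expiry can delete objects the transaction log still references. A sketch with a hypothetical path and retention window:

    from delta.tables import DeltaTable

    dt = DeltaTable.forPath(spark, "s3a://my-bucket/raw/stream_x")

    # Deletes data files that are no longer referenced by the log and are older
    # than the retention window (in hours), keeping the metadata consistent.
    dt.vacuum(retentionHours=7 * 24)

    # Optionally shorten how long removed files must be retained before VACUUM
    # may delete them (shown only as an example value).
    spark.sql("""
        ALTER TABLE delta.`s3a://my-bucket/raw/stream_x`
        SET TBLPROPERTIES (delta.deletedFileRetentionDuration = 'interval 7 days')
    """)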
0 votes, 0 answers

Avoid writing to file after each update in DeltaTable

I am performing transformations on a Delta table using PySpark and the DeltaTable library. The code: from delta import * delta_tb = DeltaTable.forPath(spark, 'path/to/table') delta_tb.update(set = {'ma': 'round(ma, 5)'}) to round the ma column to…
razumichin • 84 • 1 • 3 • 6
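
For context on the question above: each DeltaTable.update() call is its own transaction and rewrites the affected files, so folding all column changes into a single call (or into one read-transform-overwrite) keeps it to one rewrite. A sketch reusing the question's path, with a hypothetical extra column:

    from delta.tables import DeltaTable
    from pyspark.sql.functions import col, round as sql_round

    delta_tb = DeltaTable.forPath(spark, "path/to/table")

    # One update() call, one transaction, one rewrite of the affected files.
    delta_tb.update(
        set={
            "ma": "round(ma, 5)",
            "mb": "round(mb, 5)",  # hypothetical second column
        }
    )

    # Alternative: a single read -> transform -> overwrite also yields one commit.
    df = spark.read.format("delta").load("path/to/table")
    df.withColumn("ma", sql_round(col("ma"), 5)) \
        .write.format("delta").mode("overwrite").save("path/to/table")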
0 votes, 2 answers

How do I apply a filter on a map type column in a Pyarrow table while loading?

I have a file written in the Deltalake/Parquet format which has a map as one of the columns. The map stores various properties of the row entry in a "property_name": "property_value" format. I'd like to filter on a particular property stored in this…
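
Pushing a predicate on a map key down into the scan is not straightforward in PyArrow, so a common fallback is to load the data and filter afterwards. A sketch of such a post-load filter, with hypothetical paths and column names; note that reading a Delta table's Parquet files directly bypasses the _delta_log, so it only reflects the current table state if no files have been logically removed:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Load the underlying Parquet data.
    table = pq.read_table("/data/mytable/part-00000.parquet")

    # Keep rows where the map column "properties" has the entry color == red.
    # MapArray rows convert to lists of (key, value) tuples via to_pylist().
    mask = [
        any(k == "color" and v == "red" for k, v in (entries or []))
        for entries in table.column("properties").to_pylist()
    ]
    filtered = table.filter(pa.array(mask, type=pa.bool_()))
    print(filtered.num_rows)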