Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It provides ACID transactions, scalable metadata handling, time travel, a unified batch and streaming source and sink, and full compatibility with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture (see the Python sketch after this list).
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
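
As an illustration of the "Time Travel" and "Updates and Deletes" features above, here is a minimal PySpark sketch; the table path, the key column id and the updates DataFrame are hypothetical, not taken from the documentation above.

    from delta.tables import DeltaTable

    path = "/tmp/delta/events"                      # hypothetical table location
    tbl = DeltaTable.forPath(spark, path)

    # Updates and Deletes: merge an updates DataFrame into the table by key.
    (tbl.alias("t")
        .merge(updates.alias("s"), "t.id = s.id")   # `updates` and `id` are assumptions
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    # Time Travel: read the table as it existed at an earlier version.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)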
1226 questions
0
votes
0 answers

Great Expectations bad performance on PySpark DataFrame

We want to integrate data quality checks into our ETL pipelines and tried this with Great Expectations. All our ETL is in PySpark. For small datasets this works fine, but for larger ones the performance of Great Expectations is really poor. On a…
gamezone25
  • 288
  • 2
  • 10
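
Not from the question, but a minimal sketch of how Great Expectations is commonly wired to a PySpark DataFrame, assuming the legacy SparkDFDataset API (removed in newer releases) and hypothetical column names; each expectation runs Spark work, so caching the input and keeping the suite short usually helps.

    from great_expectations.dataset import SparkDFDataset  # legacy API, assumed available

    spark_df.cache()                 # expectations trigger Spark jobs; cache if several checks reuse the data
    ge_df = SparkDFDataset(spark_df)

    ge_df.expect_column_values_to_not_be_null("customer_id")   # hypothetical columns
    ge_df.expect_column_values_to_be_between("amount", min_value=0, max_value=1_000_000)

    results = ge_df.validate()
    print(results.success)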
0
votes
1 answer

How to create delta tables in ADLS Gen2 using local Spark without Databricks

I'm trying to read CSV files from ADLS Gen2 (Microsoft Azure) and create delta tables. I'm able to successfully initialize a SparkSession and read the CSV files via Spark locally. But when I try to write the DataFrame as a delta table I'm…
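
For reference, a sketch of the session configuration typically needed to write Delta to ADLS Gen2 from a local Spark install; the package versions, account name, key and paths are assumptions and must match your Spark version.

    from pyspark.sql import SparkSession

    # Versions are assumptions; they must line up with your local Spark (here Spark 3.4.x / Scala 2.12).
    packages = ",".join([
        "io.delta:delta-core_2.12:2.4.0",
        "org.apache.hadoop:hadoop-azure:3.3.4",
    ])

    spark = (
        SparkSession.builder.appName("local-delta-adls")
        .config("spark.jars.packages", packages)
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        # Storage-account-key auth; <account> and <key> are placeholders.
        .config("spark.hadoop.fs.azure.account.key.<account>.dfs.core.windows.net", "<key>")
        .getOrCreate()
    )

    base = "abfss://<container>@<account>.dfs.core.windows.net"
    df = spark.read.option("header", "true").csv(f"{base}/raw/")
    df.write.format("delta").mode("overwrite").save(f"{base}/delta/my_table")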
0
votes
2 answers

Delta table partition folder name is getting changed

I am facing an issue where the expected date partition folder should be named in the format date=yyyymmdd, but it is instead being written as - Sometimes, for each parquet file created in the delta path, a separate folder is created. I don't see any issues with the…
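
The excerpt is truncated, but for comparison, a sketch of the usual way to get stable date=yyyymmdd partition folders: derive the partition column as a yyyyMMdd-formatted string before the write. The source column event_ts and the output path are hypothetical.

    from pyspark.sql import functions as F

    out = df.withColumn("date", F.date_format(F.col("event_ts"), "yyyyMMdd"))  # event_ts is hypothetical

    (out.write
        .format("delta")
        .mode("append")
        .partitionBy("date")          # folders come out as date=20240101, date=20240102, ...
        .save("/mnt/delta/events"))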
0
votes
0 answers

Delta Lake: File ordering with `coalesce(1)` and `partitionBy()`

Consider the example below: _schema = ['num_col', 'word'] _data = [ (1, 'idA'), (2, 'idA'), (3, 'idB'), (4, 'idC'), (5, 'idC'), (1, 'idC'), (2, 'idC'), ] df = spark.createDataFrame(_data, _schema) out_path = '/tmp/output.delta' _ = ( …
boyangeor
  • 381
  • 3
  • 6
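
Not an answer to the (truncated) question, just a sketch of one common pattern for producing a single, row-ordered file per partition: sort within the task by the partition column first and the ordering column second, before the partitioned write. out_path and the column names follow the excerpt.

    (
        df.coalesce(1)                               # one task, so at most one file per output partition
          .sortWithinPartitions("word", "num_col")   # partition column first, then the desired row order
          .write
          .format("delta")
          .mode("overwrite")
          .partitionBy("word")
          .save(out_path)
    )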
0
votes
0 answers

PySpark Job on Dataproc Throws IOException but Still Completes Successfully

I'm running a PySpark job on Google Cloud Dataproc, which uses structured streaming with a trigger of 'once'. The job reads Parquet data from a raw layer (a GCS bucket), applies certain business rules, and then writes the data in Delta format to a…
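
For context, a minimal sketch of a trigger-once structured streaming job of this shape; the bucket names, the schema and the business-rule function are all assumptions.

    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    raw_schema = StructType([                        # hypothetical schema; file streams need one up front
        StructField("id", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    def apply_business_rules(df):                    # stand-in for the real business rules
        return df.dropDuplicates(["id"])

    (
        spark.readStream
             .schema(raw_schema)
             .parquet("gs://<raw-bucket>/events/")
             .transform(apply_business_rules)
             .writeStream
             .format("delta")
             .option("checkpointLocation", "gs://<curated-bucket>/_checkpoints/events")
             .trigger(once=True)                     # or trigger(availableNow=True) on Spark 3.3+
             .start("gs://<curated-bucket>/delta/events")
             .awaitTermination()
    )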
0
votes
0 answers

How to append to delta table using Rust?

I have this Python code that executes 3 insert transactions into a delta table import pandas as pd from deltalake.writer import write_deltalake from deltalake import DeltaTable if __name__ == '__main__': # First transaction id_list = [] …
Finlay Weber
  • 2,989
  • 3
  • 17
  • 37
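
The excerpt's Python code is cut off; for reference, a sketch of the append pattern it appears to be using with the deltalake Python package (table path and columns are hypothetical). Each write with mode="append" commits a separate transaction, which is the behavior to reproduce with the Rust deltalake crate.

    import pandas as pd
    from deltalake import DeltaTable
    from deltalake.writer import write_deltalake

    path = "/tmp/delta/people"                           # hypothetical table location

    # Each call with mode="append" commits one new transaction to the table.
    write_deltalake(path, pd.DataFrame({"id": [1, 2]}), mode="append")
    write_deltalake(path, pd.DataFrame({"id": [3, 4]}), mode="append")
    write_deltalake(path, pd.DataFrame({"id": [5, 6]}), mode="append")

    print(DeltaTable(path).version())                    # 2: three commits, versions 0..2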
0
votes
1 answer

Different number of partitions after spark.read & filter depending on Databricks runtime

I have parquet files saved in the following delta lake format: data/ ├─ _delta_log/ ├─ Year=2018/ │ ├─ Month=1/ │ │ ├─ Day=1/ │ │ │ ├─ part-01797-bd37cd91-cea6-421c-b9bb-0578796bc909.c000.snappy.parquet │ │ ├─ ... │ │ ├─ Day=31/ │ │ │ …
Oliver Angelil
  • 1,099
  • 15
  • 31
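
Unrelated to any specific runtime, a sketch of how to inspect the partition count and the settings that usually drive it; the mount path is hypothetical and the filter follows the excerpt's layout.

    df = (spark.read.format("delta")
          .load("/mnt/data")                       # hypothetical mount of the table above
          .filter("Year = 2018 AND Month = 1"))

    print(df.rdd.getNumPartitions())

    # Settings that commonly change the split between runtimes:
    print(spark.conf.get("spark.sql.files.maxPartitionBytes"))
    print(spark.conf.get("spark.sql.adaptive.enabled"))
    print(spark.conf.get("spark.sql.adaptive.coalescePartitions.enabled"))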
0
votes
1 answer

How to refresh a delta table after manually replacing an underlying parquet file with another file of the same name?

I have one table in Databricks. Let's call it 'tableA'. This table was not created by me. To find where its files are stored, I checked the storage location for that table and found it to be Azure Blob Storage. When I checked that particular directory I…
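
Delta tracks its data files in the _delta_log, so swapping a parquet file in place bypasses the transaction log and is generally unsupported; still, a sketch of the cache-refresh commands that make Spark re-read the current files (table name taken from the question).

    # Clear cached data and metadata so the next read goes back to storage.
    spark.catalog.clearCache()
    spark.sql("REFRESH TABLE tableA")

    df = spark.table("tableA")   # re-reads the files listed in the _delta_log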
0
votes
1 answer

How can I apply incremental data loading into an Azure Databricks table from Azure ADLS Gen2

We have Azure Synapse Link for Dataverse, which enables continuous export of data from Dataverse to Azure Data Lake Storage Gen2 (CSV format). The main source of data is D365 CRM. All the files contain the columns SinkModifiedOn and IsDelete. Current…
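
One common shape for this kind of load is a watermark filter plus a Delta merge. A sketch assuming a hypothetical key column Id, a hypothetical target table bronze.account, placeholder storage paths, and a last_watermark value persisted from the previous run; only SinkModifiedOn and IsDelete come from the question.

    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    last_watermark = "2024-01-01T00:00:00Z"   # example value; persist the real one between runs

    updates = (spark.read.option("header", "true")
               .csv("abfss://<container>@<account>.dfs.core.windows.net/dataverse/account/")
               .filter(F.col("SinkModifiedOn") > last_watermark))

    target = DeltaTable.forName(spark, "bronze.account")        # hypothetical target table

    (target.alias("t")
           .merge(updates.alias("s"), "t.Id = s.Id")            # Id is an assumed key column
           .whenMatchedDelete(condition="s.IsDelete = 'True'")  # honor soft deletes from the export
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())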
0
votes
0 answers

Partition sizes with high cardinality column (Timestamp) with Ingestion Time Clustering

I am appending rows to a delta table once per day: (df .write .mode("append") .format("delta") .save("/mytable") ) On each append, a new partition file is created. The problem is that each partition is only around 1mb, so as this table grows, I…
Oliver Angelil
  • 1,099
  • 15
  • 31
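
Not specific to the question's table, but a sketch of the usual remedies for many small files from daily appends: periodic OPTIMIZE, or auto-compaction table properties. The path follows the excerpt; the delta.autoOptimize properties assume a Databricks runtime.

    # Compact existing small files into larger ones.
    spark.sql("OPTIMIZE delta.`/mytable`")

    # Ask future writes to be compacted automatically (Databricks).
    spark.sql("""
        ALTER TABLE delta.`/mytable` SET TBLPROPERTIES (
            'delta.autoOptimize.optimizeWrite' = 'true',
            'delta.autoOptimize.autoCompact'   = 'true'
        )
    """)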
0
votes
0 answers

How to set "Retention Period" + "Vacuum" for Delta Tables in Azure Data Factory refreshed by a CDC Pipeline... without using Data Bricks

Key issue: CDC (preview) in ADF has no "Vacuum" or "Retention Period" setting to delete outdated parquet versions or trim the Delta logs for Delta tables. I am using Azure Data Factory's Change Data Capture feature (currently in Preview) to incrementally…
ashap551
  • 23
  • 1
  • 6
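
One Databricks-free option is the standalone deltalake Python package, which can run vacuum from e.g. an Azure Function or a scheduled job. A sketch only: the table URI, credentials and the storage_options key names are assumptions.

    from deltalake import DeltaTable

    dt = DeltaTable(
        "abfss://<container>@<account>.dfs.core.windows.net/cdc/mytable",
        storage_options={"account_name": "<account>", "account_key": "<key>"},
    )

    # Delete data files no longer referenced by table versions newer than the retention window.
    dt.vacuum(retention_hours=168, dry_run=False, enforce_retention_duration=True)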
0
votes
0 answers

Reuse auto loader in different storages

I have two storage accounts on Azure: an old storage and a new storage. Some data in the old storage is ingested by Auto Loader and works well. But now I'm moving the data from the old storage to the new storage, including the Auto Loader with checkpoints, etc., but…
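
For reference, a sketch of pointing an Auto Loader stream at the new storage account; because a checkpoint records the old source path, a fresh checkpoint and schema location are typically used for the new account. All paths and the file format are placeholders.

    (
        spark.readStream.format("cloudFiles")
             .option("cloudFiles.format", "json")                   # format of the landed files is an assumption
             .option("cloudFiles.schemaLocation",
                     "abfss://meta@<newaccount>.dfs.core.windows.net/schemas/events")
             .load("abfss://landing@<newaccount>.dfs.core.windows.net/events/")
             .writeStream
             .option("checkpointLocation",
                     "abfss://meta@<newaccount>.dfs.core.windows.net/checkpoints/events")
             .trigger(availableNow=True)
             .start("abfss://curated@<newaccount>.dfs.core.windows.net/delta/events")
    )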
0
votes
1 answer

How to read multiple delta parquet files in an incremental manner

My requirement is to read delta parquet files, which are multiple files in a single folder; we are actually trying to read all the files under a delta table folder in an ADLS location. The requirement is that when we load data for the first time, we have to read all…
Developer KE
  • 71
  • 1
  • 2
  • 14
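
One way to get "full load first, then only new data" from a Delta folder is a streaming read whose checkpoint remembers progress between runs. A sketch with placeholder paths.

    table_path = "abfss://<container>@<account>.dfs.core.windows.net/delta/my_table"

    # The first run reads everything; later runs only pick up files added since the checkpointed offset.
    (
        spark.readStream.format("delta")
             .load(table_path)
             .writeStream
             .format("delta")
             .option("checkpointLocation", "/mnt/checkpoints/my_table_incremental")
             .trigger(availableNow=True)
             .start("/mnt/curated/my_table")
    )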
0
votes
0 answers

How to support adding new columns to a Dataset case class while reading an old delta file without the new column?

I have an existing delta file with 4 columns in the schema, which I was converting into a Dataset at runtime. Case classes: case class MyObj2(x: Int) case class MyObj1(p: MyObj3, q: MyObj3) case class MyCaseClass(a: Int, b: MyObj1, c: Int, d: MyObj2) Now I…
P Mittal
  • 172
  • 6
0
votes
0 answers

Deltalake merge Pyspark - update only

I have a target table that has 100 rows and an incremental table that has 20 rows (updates). When performing a merge with whenMatchedUpdate using PySpark, 20 rows in the target table are updated and the remaining 80 rows are updated to null values, not…
Joe
  • 47
  • 7
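
For comparison, a minimal update-only merge sketch; the target path, the id/value columns and the increment_df DataFrame (the 20 update rows) are assumptions. With no whenNotMatched clause, unmatched target rows are left untouched rather than nulled.

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/mnt/delta/target")        # hypothetical target path

    (
        target.alias("t")
              .merge(increment_df.alias("s"), "t.id = s.id")       # increment_df holds the 20 updates
              .whenMatchedUpdate(set={"value": "s.value"})         # update only the columns you intend to change
              .execute()
    )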