Questions tagged [delta-lake]

Delta Lake is an open source storage layer that adds ACID guarantees on top of Apache Spark. It provides ACID transactions, scalable metadata handling, time travel, a unified batch and streaming source and sink, and full compatibility with the Apache Spark APIs. (A short example follows the feature list below.)

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and, in the absence of transactions, data engineers have to go through a tedious process to ensure data integrity. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data, enabling developers to access and revert to earlier versions of data for audits, rollbacks, or reproducing experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake's transaction log records details about every change made to data, providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal changes, as it is fully compatible with Spark, the commonly used big data processing engine.
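To make the Time Travel and Updates and Deletes bullets above concrete, here is a minimal PySpark sketch. The table path and column names are invented for the example, and it assumes the delta-spark package is installed and the session is configured with the Delta extensions:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    # Assumed setup: a Spark session with the Delta Lake extensions enabled.
    spark = (SparkSession.builder
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    path = "/tmp/delta/events"  # placeholder path
    spark.range(5).toDF("key").write.format("delta").mode("overwrite").save(path)

    # Updates and Deletes: upsert rows with an ACID MERGE.
    updates = spark.range(3, 8).toDF("key")
    (DeltaTable.forPath(spark, path).alias("t")
     .merge(updates.alias("s"), "t.key = s.key")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

    # Time Travel: read the table as of an earlier version.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)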
1226 questions
0
votes
0 answers

Unable to view checkpoint parquet files after operation in DBFS using Databricks Runtime Version 11.0 (includes Apache Spark 3.3.0, Scala 2.12)

I'm facing an issue with viewing checkpoint files in DBFS (Databricks File System) within my Delta Lake environment over S3. According to my understanding, a checkpoint is supposed to be created for every 10th file, but I'm unable to see any…
Ash3060
  • 188
  • 2
  • 15
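A first diagnostic step for this kind of question is to list the _delta_log directory itself: checkpoints are written there as *.checkpoint.parquet files, and by default one is created every 10 commits, not every 10 data files. A hedged sketch with a placeholder table path (dbutils is available only on Databricks):

    # List the transaction log; checkpoint files end in .checkpoint.parquet.
    for f in dbutils.fs.ls("/mnt/my_table/_delta_log/"):  # placeholder path
        print(f.path)

    # The checkpoint frequency is a table property (default 10 commits):
    spark.sql("""
        ALTER TABLE delta.`/mnt/my_table`
        SET TBLPROPERTIES ('delta.checkpointInterval' = '10')
    """)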
0
votes
0 answers

How to clear Delta Lake Version History on Databricks

After issuing the command describe history '/mnt/lake/BASE/SQLClassification/cdcTest/dbo/cdcmergetest/1' I see I have 28 versions. I would like to clear the version history. I want to carry out a few tests on my table from scratch, but I don't know how to…
Patterson
  • 1,927
  • 1
  • 19
  • 56
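There is no dedicated "clear history" command; the usual workaround is to shorten the retention window and VACUUM, which deletes the files that old versions depend on. A hedged sketch using the path from the question (the retention-check override is deliberately guarded by Delta and should be used with care):

    # WARNING: this permanently removes the files that time travel needs.
    spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

    path = "/mnt/lake/BASE/SQLClassification/cdcTest/dbo/cdcmergetest/1"
    spark.sql(f"VACUUM delta.`{path}` RETAIN 0 HOURS")

    # For a truly fresh start, dropping and re-creating the table is often simpler.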
0
votes
1 answer

How to access the Earliest Version Number of a Delta table as an Integer?

I received help from @Jacek Laskowski on how to access the latest version of a Delta table as an integer here. The code that he recommended was as follows: from delta.tables import DeltaTable import pyspark.sql.functions dt =…
Patterson
  • 1,927
  • 1
  • 19
  • 56
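A common way to get the earliest still-available version as a plain Python int is to aggregate over DeltaTable.history(); a minimal sketch with a placeholder path:

    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    dt = DeltaTable.forPath(spark, "/tmp/delta/events")  # placeholder path
    earliest_version = int(dt.history().agg(F.min("version")).collect()[0][0])
    print(earliest_version)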
0
votes
0 answers

Writing old dates to a Delta table using PySpark throws an error, even when using the recommended datetimeRebaseModeInWrite configuration

Writing to a Delta table in Python - error writing very old dates. I am trying to write some updates to a Delta table on S3. For this I am using the Python delta-spark package. When trying to run a merge statement, the job crashes. The error is due to…
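Whether it fixes this particular crash depends on the full stack trace, but the rebase settings come in two Parquet variants, and for Delta writes the INT96 one is often the missing piece. A sketch of the session configs usually tried for this error:

    # Write legacy (pre-Gregorian-cutover) dates/timestamps without rebase errors.
    spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "CORRECTED")
    spark.conf.set("spark.sql.parquet.int96RebaseModeInWrite", "CORRECTED")
    # Matching read-side settings, in case old files are read back:
    spark.conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")
    spark.conf.set("spark.sql.parquet.int96RebaseModeInRead", "CORRECTED")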
0
votes
0 answers

How to handle a source table re-created from scratch?

I have two Delta tables, one source and one destination, and I batch-stream (using Trigger.AvailableNow()) from source to destination. When the source table is overwritten, the next run fails because the destination table does not recognize the…
pgrandjean
  • 676
  • 1
  • 9
  • 19
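When a streaming source is overwritten, Delta fails the stream rather than silently reprocessing; the usual options are to tolerate such commits on the read side or to restart from a fresh checkpoint. A hedged sketch (option names depend on the Delta version: skipChangeCommits is the newer spelling, ignoreChanges the older one, and a full overwrite may also need ignoreDeletes):

    stream = (spark.readStream.format("delta")
              .option("skipChangeCommits", "true")  # or "ignoreChanges" on older Delta
              .load("/tmp/delta/source"))           # placeholder path

    (stream.writeStream.format("delta")
     .option("checkpointLocation", "/tmp/chk/dest")  # placeholder path
     .trigger(availableNow=True)
     .start("/tmp/delta/dest"))                      # placeholder path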
0
votes
0 answers

How can I convert a DataStream of Avro SpecificRecord to Flink's RowData in Apache Flink 1.17.0?

Using Scala 2.12.17, I am trying to convert an Apache Flink 1.17.0 DataStream of Avro SpecificRecord (org.apache.avro.specific.SpecificRecord) to a DataStream of Flink's RowData (org.apache.flink.table.data.RowData) so that I can write the avro…
Benjamin Andersen
  • 462
  • 1
  • 6
  • 19
0
votes
0 answers

Create local DeltaTable without writing to disk for testing purposes

I'm trying to write a unit test for a merge operation on a DeltaTable that I have. So far the approach has been to save the contents of a DataFrame to a local file so that it can then be read via DeltaTable.forPath(...) Seq(MyData(column_1 =…
edu
  • 428
  • 2
  • 10
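Delta's commit protocol is file-based, so DeltaTable.forPath needs a real filesystem; the usual compromise in unit tests is a throwaway temp directory rather than a purely in-memory table. A minimal pytest-style sketch (the spark fixture and column names are invented; the question's Scala code would use java.nio.file.Files.createTempDirectory the same way):

    from delta.tables import DeltaTable

    def test_merge(spark, tmp_path):  # tmp_path: pytest's built-in temp directory
        path = str(tmp_path / "my_data")
        spark.createDataFrame([(1, "a")], ["column_1", "column_2"]) \
             .write.format("delta").save(path)

        target = DeltaTable.forPath(spark, path)
        source = spark.createDataFrame([(1, "b"), (2, "c")], ["column_1", "column_2"])
        (target.alias("t")
         .merge(source.alias("s"), "t.column_1 = s.column_1")
         .whenMatchedUpdateAll()
         .whenNotMatchedInsertAll()
         .execute())

        assert target.toDF().count() == 2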
0
votes
1 answer

How to Create DELTA table in Athena

I've tried to create a DELTA table in AWS (Athena) but got an error. Reference: https://docs.aws.amazon.com/athena/latest/ug/delta-lake-tables.html#delta-lake-tables-getting-started CREATE EXTERNAL TABLE transformed_tables.test ( col1 …
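Per the AWS documentation linked in the question, Athena expects Delta tables to be declared with only a LOCATION and the table_type property; the column list is read from the Delta metadata, and including one is a common cause of errors. A hedged sketch submitting that DDL via boto3 (bucket paths are placeholders):

    import boto3

    ddl = """
    CREATE EXTERNAL TABLE transformed_tables.test
    LOCATION 's3://my-bucket/path/to/delta-table/'
    TBLPROPERTIES ('table_type' = 'DELTA')
    """  # placeholder S3 location

    boto3.client("athena").start_query_execution(
        QueryString=ddl,
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},  # placeholder
    )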
0
votes
0 answers

Best sequence for using the VACUUM, OPTIMIZE, FSCK REPAIR and REFRESH commands on Delta tables

I have a Delta table whose size will increase gradually; we currently have around 15 million rows. While running the VACUUM command on that table I am getting the error below. ERROR: Job aborted due to stage failure: Task 7 in stage 491.0 failed 4 times, most…
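On the ordering itself: OPTIMIZE first (it compacts small files and leaves the originals behind), then VACUUM to delete the unreferenced leftovers; FSCK REPAIR TABLE and REFRESH TABLE are only needed when files were removed outside Delta or cached metadata is stale. A minimal sketch with a placeholder table name (the truncated stage failure above is a resourcing problem that a different command order will not fix):

    table = "my_db.my_delta_table"  # placeholder

    spark.sql(f"OPTIMIZE {table}")                 # 1. compact small files
    spark.sql(f"VACUUM {table} RETAIN 168 HOURS")  # 2. drop unreferenced files (7-day default)
    spark.sql(f"FSCK REPAIR TABLE {table}")        # 3. only if files were deleted out-of-band
    spark.sql(f"REFRESH TABLE {table}")            # 4. only if cached metadata is stale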
0
votes
0 answers

Difference between ways to specify target path in Spark Structured Streaming?

In Spark Structured Streaming the target path of streaming write operations can be specified either by adding an .option('path', ) or as an argument to the .start() method. The latter seems to be preferred with Delta Lake, the…
Kai Roesner
  • 429
  • 3
  • 17
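The two spellings are equivalent: start(path) simply sets the same 'path' option before starting the query, so the choice is stylistic. A side-by-side sketch with placeholder paths (df stands for an arbitrary streaming DataFrame):

    out, chk = "/tmp/delta/out", "/tmp/delta/_chk"  # placeholder paths

    # Variant 1: path given as an option.
    (df.writeStream.format("delta")
       .option("checkpointLocation", chk)
       .option("path", out)
       .start())

    # Variant 2: path given to start(); same effect.
    (df.writeStream.format("delta")
       .option("checkpointLocation", chk)
       .start(out))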
0
votes
1 answer

Is there a standard way to get the data lake format from a parquet file? (e.g. Apache Iceberg, Apache Hudi, Delta Lake)

I am writing a parquet clean job using PyArrow. However, I only want to process native parquet files and skip over any .parquet files in Iceberg, Hudi, or Delta Lake format, because these formats require updates to be done through the…
Sam.E
  • 175
  • 2
  • 10
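There is no marker inside the .parquet files themselves; the usual heuristic is to look for the sibling metadata directory that each table format maintains (_delta_log for Delta, .hoodie for Hudi, metadata for Iceberg). A sketch of that heuristic for a local filesystem (object stores need the equivalent list call):

    import os

    def table_format(table_dir: str) -> str:
        """Best-effort detection from each format's metadata directory."""
        if os.path.isdir(os.path.join(table_dir, "_delta_log")):
            return "delta"
        if os.path.isdir(os.path.join(table_dir, ".hoodie")):
            return "hudi"
        if os.path.isdir(os.path.join(table_dir, "metadata")):
            return "iceberg"
        return "parquet"  # plain parquet, safe for the clean job to process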
0
votes
0 answers

Databricks merge into deletes from SOURCE

I am using the following query to upsert into a Databricks table: MERGE INTO my_target_table AS target USING (SELECT MAX(__my_timestamp) AS checkpoint FROM my_source_table) AS source ON target.name = 'some_name' AND target.address =…
Gilo
  • 640
  • 3
  • 23
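MERGE only ever modifies its target; the source is just read, so rows vanishing from my_source_table must come from elsewhere in the pipeline. For reference, a minimal upsert shape in which only the target changes (the question's query is truncated, so the checkpoint column here is illustrative):

    spark.sql("""
        MERGE INTO my_target_table AS target
        USING (SELECT MAX(__my_timestamp) AS checkpoint FROM my_source_table) AS source
        ON target.name = 'some_name'
        WHEN MATCHED THEN UPDATE SET target.checkpoint = source.checkpoint
        WHEN NOT MATCHED THEN INSERT (name, checkpoint) VALUES ('some_name', source.checkpoint)
    """)
    # my_source_table is only read here; MERGE cannot delete from it.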
0
votes
1 answer

Clearing unused tables from Delta Lake

I am trying to clean up tables (delta, views, parquet and external) from the delta lake. I am first trying to find out which username has accessed the tables. The describe history command only works for Delta tables. How can I find users who have created…
Saswat Ray
  • 141
  • 3
  • 14
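DESCRIBE HISTORY is indeed Delta-only, but where it applies, its userName column answers the "who touched this table" part; a minimal sketch with a placeholder table name (views, plain parquet and external tables need audit logs or storage access logs instead):

    history = spark.sql("DESCRIBE HISTORY my_db.my_delta_table")  # placeholder name
    (history
     .select("userName", "operation", "timestamp")
     .distinct()
     .show(truncate=False))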
0
votes
1 answer

MERGE INTO with a data lake on AWS Glue inserting rows instead of updating

I am trying to set up a demo Glue job that demonstrates upserts using a data lake framework. I have example full-load data that I have saved as a Delta table in an S3 bucket, defined as follows: data = {'visitor': ['foo', 'bar', 'baz'], 'id': [1, 2,…
huehue
  • 50
  • 5
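Inserts instead of updates almost always mean the merge condition never matches (wrong key column, mismatched types, or an empty target). A minimal keyed merge with the delta Python API, using the id column from the question's example data (paths are placeholders):

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "s3://my-bucket/full_load/")       # placeholder
    updates = spark.read.format("delta").load("s3://my-bucket/updates/")  # placeholder

    (target.alias("t")
     .merge(updates.alias("s"), "t.id = s.id")  # the key must match existing rows
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())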
0
votes
1 answer

Delta Lake on Spark: failure to initialize configuration while reading a table from Azure storage

Context: I am trying to read a Delta table stored on Azure from a local Spark cluster. The way I try to reach it is through Azure Data Lake Storage Gen2 (abfss://), not the legacy Blob Storage. Spark shell exploration: the final goal is a pyspark…
zar3bski
  • 2,773
  • 7
  • 25
  • 58
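The "Failure to initialize configuration" error from the ABFS driver typically means the storage credentials never reached the Hadoop configuration of the local cluster. A minimal account-key sketch for a local PySpark session (account, container and versions are placeholders; OAuth setups need different fs.azure.* keys):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             # Delta + hadoop-azure jars; versions here are illustrative only.
             .config("spark.jars.packages",
                     "io.delta:delta-core_2.12:2.4.0,org.apache.hadoop:hadoop-azure:3.3.4")
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             # The spark.hadoop.* prefix pushes the key into the Hadoop configuration.
             .config("spark.hadoop.fs.azure.account.key.myaccount.dfs.core.windows.net",
                     "<storage-account-key>")  # placeholder credentials
             .getOrCreate())

    df = spark.read.format("delta").load(
        "abfss://mycontainer@myaccount.dfs.core.windows.net/path/to/table")  # placeholder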