Questions tagged [delta-lake]

Delta Lake is an open source storage layer that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files at ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
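A minimal PySpark sketch of a few of these features, using the delta-spark package; the table path, schema, and sample data are assumptions made for illustration:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    # SparkSession configured for Delta (requires the delta-spark package)
    spark = (
        SparkSession.builder
        .appName("delta-lake-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    path = "/tmp/delta/events"  # hypothetical table location

    # ACID write in the open, Parquet-based format
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.write.format("delta").mode("overwrite").save(path)

    # Time travel: read an earlier version of the table
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

    # Updates and deletes through the DeltaTable API
    dt = DeltaTable.forPath(spark, path)
    dt.delete("id = 2")

    # Audit history from the transaction log
    dt.history().show()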
1226 questions
0
votes
1 answer

Spark flatMapGroupsWithState randomly losing events

I have a Spark job that is composed as follows: 1- read a static DataFrame from Delta Lake. 2- read a streaming DataFrame from Delta Lake. 3- join the stream with the static DataFrame. 4- do a flatMapGroupsWithState. 5- write the output. The problem is I have a…
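A hedged sketch of the stream-static join part of such a pipeline (steps 1-3 and 5); the paths, join key, and an existing SparkSession `spark` are assumptions:

    # Step 1: static DataFrame from Delta Lake
    static_df = spark.read.format("delta").load("/delta/static")
    # Step 2: streaming DataFrame from Delta Lake
    stream_df = spark.readStream.format("delta").load("/delta/events")
    # Step 3: stream-static join
    joined = stream_df.join(static_df, "id")
    # Step 4 (flatMapGroupsWithState) is a Scala/Java API; the closest PySpark
    # equivalent is applyInPandasWithState (Spark 3.4+), omitted here.
    # Step 5: write the output
    query = (joined.writeStream
             .format("delta")
             .option("checkpointLocation", "/delta/_checkpoints/pipeline")
             .start("/delta/output"))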
0
votes
0 answers

Convert a nested JSON string column into a map type column in Spark

Overall aim: I have data landing in blob storage from an Azure service in the form of JSON files, where each line in a file is a nested JSON object. I want to process this with Spark and finally store it as a Delta table with nested struct/map type columns…
Amitoz
  • 30
  • 7
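A hedged sketch of one way to do this with from_json and a MapType schema; the source path, column names, value types, and an existing SparkSession `spark` are assumptions:

    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import MapType, StringType

    # Each line is a JSON object; the nested payload sits in a string column
    raw = spark.read.json("/mnt/landing/json/")  # hypothetical mount of the blob container
    parsed = raw.withColumn(
        "attributes_map",
        from_json(col("attributes_json"), MapType(StringType(), StringType()))
    )
    parsed.write.format("delta").mode("append").save("/delta/bronze/events")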
0
votes
0 answers

Cannot drop an unmanaged Delta Lake table through PySpark code

I am trying to drop an unmanaged table but it only drops its metadata. I am using the following code in Databricks: spark.sql("DROP TABLE IF EXISTS…
Umer
  • 25
  • 5
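A hedged sketch of the usual two-step approach: for an unmanaged (external) table, DROP TABLE removes only the metastore entry, so the files at the table location have to be deleted separately. The table name and path are assumptions:

    # Remove the metastore entry for the external table
    spark.sql("DROP TABLE IF EXISTS mydb.my_external_table")

    # On Databricks, remove the underlying data files as well
    dbutils.fs.rm("/mnt/datalake/my_external_table", True)  # True = recurse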
0
votes
0 answers

NoClassDefFoundError: Could not initialize class org.apache.spark.sql.delta.sources.DeltaSQLConf$

I am trying to read the Delta log file content in an Azure notebook and the code is failing, whereas the same code works locally in IntelliJ. The dependency I have on the cluster is: delta-core_2.12:2.0.0 Code reference: val configMap…
0
votes
1 answer

Schema change in Delta table - How to remove a partition from the table schema without overwriting?

Given a Delta table: CREATE TABLE IF NOT EXISTS mytable ( ... ) USING DELTA PARTITIONED BY part_a, part_b, part_c LOCATION '/some/path/' This table already has tons of data. However, the desired schema is: CREATE TABLE IF NOT EXISTS mytable ( …
YFl
  • 845
  • 7
  • 22
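A hedged sketch of the rewrite-based approach, since Delta does not let you change a table's partition columns in place; the path and partition columns are assumptions, and overwriteSchema is needed when the partitioning changes:

    # Read the current table and rewrite it with the new partition layout
    df = spark.read.format("delta").load("/some/path/")
    (df.write.format("delta")
       .mode("overwrite")
       .option("overwriteSchema", "true")
       .partitionBy("part_a", "part_b")  # part_c dropped from the partitioning
       .save("/some/path/"))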
0
votes
1 answer

writeStream with format("delta") is not producing a Delta table

I am using Auto Loader in Databricks. However, when I save the stream as a Delta table, the generated table is NOT Delta. .writeStream .format("delta") # <----------- .option("checkpointLocation", checkpoint_path) .option("path",…
Hanan Shteingart
  • 8,480
  • 10
  • 53
  • 66
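A hedged sketch of writing an Auto Loader stream straight to a named Delta table with toTable(), which registers it as a Delta table rather than just a directory of files; the paths and table name are assumptions:

    source_path = "/mnt/landing/events/"              # hypothetical input folder
    schema_path = "/mnt/landing/_schemas/events"      # hypothetical schema location
    checkpoint_path = "/mnt/landing/_checkpoints/events"

    # Auto Loader source
    df = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", schema_path)
          .load(source_path))

    # Write the stream to a managed Delta table instead of a bare path
    (df.writeStream
       .format("delta")
       .option("checkpointLocation", checkpoint_path)
       .trigger(availableNow=True)
       .toTable("bronze.events"))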
0
votes
1 answer

Great Expectations with a Delta table

I am trying to run a Great Expectations suite on a Delta table in Databricks, but I want to run it on part of the table via a query. Though the validation runs fine, it runs on the full table data. I know that I can load a DataFrame…
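A hedged sketch of validating only a slice of the table by filtering the DataFrame before wrapping it for Great Expectations; the table name, filter column, and the legacy SparkDFDataset wrapper are assumptions:

    from great_expectations.dataset import SparkDFDataset

    # Load only the part of the Delta table to be validated
    subset = (spark.read.table("mydb.mytable")
              .filter("event_date >= '2023-01-01'"))

    # Wrap the filtered DataFrame and run expectations against it
    ge_df = SparkDFDataset(subset)
    ge_df.expect_column_values_to_not_be_null("id")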
0
votes
1 answer

NoSuchMethodError: org.apache.spark.sql.catalyst.plans.logical.DeleteFromTable in Intellij

I am trying to use the method .delete() to remove one record from a Delta table as follows: val my_dt: DeltaTable = DeltaTable.forPath(ss, my_delta_path) my_dt.delete("pk = '123456'") When I run my code in IntelliJ I am getting the following…
0
votes
0 answers

Updating a Delta table from a DataFrame with specific logic

I am looking to find the best way to update an existing Delta table from a DataFrame. As an example, I have the following Delta…
FizzyGood
  • 33
  • 8
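A hedged sketch of a conditional upsert from a DataFrame into an existing Delta table using the merge API; the table path, merge key, and columns are assumptions:

    from delta.tables import DeltaTable

    # Incoming changes (stand-in for the DataFrame from the question)
    updates_df = spark.createDataFrame(
        [(1, "active"), (2, "inactive")], ["customer_id", "status"])

    target = DeltaTable.forPath(spark, "/delta/dim_customer")
    (target.alias("t")
     .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
     .whenMatchedUpdate(set={"status": "s.status"})
     .whenNotMatchedInsertAll()
     .execute())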
0
votes
0 answers

Can a table created from DDL with a NULL data type hold a normal data type later?

I am copying an existing table into a different environment using its DDL. Some of the columns are of type 'NULL'. Right now the values for said columns are NULL in the sample I am given, but they probably shouldn't be according to their names (will…
Raie
  • 61
  • 6
0
votes
1 answer

Auto-increment column in a Delta table without rekeying

I have a dim Delta table; so far I am calculating dim_id using row_number() + max(dim_id). Dim_id | user_id: 1001 | 1, 1002 | 3, 1003 | 5, 1004 | 9. For example, if I deleted id 1004 and then insert a new user_id like 7 (row_number() +…
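A hedged sketch using a Delta identity column (supported on Databricks Runtime 10.4+), which keeps existing surrogate keys stable instead of re-deriving them with row_number(); the table and column names are assumptions:

    # The identity column assigns dim_id values that survive deletes
    spark.sql("""
      CREATE TABLE IF NOT EXISTS dim_user (
        dim_id  BIGINT GENERATED ALWAYS AS IDENTITY,
        user_id BIGINT
      ) USING DELTA
    """)

    # New rows get fresh dim_id values; deleted ids are not reused
    spark.sql("INSERT INTO dim_user (user_id) VALUES (7)")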
0
votes
1 answer

How to handle CSV files in the Bronze layer without the extra layer

If my raw data is in CSV format and I would like to store it in the Bronze layer as Delta tables, then I would end up with four layers: Raw + Bronze + Silver + Gold. Which approach should I consider?
Su1tan
  • 45
  • 5
0
votes
1 answer

Databricks Error: Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table conflicting way

I am attempting to carry out the following merge statement with PySpark on the table below (please note, this is my first attempt at creating a table on Stack Overflow using an HTML snippet, so I hope it shows the table - I think you have to click on RUN…
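A hedged sketch of the usual fix: this error typically means the source has more than one row per merge key, so deduplicate the source before merging. The key and ordering columns, and the source DataFrame name, are assumptions:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Keep only the latest source row per merge key
    w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
    deduped = (source_df                      # source_df: the DataFrame being merged in
               .withColumn("rn", F.row_number().over(w))
               .filter("rn = 1")
               .drop("rn"))
    # then merge `deduped` into the target Delta table as usual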
0
votes
0 answers

Can we use Delta Live Tables with open source Delta Lake, e.g. with MinIO Object Storage?

Can we use Delta Live Tables with open source Delta Lake? Currently we are using MinIO Object Storage. I would like to know whether DLT can be used for transformation in such cases. I am working on R&D for my project's data architecture.
0
votes
0 answers

Is there a way to move data from a Delta table in S3 to Redshift using the COPY command?

I have Delta tables created in an S3 bucket and need to load this data as-is into Redshift tables. The Delta tables have a symlink format manifest generated, and some of them might have partitions. Is there a way to move this data into…