Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark and big data workloads. It provides ACID transactions, scalable metadata handling, time travel, a unified batch and streaming source and sink, and full compatibility with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
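
As a concrete illustration of the time travel and merge features listed above, here is a minimal PySpark sketch. It assumes an existing SparkSession with Delta Lake enabled (e.g. Databricks or a delta-spark environment) and a hypothetical table stored at /tmp/delta/events:

    from delta.tables import DeltaTable

    # Time travel: read an earlier snapshot of the table by version number.
    df_v0 = (spark.read.format("delta")
             .option("versionAsOf", 0)
             .load("/tmp/delta/events"))            # hypothetical path

    # Updates and deletes: merge new rows into the existing table.
    target = DeltaTable.forPath(spark, "/tmp/delta/events")
    updates = spark.createDataFrame([(1, "new_value")], ["id", "value"])
    (target.alias("t")
     .merge(updates.alias("s"), "t.id = s.id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())
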
1226 questions
0
votes
1 answer

Delta Lake CDF vs. Streaming Checkpoints

I haven't used Delta Lake's Change Data Feed yet and I want to understand whether it could be relevant for us or not. We have the following set-up: Raw Data (update events from DynamoDB) ends up in a Staging Area -> We clean the new data and append…
Robert Kossendey
  • 6,733
  • 2
  • 12
  • 42
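
For readers who have not used it yet, a rough sketch of what the Change Data Feed looks like in PySpark, assuming Delta Lake 2.x, an existing SparkSession with Delta support, and a hypothetical table named staging.events:

    # Enable CDF on an existing table (one-time property change).
    spark.sql("""
        ALTER TABLE staging.events
        SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
    """)

    # Read only the changes (inserts/updates/deletes) since a given version,
    # instead of reprocessing the whole table.
    changes = (spark.read.format("delta")
               .option("readChangeFeed", "true")
               .option("startingVersion", 5)        # hypothetical version
               .table("staging.events"))
    changes.select("_change_type", "_commit_version", "_commit_timestamp").show()
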
0
votes
1 answer

Unable to create individual delta table from delta format snappy.parquet files

I have multiple parquet files in a storage account and have converted them all into delta format. Now, I need to save the result into an individual delta table for each file. df=spark.read.option("mergeschema","true") \ …
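
One possible approach, sketched under the assumption that each source file should become its own Delta table at a separate output path (file names and paths below are hypothetical, and an existing SparkSession is assumed):

    files = ["/mnt/raw/file1.snappy.parquet",      # hypothetical input files
             "/mnt/raw/file2.snappy.parquet"]

    for path in files:
        # Read a single parquet file; mergeSchema only matters when reading several at once.
        df = spark.read.option("mergeSchema", "true").parquet(path)

        # Derive an output location per input file and write it as its own Delta table.
        name = path.split("/")[-1].split(".")[0]
        df.write.format("delta").mode("overwrite").save(f"/mnt/delta/{name}")
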
0
votes
0 answers

How to combine delta lake and an external table in a dedicated SQL Pool in Azure Synapse? Alternatives?

I am stuck... At this moment I feel the need to ask the community to help me find a solution to my problem, which is as follows: I receive a daily load of JSON files of a particular version containing raw data. My idea is to convert the json…
Anouar
  • 85
  • 5
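
A minimal sketch of the first step (JSON to Delta) with Synapse Spark, assuming a hypothetical ADLS layout; the dedicated SQL pool side would still need to be handled separately, since dedicated pools generally cannot query Delta directly (serverless SQL pools can):

    # Convert the daily JSON drop into a Delta table on the data lake.
    raw = spark.read.json(
        "abfss://raw@<storageaccount>.dfs.core.windows.net/daily/*.json")   # hypothetical path
    (raw.write.format("delta")
        .mode("append")
        .save("abfss://curated@<storageaccount>.dfs.core.windows.net/delta/raw_events"))
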
0
votes
1 answer

spark read a file based on value in Dataframe

I'm reading messages from Kafka. The message schema is: schema = StructType([ StructField("file_path", StringType(), True), StructField("table_name", StringType(), True), ]) For each row in the dataframe that I read, I want to open the…
Kallie
  • 147
  • 9
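
A sketch of one way to do this with foreachBatch, so each micro-batch's file_path values can be collected on the driver and opened with a regular batch read (broker, topic, and the assumption that the referenced files are parquet are all hypothetical):

    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([
        StructField("file_path", StringType(), True),
        StructField("table_name", StringType(), True),
    ])

    def process_batch(batch_df, batch_id):
        # The control messages are tiny, so collecting them to the driver is fine.
        for row in batch_df.collect():
            file_df = spark.read.parquet(row["file_path"])   # assuming the files are parquet
            file_df.write.format("delta").mode("append").saveAsTable(row["table_name"])

    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
           .option("subscribe", "file-events")                 # hypothetical topic
           .load())

    parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
                 .select(from_json(col("json"), schema).alias("msg"))
                 .select("msg.*"))

    query = parsed.writeStream.foreachBatch(process_batch).start()
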
0
votes
1 answer

How to write data to snowflake with python from parquet file

I am trying to write data from parquet files locally stored in the folder data/. For your information, these files are coming from a Delta Lake. # files_list contains this ['part-00000-c8fc3190-8a49-49c5-a000-b3f885e3a053-c000.snappy.parquet',…
Kaharon
  • 365
  • 4
  • 16
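
A minimal sketch using pandas plus the Snowflake Python connector's write_pandas helper; connection parameters and the table name are placeholders. Note that Delta's _delta_log decides which parquet part files are current, so reading the raw files directly may include stale data:

    import glob
    import pandas as pd
    import snowflake.connector
    from snowflake.connector.pandas_tools import write_pandas

    # Read the local parquet part files into one DataFrame.
    df = pd.concat(pd.read_parquet(p) for p in glob.glob("data/*.parquet"))

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",   # placeholders
        warehouse="my_wh", database="my_db", schema="public",
    )

    # Bulk-load the DataFrame into an existing Snowflake table.
    write_pandas(conn, df, table_name="MY_TABLE")
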
0
votes
1 answer

Error in SQL statement: AssertionError: assertion failed: No plan for DeleteFromTable

WITH cte AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY columnname ORDER BY columnname) row_num FROM tablename ) DELETE FROM cte WHERE row_num > 1; I am using this query to remove duplicate records from my Delta table, but I get…
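
Delta Lake's DELETE works on tables, not on CTEs, which is one common reason for the "No plan for DeleteFromTable" assertion. A hedged sketch of a workaround that deduplicates with a window function and rewrites the table (the table and column names come from the question, the rest is hypothetical):

    from pyspark.sql.functions import row_number, col
    from pyspark.sql.window import Window

    w = Window.partitionBy("columnname").orderBy("columnname")

    deduped = (spark.table("tablename")
               .withColumn("row_num", row_number().over(w))
               .filter(col("row_num") == 1)
               .drop("row_num"))

    # Delta reads from a snapshot, so overwriting the table it was read from is allowed.
    deduped.write.format("delta").mode("overwrite").saveAsTable("tablename")
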
0
votes
0 answers

Spark Structured Streaming Foreachbatch [ pyspark]

We are using Spark Structured Streaming with foreachBatch to update records in a Delta table. The number of records in each batch is random. We have 10,000 records in the Kinesis stream, but when creating a micro-batch it picks a random number of…
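
The micro-batch size is controlled by the source and trigger settings rather than by foreachBatch itself. A minimal sketch of the usual foreachBatch upsert pattern with an explicit trigger interval; source_stream is assumed to be the already-defined Kinesis streaming DataFrame, and the table and key names are hypothetical:

    from delta.tables import DeltaTable

    def upsert_to_delta(batch_df, batch_id):
        target = DeltaTable.forName(spark, "my_db.my_table")    # hypothetical table
        (target.alias("t")
         .merge(batch_df.alias("s"), "t.id = s.id")
         .whenMatchedUpdateAll()
         .whenNotMatchedInsertAll()
         .execute())

    query = (source_stream.writeStream                  # source_stream: the Kinesis stream DataFrame
             .foreachBatch(upsert_to_delta)
             .option("checkpointLocation", "/mnt/checkpoints/my_table")
             .trigger(processingTime="1 minute")        # batch boundaries follow the trigger
             .start())
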
0
votes
0 answers

Delta Merge Operation not always inserts/updates all of the records

This happens from time to time - that is the strange part. My current solution: re-run the job! But this is very reactive, and I am not happy with it. This is what my merge statement looks like: MERGE INTO target_tbl AS Target USING df_source AS Source…
Up_One
  • 5,213
  • 3
  • 33
  • 65
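
For reference, a minimal version of a Delta MERGE issued from PySpark. When matches appear to be skipped, a frequent cause is duplicate keys in the source, so a dropDuplicates on the join key is included; apart from the names taken from the question, everything here is hypothetical:

    # Deduplicate the source on the merge key first: Delta MERGE errors out or behaves
    # unexpectedly when several source rows match the same target row.
    df_source.dropDuplicates(["id"]).createOrReplaceTempView("df_source")

    spark.sql("""
        MERGE INTO target_tbl AS Target
        USING df_source AS Source
          ON Target.id = Source.id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)
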
0
votes
2 answers

How to configure hive-jdbc-uber-jar in JDBC Sink Connector

I'm trying to use hive-jdbc-uber-jar and configure the JDBC sink connector, but the connector is throwing an error: [2022-08-31 00:21:21,583] INFO Unable to connect to database on attempt 1/3. Will retry in 10000 ms.…
0
votes
1 answer

pyspark write stream to delta no data

I use PySpark to readStream from Kafka, process the data, and writeStream to a Delta table (pyspark 3.2.1, io.delta 1.2.2, hadoop 3.3.0). This code does not produce any results in the output Delta table when deployed in Kubernetes or running in Databricks. Am I producing no…
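
For comparison, a bare-bones sketch of the Kafka-to-Delta pipeline; two common reasons for "no data" are startingOffsets defaulting to latest with no new messages arriving, and not blocking on the query so the driver exits. Broker, topic, and paths below are hypothetical:

    df = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
          .option("subscribe", "events")                      # hypothetical topic
          .option("startingOffsets", "earliest")              # "latest" would skip existing messages
          .load())

    query = (df.selectExpr("CAST(value AS STRING) AS value")
             .writeStream.format("delta")
             .outputMode("append")
             .option("checkpointLocation", "/mnt/checkpoints/events")
             .start("/mnt/delta/events"))

    query.awaitTermination()   # keep the driver alive, e.g. when running on Kubernetes
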
0
votes
0 answers

Is it possible to read a delta table with dbt?

I used dbt (Spark adapter) on an EMR cluster on AWS, and whenever I used a table I had to connect to the AWS Data Catalog. Is there a way to read a table written to S3 and use it in dbt queries?
0
votes
0 answers

How to get the Synapse Analytics delta table load path?

How do I load a Delta table in Synapse using the Delta table path? I want to use the Synapse OPTIMIZE command: OPTIMIZE {tablename}
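
A short sketch of reading a Delta table by path in a Synapse Spark pool, and of looking the path up from an existing table with DESCRIBE DETAIL (the storage path and table name are hypothetical):

    # Look up the storage location of an existing Delta table.
    spark.sql("DESCRIBE DETAIL my_db.my_table").select("location").show(truncate=False)

    # Load the table directly from its path.
    df = spark.read.format("delta").load(
        "abfss://container@account.dfs.core.windows.net/delta/my_table")   # hypothetical path

    # OPTIMIZE can also target the path, if the installed Delta Lake version supports it.
    spark.sql("OPTIMIZE delta.`abfss://container@account.dfs.core.windows.net/delta/my_table`")
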
0
votes
1 answer

External Table in Databricks is showing only future date data

I have a Delta table in Databricks and the data is available in ADLS. The data is partitioned by a date column; from 01-06-2022 onwards the data is available in Parquet format in ADLS, but when I query the table in Databricks I can only see data from a future date onwards…
0
votes
2 answers

How to optimize the PySpark code to get better performance

I am trying to fetch when the table (a Delta table) was last optimized using the code below, and I am getting the output as expected. This code should run for all the tables present in the database. table_name_or_path = "abcd" df = spark.sql("desc…
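
A sketch of one way to get the last OPTIMIZE time for a single table from its own history, which avoids describing every table in the database (the table name is hypothetical):

    from pyspark.sql.functions import col

    history = spark.sql("DESCRIBE HISTORY my_db.my_table")    # hypothetical table

    last_optimize = (history
                     .filter(col("operation") == "OPTIMIZE")
                     .orderBy(col("timestamp").desc())
                     .select("timestamp")
                     .first())

    print(last_optimize["timestamp"] if last_optimize else "never optimized")
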
0
votes
1 answer

Add a column to a delta table in Azure Synapse

I have a delta table that I created in Azure Synapse using a mapping data flow. The data flow reads append-only changes from Dataverse, finds the latest value, and upserts them to the table. Now, I'd like to add a column to the delta table. When you…
Steve Platz
  • 2,215
  • 5
  • 28
  • 27
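
Two common ways to do this from a Synapse Spark notebook, sketched with hypothetical names: an explicit ALTER TABLE, or schema evolution on write via mergeSchema.

    # Option 1: add the column explicitly.
    spark.sql("ALTER TABLE my_db.my_table ADD COLUMNS (new_col STRING)")

    # Option 2: let the write evolve the schema when the incoming data has the extra column.
    (df_with_new_col.write.format("delta")
     .mode("append")
     .option("mergeSchema", "true")
     .saveAsTable("my_db.my_table"))
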