Questions tagged [delta-lake]

Delta Lake is an open source storage layer that runs on top of Apache Spark. It provides ACID transactions, scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
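A minimal PySpark sketch of a few of the features above (transactional writes, time travel, and MERGE-based upserts). The paths and column names are illustrative, and it assumes a delta-core package matching your Spark/Scala build is on the classpath (e.g. started with --packages io.delta:delta-core_2.12:<version>):

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable   # available once delta-core is on the classpath

spark = SparkSession.builder.appName("delta-quickstart").getOrCreate()

# ACID write: the save is a single atomic commit to the Delta transaction log.
df = spark.range(0, 5).withColumnRenamed("id", "key")
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")   # illustrative path

# Time travel: read the table as it was at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")

# Updates and deletes: upsert new rows with MERGE.
updates = spark.range(3, 8).withColumnRenamed("id", "key")
target = DeltaTable.forPath(spark, "/tmp/delta/events")
(target.alias("t")
 .merge(updates.alias("u"), "t.key = u.key")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```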
1226 questions
0
votes
2 answers

Not able to read from S3 with Spark when using the hadoop-aws 2.8.0 jar, and not able to write a Delta table to S3 when using hadoop-aws 2.7.3

I'm unable to access S3 from Spark when I use the hadoop-aws 2.8.0 jar. Basically, I want to read a (Parquet) file from S3 and write it as a Delta table in S3. //Spark shell command spark-shell --packages…
Raptor0009
  • 258
  • 4
  • 14
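For the S3 question above, the usual sticking point is that the hadoop-aws jar has to match the Hadoop build bundled with Spark (plus a compatible aws-java-sdk). A hedged PySpark sketch, with placeholder credentials, bucket, and versions:

```python
# Launched with something like (versions must match your Spark's Hadoop build):
#   pyspark --packages io.delta:delta-core_2.11:0.5.0,org.apache.hadoop:hadoop-aws:2.7.3
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3-parquet-to-delta")
         .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
         .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")   # placeholder
         .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")   # placeholder
         .getOrCreate())

# Read Parquet from S3 and rewrite it as a Delta table (hypothetical bucket/paths).
df = spark.read.parquet("s3a://my-bucket/input/")
df.write.format("delta").mode("overwrite").save("s3a://my-bucket/delta/table/")
```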
0
votes
0 answers

How to handle duplicates using foreachBatch in spark structured streaming in case of stream termination?

I have a stream that uses foreachBatch and keeps checkpoints in a data lake, but if I cancel the stream, it happens that the last write is not fully committed. Then the next time I start the stream I get duplicates, since it starts from the last…
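foreachBatch is at-least-once, so an interrupted write can be replayed on restart; one common way to keep that from producing duplicates is to make the batch write idempotent with a MERGE keyed on a unique id. A hedged sketch with illustrative paths and a hypothetical event_id key:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("idempotent-foreachBatch").getOrCreate()

def upsert_batch(batch_df, batch_id):
    # A replayed micro-batch matches on event_id and overwrites instead of duplicating.
    target = DeltaTable.forPath(spark, "/mnt/lake/target")          # hypothetical path
    (target.alias("t")
     .merge(batch_df.alias("s"), "t.event_id = s.event_id")         # hypothetical key column
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

stream_df = spark.readStream.format("delta").load("/mnt/lake/source")  # hypothetical source
(stream_df.writeStream
 .foreachBatch(upsert_batch)
 .option("checkpointLocation", "/mnt/lake/_checkpoints/target")
 .start())
```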
0
votes
2 answers

deltaTable update throws NoSuchMethodError

I started to look into Delta Lake and got this exception when trying to update a table. I'm using AWS EMR 5.29, Spark 2.4.4, Scala 2.11.12, and io.delta:delta-core_2.11:0.5.0. import io.delta.tables._ import…
Guy Harari
  • 11
  • 1
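A NoSuchMethodError in this setup is typically a classpath or version mismatch rather than an API usage problem (delta-core 0.5.0 expects Spark 2.4.2+ built for the same Scala version, and EMR ships its own Spark build). For reference, a hedged sketch of the update call itself, with a hypothetical path and columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta-update").getOrCreate()

table = DeltaTable.forPath(spark, "/mnt/delta/events")    # hypothetical path
table.update(
    condition=expr("eventType = 'click'"),                # hypothetical column/value
    set={"eventType": expr("'CLICK'")}
)
```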
0
votes
1 answer

Use Unmanaged table in Delta lake on Top of ADLS Gen2

I use ADF to ingest data from SQL Server to ADLS Gen2 in Snappy-compressed Parquet format, but the size of the file in the sink goes up to 120 GB. The size causes me a lot of problems when I read this file in Spark and join the data from this file with many…
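One hedged approach for the question above: rewrite the single 120 GB Parquet file as a partitioned Delta table, then register an unmanaged (external) table over that location so only metadata lives in the metastore. Assumes Databricks or a Delta/Spark version with catalog support; the account, container, and column names are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-delta").getOrCreate()

src = "abfss://raw@myaccount.dfs.core.windows.net/sql/big_table.parquet"  # hypothetical
dst = "abfss://curated@myaccount.dfs.core.windows.net/delta/big_table"    # hypothetical

(spark.read.parquet(src)
 .repartition("load_date")            # hypothetical partition column; splits the 120 GB file
 .write.format("delta")
 .partitionBy("load_date")
 .mode("overwrite")
 .save(dst))

# Unmanaged table: dropping it removes only the metadata, never the files at dst.
spark.sql(f"CREATE TABLE IF NOT EXISTS big_table USING DELTA LOCATION '{dst}'")
```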
0
votes
1 answer

Databricks Delta Lake + ADSL + Presto

Databricks has just released a public preview of the Delta Lake and Presto integration. I'm new to Azure, and the link has multiple mentions of EMR and Athena but lacks Azure keywords. So I have to ask a stupid question: Am I right that Presto…
VB_
  • 45,112
  • 42
  • 145
  • 293
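For the Presto question, the integration goes through a generated manifest of the table's current Parquet files rather than the Delta log itself, so the same mechanism applies whether the files sit on S3 or Azure storage. A hedged sketch with an illustrative ADLS path:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta-manifest").getOrCreate()

table = DeltaTable.forPath(
    spark, "abfss://curated@myaccount.dfs.core.windows.net/delta/events")  # hypothetical path
table.generate("symlink_format_manifest")  # writes a _symlink_format_manifest/ directory

# Presto/Athena are then pointed at that manifest location via an external table definition.
```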
0
votes
1 answer

Why does writeStream not write in delta format, even though I have coded it

Here is my code. The writeStream is writing records in "parquet" format but not in "delta", even though I have specified the delta format. spark .readStream .format("delta") .option("latestFirst","true") .option("ignoreDeletes",…
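The output format is controlled by the writer, not the reader, so a hedged sketch of the shape such a stream usually takes (illustrative paths, with the format and a checkpoint location set on the writer):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-to-delta-stream").getOrCreate()

(spark.readStream
 .format("delta")
 .option("ignoreDeletes", "true")
 .load("/mnt/delta/source")                    # hypothetical source table
 .writeStream
 .format("delta")                              # the sink format is decided here
 .option("checkpointLocation", "/mnt/delta/_checkpoints/sink")
 .start("/mnt/delta/sink"))                    # hypothetical output path
```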
0
votes
3 answers

Chaining Delta Streams programmatically raising AnalysisException

Situation: I am producing a Delta folder with data from a previous streaming query A, and later reading from another DF, as shown here: DF_OUT.writeStream.format("delta").(...).start("path") (...) DF_IN =…
Mehdi LAMRANI
  • 11,289
  • 14
  • 88
  • 130
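The AnalysisException usually means the downstream reader starts before the intermediate path has any committed Delta version (no _delta_log yet). A hedged sketch of one workaround, waiting for the first commit before starting the second stream; paths are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("chained-delta-streams").getOrCreate()

mid = "/mnt/delta/stage"   # hypothetical intermediate Delta folder

# Stream A writes into the intermediate folder.
q_a = (spark.readStream.format("delta").load("/mnt/delta/source")
       .writeStream.format("delta")
       .option("checkpointLocation", "/mnt/delta/_chk/stage")
       .start(mid))

q_a.processAllAvailable()   # block until the first commit exists at mid

# Stream B can now read the intermediate table without a missing-schema error.
q_b = (spark.readStream.format("delta").load(mid)
       .writeStream.format("delta")
       .option("checkpointLocation", "/mnt/delta/_chk/final")
       .start("/mnt/delta/final"))
```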
0
votes
1 answer

Does delta lake support update with join?

Is it possible to do an update on a Delta Lake table with a join? In MySQL (and other databases) you could do something like update table x join table y on y.a=x.a set x.b=y.b where x.c='something' Do we have something similar in Delta? I know they…
Ridwan
  • 301
  • 5
  • 12
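Delta Lake has no UPDATE ... JOIN statement, but MERGE expresses the same pattern. A hedged sketch mirroring the MySQL example in the question (table paths are illustrative):

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta-update-with-join").getOrCreate()

x = DeltaTable.forPath(spark, "/mnt/delta/x")         # hypothetical target table
y = spark.read.format("delta").load("/mnt/delta/y")   # hypothetical source table

(x.alias("x")
 .merge(y.alias("y"), "y.a = x.a AND x.c = 'something'")
 .whenMatchedUpdate(set={"b": "y.b"})
 .execute())
```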
0
votes
2 answers

Handling duplicates while processing Streaming data in Databricks Delta table with Spark Structured Streaming?

I am using Spark Structured Streaming with Azure Databricks Delta, where I am writing to a Delta table (the Delta table name is raw). I am reading from Azure Files, where I am receiving out-of-order data, and I have 2 columns in it, "smtUidNr" and "msgTs". I am…
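One common pattern for this case is foreachBatch plus MERGE: deduplicate each micro-batch on smtUidNr, then let only newer events (by msgTs) overwrite what is already in the raw table. A hedged sketch with an illustrative path:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("dedupe-out-of-order").getOrCreate()

def upsert_latest(batch_df, batch_id):
    # Keep only the newest record per smtUidNr within this micro-batch.
    latest = (batch_df
              .withColumn("rn", F.row_number().over(
                  Window.partitionBy("smtUidNr").orderBy(F.col("msgTs").desc())))
              .filter("rn = 1").drop("rn"))
    raw = DeltaTable.forPath(spark, "/mnt/delta/raw")          # hypothetical path
    (raw.alias("t")
     .merge(latest.alias("s"), "t.smtUidNr = s.smtUidNr")
     .whenMatchedUpdateAll(condition="s.msgTs > t.msgTs")      # ignore older, out-of-order events
     .whenNotMatchedInsertAll()
     .execute())

# Used as: stream_df.writeStream.foreachBatch(upsert_latest).option(...).start()
```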
0
votes
1 answer

Delta lake create table from schema

I have the schema associated with the table to be created, fetched from the Confluent Schema Registry, in the code below: private val avroSchema = schemaRegistryClient.getLatestSchemaMetadata("topicName").getSchema private var sparkSchema =…
Vikas J
  • 358
  • 1
  • 5
  • 17
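Once the registry schema has been converted to a Spark StructType (the Avro-to-StructType conversion lives on the Scala side in spark-avro), writing an empty DataFrame with that schema is enough to create the Delta table. A hedged sketch with hypothetical fields standing in for the fetched schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("delta-from-schema").getOrCreate()

spark_schema = StructType([
    StructField("id", LongType(), True),         # hypothetical fields standing in for
    StructField("payload", StringType(), True),  # the schema fetched from the registry
])

# An empty write creates the table's _delta_log with this schema.
(spark.createDataFrame([], spark_schema)
 .write.format("delta")
 .mode("overwrite")
 .save("/mnt/delta/topicName"))                  # hypothetical path
```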
0
votes
1 answer

Append only new aggregates based on groupby keys

I have to process some files which arrive daily. The information has a primary key (date, client_id, operation_id), so I created a stream which appends only new data into a Delta table: operations\ .repartition('date')\ …
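Plain append mode cannot check what is already in the table, so one hedged option is foreachBatch plus a MERGE that has only a not-matched clause, keyed on (date, client_id, operation_id); existing keys are left untouched and only new ones are inserted. Paths are illustrative:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("append-only-new-keys").getOrCreate()

def append_new(batch_df, batch_id):
    target = DeltaTable.forPath(spark, "/mnt/delta/operations")   # hypothetical path
    (target.alias("t")
     .merge(batch_df.alias("s"),
            "t.date = s.date AND t.client_id = s.client_id "
            "AND t.operation_id = s.operation_id")
     .whenNotMatchedInsertAll()   # no matched clause: existing keys are never rewritten
     .execute())

# Used as: operations.writeStream.foreachBatch(append_new).option(...).start()
```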
0
votes
1 answer

How to drop duplicates while streaming in spark

I have a streaming job that streams data into Delta Lake in Databricks Spark, and I'm trying to drop duplicates while streaming so my Delta data has no duplicates. Here's what I have so far: inputPath = "my_input_path" schema =…
efsee
  • 579
  • 1
  • 10
  • 22
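Dropping duplicates inside the stream itself is usually done with a watermark plus dropDuplicates on the key columns, so Spark can discard deduplication state older than the watermark. A hedged sketch; the timestamp and key columns are hypothetical and the paths echo the placeholders in the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-drop-duplicates").getOrCreate()

(spark.readStream
 .format("delta")                                      # assumes a Delta source; adjust for files
 .load("my_input_path")
 .withWatermark("event_time", "1 hour")                # hypothetical timestamp column
 .dropDuplicates(["event_id", "event_time"])           # hypothetical key columns
 .writeStream
 .format("delta")
 .option("checkpointLocation", "my_checkpoint_path")   # placeholder
 .start("my_output_path"))                             # placeholder
```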
0
votes
1 answer

Write delta file to S3 (MinIO) - PySpark 2.4.3

I am currently trying to write a Delta Lake Parquet file to S3, which I replace locally with MinIO. I can read/write standard Parquet files to S3 perfectly fine. However, when I follow the Delta Lake example "Configure Delta to S3", it seems I can't…
Thelin90
  • 37
  • 2
  • 11
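Writing Delta (as opposed to plain Parquet) to S3-compatible storage needs the S3 LogStore configuration on top of the usual S3A settings, and MinIO additionally needs an endpoint with path-style access. A hedged sketch with placeholder endpoint, credentials, and bucket:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("delta-minio")
         # Delta's S3-compatible commit protocol, per the "Configure Delta to S3" docs.
         .config("spark.delta.logStore.class",
                 "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
         .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")   # MinIO endpoint
         .config("spark.hadoop.fs.s3a.access.key", "minio")                 # placeholder
         .config("spark.hadoop.fs.s3a.secret.key", "minio123")              # placeholder
         .config("spark.hadoop.fs.s3a.path.style.access", "true")
         .getOrCreate())

spark.range(5).write.format("delta").mode("overwrite").save("s3a://my-bucket/delta/demo")
```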
0
votes
1 answer

'SparkSession' object has no attribute 'databricks'

New to Databricks and Spark, I'm trying to run the command below and got this error: spark.databricks.delta.retentionDurationCheck.enabled= "false" error: 'SparkSession' object has no attribute 'databricks'
efsee
  • 579
  • 1
  • 10
  • 22
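spark.databricks.delta.retentionDurationCheck.enabled is a Spark configuration key, not an attribute of the SparkSession object, so it is set through the conf API (or SQL SET) rather than by attribute access. A minimal sketch, assuming spark is the existing session (predefined on Databricks):

```python
# Set the config instead of assigning to spark.databricks....
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# Equivalent SQL form:
spark.sql("SET spark.databricks.delta.retentionDurationCheck.enabled = false")
```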
0
votes
2 answers

Write to csv file from deltalake table in databricks

How do I write the contents of a Delta Lake table to a CSV file in Azure Databricks? Is there a way where I do not have to first dump the contents to a DataFrame? https://docs.databricks.com/delta/delta-batch.html
SA2010
  • 183
  • 4
  • 12
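There is no Delta-to-CSV export that bypasses a DataFrame, but the read is lazy, so going through one is just a query plan rather than a full dump into memory. A hedged sketch with illustrative paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-to-csv").getOrCreate()

(spark.read.format("delta")
 .load("/mnt/delta/my_table")            # hypothetical Delta table path
 .write.option("header", "true")
 .mode("overwrite")
 .csv("/mnt/exports/my_table_csv"))      # produces a folder of CSV part files
```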