Questions tagged [delta-lake]

Delta Lake is an open source storage layer that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel (data versioning), a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and because of the lack of transactions, data engineers have to go through a tedious process to ensure data integrity. Delta Lake brings ACID transactions to your data lakes and provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: The Delta Lake transaction log records details about every change made to the data, providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
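
For orientation, below is a minimal PySpark sketch exercising several of the features above (ACID writes, schema-enforced appends, time travel, audit history, and merge-based upserts). It is illustrative only: the path and table contents are made up, and it assumes Spark 3.x with the io.delta:delta-core_2.12 package on the classpath and the Delta SQL extension and catalog configured, roughly as in the quickstart on delta.io.

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    # Assumes io.delta:delta-core_2.12 is available to the Spark session.
    spark = (
        SparkSession.builder.appName("delta-lake-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    path = "/tmp/delta-demo"  # hypothetical location; could equally be an s3a:// or abfss:// path

    # ACID write: data is stored as Parquet files plus a JSON transaction log.
    spark.range(0, 5).write.format("delta").mode("overwrite").save(path)

    # Appends are schema-enforced; mismatched columns are rejected unless schema evolution is enabled.
    spark.range(5, 10).write.format("delta").mode("append").save(path)

    # Time travel: read the table as of an earlier version.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

    # Upserts (merge) through the DeltaTable API; update and delete are also available.
    target = DeltaTable.forPath(spark, path)
    updates = spark.range(8, 12)
    (target.alias("t")
        .merge(updates.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    # Audit history: every commit is recorded in the transaction log.
    target.history().show()
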
1226 questions
0 votes • 1 answer

Reference 'unit' is ambiguous, could be: unit, unit

I'm trying to load all incoming parquet files from an S3 Bucket, and process them with delta-lake. I'm getting an exception. val df = spark.readStream().parquet("s3a://$bucketName/") df.select("unit") //filter data! .writeStream() …
Tamás • 19 • 1 • 6

0 votes • 1 answer

How to add Delta Lake support to Zeppelin's spark interpreter?

I'm trying to add the Delta Lake support to Zeppelin. So far I've tried adding the io.delta:delta-core_2.12:0.7.0 dependency to the spark interpreter, as well as a couple other related actions within the interpreters view... but nothing has…
kylemart • 1,156 • 1 • 13 • 25

0 votes • 1 answer

Streaming data into delta lake, reading filtered results

My goal is to continuously put incoming parquet files into delta-lake, make queries, and get the results into a Rest API. All files are in s3 buckets. //listen for changes val df = spark.readStream().parquet("s3a://myBucket/folder") //write changes…
Tamás • 19 • 1 • 6

0 votes • 1 answer

Error when trying to move data from on-prem SQL database to Azure Delta lake

I am trying to move large amounts of reference data from an on-prem SQL Server to Delta Lake to be used in Databricks processing. To move this data, I am trying to use Azure Data Factory via a simple Copy data activity, but as soon as I start the pipeline I…
rsapru • 688 • 14 • 30

0 votes • 0 answers

Add new column to the existing table in Delta lake(Gen2 blob storage)

Curious to know: can we add a new column to an existing Delta Lake table stored in Gen2 blob storage? Based on the business use case, I will need to add 3 more columns to one of the tables in the delta lake. ALTER TABLE didn't work for…
chaitra k • 371 • 1 • 4 • 18

0 votes • 1 answer

Why Databricks Delta is copying unmodified rows even when merge doesn't update anything?

When I run the following query: merge into test_records t using ( select id, "senior developer" title, country from test_records where country = 'Brazil' ) u on t.id = u.id when matched and (t.id <> u.id) then -- this is just to be sure that nothing…
Jakub Troszok • 99,267 • 11 • 41 • 53

0 votes • 1 answer

Migrate (on-prem) SQL data to Azure with Databricks (JDBC)

Is it possible to use the JDBC connector https://docs.databricks.com/data/data-sources/sql-databases.html in order to get data from a local SQL Server (and export it to a delta lake)? Using: jdbcUrl = "jdbc:mysql://{0}:{1}/{2}".format(jdbcHostname,…
0 votes • 2 answers

ad-hoc slowly-changing dimensions materialization from external table of timestamped csvs in a data lake

Main question: how can I ephemerally materialize a slowly changing dimension type 2 from a folder of daily extracts, where each csv is one full extract of a table from a source system? Rationale: We're designing ephemeral data…
0 votes • 1 answer

Deltalake error- MERGE destination only supports Delta sources

I am trying to implement scd-type-2 in delta lake and I am getting the following error: "MERGE destination only supports Delta sources". Below is the code snippet I am executing. MERGE INTO stageviews.employeetarget t USING ( …
Kiran A • 179 • 1 • 2 • 7

0 votes • 2 answers

why pyspark find local hive metastore only from root directory?

I have a question on Hive metastore support for delta lake. I've defined a metastore on a standalone Spark session with the following configuration: pyspark --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"…
0 votes • 1 answer

Merge into delta table not working with java foreachbatch

I have created a delta table and now I'm trying to insert data into that table using foreachBatch(). I've followed this example. The only difference is that I'm using Java and not in a notebook, but I suppose that should not make any difference? My…
RudyVerboven • 1,204 • 1 • 14 • 31

0 votes • 2 answers

pyspark delta lake optimize - fails to parse SQL

I have a delta table created using Spark 3.x and delta 0.7.x: data = spark.range(0, 5) data.write.format("delta").mode("overwrite").save("tmp/delta-table") # add some more files data = spark.range(20,…
Georg Heiler • 16,916 • 36 • 162 • 292

0 votes • 0 answers

Databricks Delta Table Merge Example with Comparison with Update

I have found a ton of examples showing how to merge data using Databricks Delta Table Merge to load data into a SQL DB. However, I'm trying to find examples where loading data into a SQL DB without Databricks Delta Merge fails. This is because I'm…
Carltonp • 1,166 • 5 • 19 • 39

0 votes • 1 answer

List views in delta-lake

I'm trying to figure out a way to list the views in a given delta-lake database. Is there anything available that's equivalent to SQL Server's INFORMATION_SCHEMA, or something obvious that I'm missing? I have tried the following without any…
Manish Karki • 473 • 2 • 11

0 votes • 4 answers

Confusion About Delta Lake

I have tried to read a lot about Databricks Delta Lake. From what I understand, it adds ACID transactions to your data storage and accelerates query performance with a delta engine. If so, why do we need other data lakes which do not support ACID…
user13128577