Questions tagged [delta-lake]

Delta Lake is an open source project that runs on top of Apache Spark. It provides ACID transactions, scalable metadata handling, time travel, a unified batch and streaming source and sink, and full compatibility with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
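The features listed above can be exercised with a short PySpark sketch. This is only a minimal illustration, assuming the delta-core package is on the Spark classpath; the table path and column name are made-up examples.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Assumes delta-core is available, e.g. spark-submit --packages io.delta:delta-core_2.12:<version>
spark = (SparkSession.builder
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/tmp/delta/events"  # hypothetical table location

# ACID write, stored as Parquet plus a transaction log
spark.range(0, 5).write.format("delta").mode("overwrite").save(path)

# Time travel: read the table as of an earlier version
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Updates and deletes via MERGE (upsert)
updates = spark.range(3, 8)
tbl = DeltaTable.forPath(spark, path)
(tbl.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Audit history recorded in the transaction log
tbl.history().show(truncate=False)
```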
1226 questions
1
vote
1 answer

How do I replace an existing Delta dataset in my data lake when I re-run the import job?

I have a dataset I need to periodically import to my data lake, replacing the current dataset. After I produce a dataframe I currently do: df.write.format("delta").save("dbfs:/mnt/defaultDatalake/datasets/datasources") But if I run the job again I get…
alonisser
  • 11,542
  • 21
  • 85
  • 139
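For the replace-on-reimport scenario above, one sketch (the dataframe name and path are taken from the question; the commented option is an assumption) is to overwrite the existing Delta location in place instead of writing to a fresh path:

```python
# Overwrite the existing Delta dataset in place; the transaction log keeps
# earlier versions available for time travel until they are vacuumed.
(df.write
   .format("delta")
   .mode("overwrite")
   # .option("overwriteSchema", "true")  # only if the new import changes the schema
   .save("dbfs:/mnt/defaultDatalake/datasets/datasources"))
```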
1
vote
2 answers

Get a spark Column from a spark Row

I am new to Scala and Spark, and so am struggling with a map function I am trying to create. The map function on the Dataframe operates on a Row (org.apache.spark.sql.Row). I have been loosely following this article. val rddWithExceptionHandling = filterValueDF.rdd.map…
Oliver
  • 35,233
  • 12
  • 66
  • 78
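The question above is about Scala, but the idea translates directly to a PySpark analogue: a map over df.rdd receives Row objects, and individual values are read from the Row by column name or position. The DataFrame contents below are stand-ins.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
filterValueDF = spark.createDataFrame([("a", 1), ("b", 2)], ["name", "value"])  # stand-in data

# Each element of df.rdd is a Row (pyspark.sql.Row); read values by name or index.
rdd_with_exception_handling = filterValueDF.rdd.map(
    lambda row: (row["name"], row["value"] * 2)
)
print(rdd_with_exception_handling.collect())
```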
1
vote
0 answers

Databricks delta table truncating column data containing '-'

I am using a delta table to load data from my dataframe. I am observing that column values which have a '-' in them are getting truncated. I tried to check the records in the dataframe that I am loading by loading them to a CSV file, and I…
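One way to narrow down where the '-' values are being cut is to write a small DataFrame containing such values straight to Delta and read it back, bypassing the CSV step. This is a sketch with assumed column names and path, and it assumes an existing spark session (e.g. a Databricks notebook):

```python
data = [("2021-01-15", "ABC-123"), ("2021-02-20", "XYZ-999")]
df = spark.createDataFrame(data, ["load_date", "code"])  # values containing '-'

df.write.format("delta").mode("overwrite").save("/tmp/delta/truncation_check")

# Read back directly from Delta and compare with the source rows.
spark.read.format("delta").load("/tmp/delta/truncation_check").show(truncate=False)
```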
1
vote
1 answer

Simba ODBC connection to delta table & read data from delta format tables using .Net C#

I am trying to read data from Delta format tables using C# via the Simba ODBC driver. Delta format table sample: https://docs.delta.io/latest/quick-start.html#-create-a-table&language-python I have downloaded and configured the Simba ODBC driver as instructed…
Rak
  • 196
  • 2
  • 9
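The question targets C#, but the same Simba/ODBC path can be sanity-checked from Python with pyodbc before wiring up the .NET client. The DSN name and table name below are assumptions; the DSN must point at an endpoint that can read the Delta table.

```python
import pyodbc

# "Databricks" is a hypothetical DSN configured for the Simba Spark ODBC driver.
conn = pyodbc.connect("DSN=Databricks", autocommit=True)
cursor = conn.cursor()

# The Delta table is queried through plain SQL, exactly as the C# client would do.
cursor.execute("SELECT * FROM default.events LIMIT 10")  # hypothetical table name
for row in cursor.fetchall():
    print(row)
conn.close()
```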
1
vote
1 answer

How do I upsert stateful events to a Delta Lake table with an existing streaming DF?

I am trying to upsert events from Kafka into a Delta Lake table. I do this as follows. New events are coming in fine, and values in the delta table are updated based on the merge condition. Now when I stop execution and then rerun the upsert script,…
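A common shape for this kind of upsert is a foreachBatch merge with a persistent checkpoint, so a restarted query resumes from its Kafka offsets instead of reprocessing everything. This is a sketch: the paths, key column, and the already-parsed streaming DataFrame are assumptions.

```python
from delta.tables import DeltaTable

target_path = "/mnt/delta/events"          # hypothetical Delta table
checkpoint = "/mnt/checkpoints/events"     # must be kept between runs

def upsert_batch(micro_batch_df, batch_id):
    tbl = DeltaTable.forPath(spark, target_path)
    (tbl.alias("t")
        .merge(micro_batch_df.alias("s"), "t.event_id = s.event_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(parsed_events_df.writeStream              # streaming DF already parsed from Kafka (assumed)
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", checkpoint)
    .outputMode("update")
    .start())
```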
1
vote
0 answers

Azure Databricks : Find the files used by databricks delta table from azure blob storage

I see in the 'Data' tab of Databricks that the number of files used by the delta table is 20000 (size: 1.6 TB). But the actual file count on the Azure blob storage where Delta stores the files is 13.5 million (size: 31 TB). The following checks were…
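The gap between files tracked by the current table version and files sitting in blob storage is usually old versions and uncommitted writes that have not been vacuumed yet. A sketch of checking this, with an assumed table path:

```python
from delta.tables import DeltaTable

path = "/mnt/datalake/my_delta_table"   # hypothetical table location

# Files referenced by the *current* version of the table.
spark.sql(f"DESCRIBE DETAIL delta.`{path}`").select("numFiles", "sizeInBytes").show()

# Files no longer referenced (old versions, failed writes) stay on storage
# until VACUUM removes them; the default retention is 7 days.
DeltaTable.forPath(spark, path).vacuum()   # uses the default retention period
```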
1
vote
1 answer

I receive the error "Cannot time travel Delta table to version X" whereas I can see the version X when looking at the history on Azure Databricks

I have a table in Delta Lake which has these tblproperties: I'm trying to access version 322, which was there last month. When I look at the history, I can see it: But when I try to access it with such a…
Nastasia
  • 557
  • 3
  • 22
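Time travel only works while the log entries and data files for that version still exist; with the default retention settings, old versions can stop being readable even though DESCRIBE HISTORY still lists them. A sketch, with an assumed table path, of reading version 322 and of the table properties that control retention:

```python
# Read the table as of a specific version (works only if the log entries and
# data files for that version have not been cleaned up / vacuumed yet).
df_v322 = (spark.read.format("delta")
           .option("versionAsOf", 322)
           .load("/mnt/delta/my_table"))      # hypothetical path

# Retention is governed by table properties such as these (values are examples).
spark.sql("""
  ALTER TABLE delta.`/mnt/delta/my_table`
  SET TBLPROPERTIES (
    'delta.logRetentionDuration' = 'interval 60 days',
    'delta.deletedFileRetentionDuration' = 'interval 60 days'
  )
""")
```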
1
vote
0 answers

Creating Gold Table on Delta lake

We have to create Gold tables on Delta from real-time data. The data coming from the backend is real-time, and I want to insert and update it in real time in the Gold table. Please provide any suggestions.
Aditya Verma
  • 201
  • 4
  • 14
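One common pattern for a continuously updated Gold table is to stream from the upstream Delta table and merge each micro-batch into the Gold table. This is a sketch; all paths, column names, and the aggregation itself are assumptions.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

silver_path = "/mnt/delta/silver/orders"               # hypothetical
gold_path = "/mnt/delta/gold/orders_by_customer"        # hypothetical

def merge_into_gold(batch_df, batch_id):
    agg = batch_df.groupBy("customer_id").agg(F.sum("amount").alias("amount"))
    gold = DeltaTable.forPath(spark, gold_path)
    (gold.alias("g")
         .merge(agg.alias("s"), "g.customer_id = s.customer_id")
         .whenMatchedUpdate(set={"amount": "g.amount + s.amount"})
         .whenNotMatchedInsert(values={"customer_id": "s.customer_id",
                                       "amount": "s.amount"})
         .execute())
    # Note: foreachBatch gives at-least-once semantics; deduplicate upstream
    # if exact totals matter.

(spark.readStream.format("delta").load(silver_path)
      .writeStream
      .foreachBatch(merge_into_gold)
      .option("checkpointLocation", "/mnt/checkpoints/gold_orders")
      .start())
```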
1
vote
1 answer

Databricks Delta files adding new partition causes old ones to be not readable

I have a notebook with which I am doing a history load, loading 6 months of data every time, starting with 2018-10-01. My delta file is partitioned by calendar_date. After the initial load I am able to read the delta file and look at the data just…
Krish
  • 390
  • 4
  • 15
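For a partitioned backfill like this, one approach is to append each 6-month slice, or to overwrite only the affected date range with replaceWhere so that earlier partitions are left untouched. A sketch with assumed dataframe, path, and date bounds:

```python
target = "/mnt/delta/facts"    # hypothetical Delta location

# Append a new slice; existing calendar_date partitions are not rewritten.
(slice_df.write.format("delta")
    .mode("append")
    .save(target))

# Or: replace only a specific date range, leaving older partitions readable.
(slice_df.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere",
            "calendar_date >= '2019-04-01' AND calendar_date < '2019-10-01'")
    .save(target))
```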
1
vote
0 answers

DataBricks - Ingest new data from parquet files into delta / delta lake table

I posted this question on the Databricks forum; I'll copy it below, but basically I need to ingest new data from parquet files into a delta table. I think I have to figure out how to use a merge statement effectively and/or use an ingestion tool. I'm…
Dudeman3000
  • 551
  • 8
  • 21
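A sketch of the merge-based ingestion mentioned above (the paths and the key column are assumptions): read the new parquet drop, then MERGE it into the Delta table so that re-running the ingestion is idempotent.

```python
from delta.tables import DeltaTable

new_files = spark.read.parquet("/mnt/raw/incoming/2021-06-01/")  # hypothetical drop folder
target = DeltaTable.forPath(spark, "/mnt/delta/events")          # hypothetical Delta table

(target.alias("t")
    .merge(new_files.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```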
1
vote
1 answer

Cannot read file from azure databricks

I am running this command to read data from Azure Databricks from a plain cluster (Hadoop not installed). spark-submit --packages io.delta:delta-core_2.12:0.7.0 \ --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \ --conf…
Srinivas
  • 2,010
  • 7
  • 26
  • 51
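The same configuration can be expressed as a self-contained PySpark session. The storage account, container, and key below are placeholders, and reading ADLS Gen2 (abfss) additionally requires the hadoop-azure jars on the classpath; this is a sketch, not a complete recipe.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Placeholder credentials for ADLS Gen2 access.
    .config("spark.hadoop.fs.azure.account.key.<storage-account>.dfs.core.windows.net",
            "<access-key>")
    .getOrCreate())

df = spark.read.format("delta").load(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/table")
df.show()
```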
1
vote
0 answers

Update existing rows with Spark in Delta Lake without affecting data written in the meantime by another job

Goal: I want to update an existing column in a Delta Lake table with a periodically run Spark job A while being able to run another periodic Spark job B that adds new data, without suffering data loss. Problem: As far as I know I need to use…
Daniel Müller
  • 426
  • 1
  • 5
  • 19
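Delta's optimistic concurrency only raises a conflict when two transactions touch the same files, so one way to keep job A's updates from clashing with job B's appends is to scope the update to an explicit predicate over data job B will not write into. A sketch with assumed column names and predicate:

```python
from delta.tables import DeltaTable

tbl = DeltaTable.forPath(spark, "/mnt/delta/events")   # hypothetical table

# Job A: update an existing column, restricted to a bounded, older slice of the table.
tbl.update(
    condition="event_date < '2021-01-01' AND status = 'pending'",   # assumed predicate
    set={"status": "'expired'"}
)
```

If job B can still append into the same partitions, job A may also need to catch and retry on ConcurrentAppendException.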
1
vote
1 answer

How to access gold table in delta lake for web dashboards and other?

I am using the Delta Lake OSS version 0.8.0. Let's assume we calculated aggregated data and cubes using the raw data and saved the results in a gold table using Delta Lake. My question is: is there a well-known way to access this gold table data…
Chris
  • 35
  • 6
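A minimal sketch of one common approach (the path, filter, and columns are assumptions): a small query layer reads the gold Delta table with Spark and hands a bounded, pre-aggregated slice to the dashboard, rather than the dashboard reading Delta files directly.

```python
# Assumes an existing `spark` session with Delta configured (OSS 0.8.0 or later).
gold = spark.read.format("delta").load("/mnt/delta/gold/sales_cube")  # hypothetical

# Serve a pre-aggregated, bounded slice to the web/dashboard layer.
pdf = (gold.where("region = 'EU'")
           .select("day", "revenue")
           .toPandas())
```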
1
vote
1 answer

Pushdown filter in case of spark structured Delta streaming

I have a use case where we need to stream an open source Delta table into multiple queries, filtered on one of the partition columns. E.g., given a Delta table partitioned on the year column. Streaming query…
Amit Joshi
  • 172
  • 1
  • 14
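A sketch of the filtered streams described above (the path and year values are assumptions): each readStream applies a where() on the partition column, which Delta can use to prune partitions when planning each micro-batch.

```python
source = "/mnt/delta/events"   # hypothetical Delta table partitioned by `year`

stream_2020 = (spark.readStream.format("delta")
    .load(source)
    .where("year = 2020"))

stream_2021 = (spark.readStream.format("delta")
    .load(source)
    .where("year = 2021"))

(stream_2021.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events_2021")
    .start("/mnt/delta/events_2021"))
```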
1
vote
0 answers

Is Delta Table well suited for continuously changing entities?

I have this legacy system which streams records into a queue (Azure Event Hubs) at the pace they are changed and, every 24h, another process reads all records and dumps them all into the stream. This mechanism lets any consumer recreate the data…
Igor Gatis
  • 4,648
  • 10
  • 43
  • 66
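A sketch of how such a change stream is often folded into a Delta table (the key, timestamp column, paths, and the parsed Event Hubs stream are all assumptions): keep only the newest record per key within each micro-batch, then MERGE it into the table.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F, Window

def apply_changes(batch_df, batch_id):
    # Keep only the newest version of each entity seen in this batch.
    w = Window.partitionBy("entity_id").orderBy(F.col("changed_at").desc())
    latest = (batch_df.withColumn("rn", F.row_number().over(w))
                      .where("rn = 1").drop("rn"))

    tbl = DeltaTable.forPath(spark, "/mnt/delta/entities")   # hypothetical table
    (tbl.alias("t")
        .merge(latest.alias("s"), "t.entity_id = s.entity_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(events_stream_df.writeStream          # stream parsed from Event Hubs (assumed)
    .foreachBatch(apply_changes)
    .option("checkpointLocation", "/mnt/checkpoints/entities")
    .start())
```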