Questions tagged [delta-lake]

Delta Lake is an open source storage layer that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture (see the PySpark sketch after this list).
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
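The time travel, update/delete, and merge features listed above translate directly into a few lines of code. Below is a minimal PySpark sketch (the Python API mirrors the Scala/Java one), assuming the delta-spark package is installed and a Delta table already exists at the hypothetical path /tmp/events with hypothetical columns id, amount, and event_type.

```python
# Minimal PySpark sketch of time travel, update/delete, and merge.
# Assumes delta-spark is on the classpath and that a Delta table already
# exists at the hypothetical path /tmp/events.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-features")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Time travel: read an earlier snapshot of the table by version number.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events")

# Updates and deletes through the DeltaTable API.
tbl = DeltaTable.forPath(spark, "/tmp/events")
tbl.delete("event_type = 'test'")                        # remove unwanted rows
tbl.update(condition="amount < 0", set={"amount": "0"})  # fix bad values in place

# Merge (upsert) a batch of new rows into the table.
updates = spark.createDataFrame([(1, 42)], ["id", "amount"])
(tbl.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```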
1226 questions
0
votes
2 answers

Need only updated quantity based on the current month using PySpark Delta loads in Databricks

I am loading the delta tables into an S3 delta lake. The table schema is product_code, date, quantity, crt_dt. I am getting 6 months of forecast data; for example, if this month is May 2022, I will get quantity data for May, June, July, Aug, Sept, Oct. What…
Krishna
  • 35
  • 5
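The question above is truncated, but a common pattern for this kind of forecast load is a merge keyed on (product_code, date) that rewrites quantity only when it actually changed. A minimal sketch, with both table paths hypothetical:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Incoming 6-month forecast and the existing target table; both paths are hypothetical.
incoming = spark.read.format("delta").load("s3://bucket/staging/forecast")
target = DeltaTable.forPath(spark, "s3://bucket/delta/forecast")

(target.alias("t")
    .merge(incoming.alias("s"),
           "t.product_code = s.product_code AND t.date = s.date")
    .whenMatchedUpdate(
        condition="t.quantity <> s.quantity",   # only rows whose quantity changed
        set={"quantity": "s.quantity", "crt_dt": "current_timestamp()"})
    .whenNotMatchedInsertAll()
    .execute())
```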
0
votes
1 answer

How to serve an as-of-time query on an append-only delta table having a single timestamp column using Spark

I need to crack a Spark query. Let's say I have my data in a delta table (tab) like: cust date acct_id f1 f2 source_date(dd/mm/yy:h) b1 1/10/22 acc1 x y 9/9/22:1 P.M b1 1/10/22 acc2 x y 9/9/22:1…
Kumar-Sandeep
  • 202
  • 1
  • 4
  • 14
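One way to serve an as-of-time lookup over an append-only table is a window that keeps, per key, the latest row whose source_date is not after the requested time. A sketch using the question's column names, assuming source_date is (or has been parsed to) a proper timestamp; the table name and query time are assumptions:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
as_of = "2022-09-09 13:00:00"  # hypothetical query time

df = spark.table("tab")

latest = (
    df.where(F.col("source_date") <= F.lit(as_of))
      .withColumn("rn", F.row_number().over(
          Window.partitionBy("cust", "acct_id")
                .orderBy(F.col("source_date").desc())))
      .where("rn = 1")
      .drop("rn")
)
latest.show()
```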
0
votes
0 answers

Loading delta tables in batches

Delta Lake tables follow the "WRITE ONCE READ MANY (WORM)" concept, which means the partitions are immutable. This makes sense and is usually the approach most other data warehouse products also take. This approach, however, has write explosion.…
Rajib Deb
  • 1,496
  • 11
  • 30
0
votes
2 answers

Impala Delta Lake Integration

I have set up Delta Lake in Cloudera. It works fine with Spark and Hive. I have searched the internet for a way to integrate Delta Lake with Impala but did not find much information. Can someone please answer if you have done the same? Update: Do not…
vijayinani
  • 2,548
  • 2
  • 26
  • 48
0
votes
0 answers

Expand record column from Delta Lake table in Power BI

Example code to create the table in Apache Spark's PySpark shell: from pyspark.sql.types import StructType,StructField, StringType, IntegerType, MapType data2 = [({ "firstname2": "John", "lastname2": "Smith" }, "36636","M",3000), ({…
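Power BI tends to struggle with Spark map columns, so one workaround is to flatten the map into plain columns before exposing the table. A sketch of that idea; the column names mirror the question's example, while the paths and output column names are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.format("delta").load("/mnt/delta/people")  # hypothetical path

# Pull individual keys out of the map column into ordinary string columns.
flat = df.select(
    F.col("properties").getItem("firstname2").alias("firstname"),
    F.col("properties").getItem("lastname2").alias("lastname"),
    "id", "gender", "salary",
)
flat.write.format("delta").mode("overwrite").save("/mnt/delta/people_flat")
```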
0
votes
0 answers

Star Schema with Delta Tables: No referential integrity?

In a typical star schema we have fact tables and dimension tables. Reading that article, it seems like Databricks suggests using delta tables for realizing the star schema. However, delta tables do not support referential integrity - see here and…
user3579222
  • 1,103
  • 11
  • 28
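Since Delta tables do not enforce foreign keys, a common substitute is a periodic validation query rather than a constraint. A sketch of an anti-join that surfaces fact rows with no matching dimension key; the table and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

facts = spark.table("sales_fact")     # hypothetical fact table
dims = spark.table("customer_dim")    # hypothetical dimension table

# Fact rows whose customer_id has no matching dimension row.
orphans = facts.join(dims, facts.customer_id == dims.customer_id, "left_anti")
if orphans.limit(1).count() > 0:
    raise ValueError("Referential integrity violated: fact rows without a dimension match")
```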
0
votes
0 answers

How to delete duplicates from a delta table which doesn't have any primary key

I want to delete identical rows from a delta table which doesn't have any primary key. How to achieve this scenario? If I have a delta table like below: I need to remove the duplicates by comparing the entire row, and I need a result like how to achieve…
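One way to de-duplicate a table without a primary key is to compare entire rows with dropDuplicates() and overwrite the table with the distinct result; depending on your Delta version you may prefer to materialise the result to a temporary location first. A minimal sketch with a hypothetical table path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.format("delta").load("/mnt/delta/my_table")  # hypothetical path
deduped = df.dropDuplicates()                                # whole-row comparison

(deduped.write.format("delta")
    .mode("overwrite")
    .save("/mnt/delta/my_table"))
```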
0
votes
0 answers

Why a _tmp_path_dir is created when we write a Spark dataframe as a delta table

I was just going in depth through the Spark Delta transaction log and the metrics that it stores. During that analysis, I noticed that whenever I am writing a Spark dataframe as a delta table (I am writing into Azure Gen2 storage), it is…
akhil pathirippilly
  • 920
  • 1
  • 7
  • 25
0
votes
1 answer

Scala: best way to update a DeltaTable after filling missing values

I have the following delta table +-+----+ |A|B | +-+----+ |1|10 | |1|null| |2|20 | |2|null| +-+----+ I want to fill the null values in column B based on the A column. This is what I came up with to do so: var df = spark.sql("select * from MyDeltaTable") val…
Haha
  • 973
  • 16
  • 43
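The question above uses Scala, but the same idea in PySpark (the language used for the other sketches here) is to derive the known B value per A and merge it back so only the null rows are rewritten. The table name comes from the question; the max() fill rule is an assumption:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

src = spark.table("MyDeltaTable")

# One non-null B per A (here: the max, as an illustrative fill rule).
fills = (src.where("B is not null")
            .groupBy("A")
            .agg(F.max("B").alias("B_fill")))

target = DeltaTable.forName(spark, "MyDeltaTable")
(target.alias("t")
    .merge(fills.alias("f"), "t.A = f.A AND t.B IS NULL")
    .whenMatchedUpdate(set={"B": "f.B_fill"})
    .execute())
```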
0
votes
1 answer

Convert Azure storage account into Databricks delta tables

I just linked an Azure storage account (Storage Gen2) with its underlying containers to my Databricks environment. Inside the storage account are two containers, each with some subdirectories. Inside the folders are .csv files. I have connected an…
brian_ds
  • 317
  • 4
  • 12
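Once the containers are reachable from Databricks, the usual conversion path is to read the CSVs, write them back out as Delta, and register a table on top. A sketch with hypothetical storage account, container, path, and table names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source and target locations.
csv_path = "abfss://container1@mystorageaccount.dfs.core.windows.net/subdir/"
delta_path = "abfss://container1@mystorageaccount.dfs.core.windows.net/delta/my_table"

df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(csv_path))
df.write.format("delta").mode("overwrite").save(delta_path)

spark.sql(f"CREATE TABLE IF NOT EXISTS my_table USING DELTA LOCATION '{delta_path}'")
```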
0
votes
1 answer

How to transfer a Spark table from one schema to another?

Is it possible to transfer a table from one schema to another schema in Spark Delta Lake, just like ALTER SCHEMA new_schema_name TRANSFER old_schema_name.table_name in SQL Server, without having to drop and create the table again? I am using Spark Delta…
Merin Nakarmi
  • 3,148
  • 3
  • 35
  • 42
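Whether a cross-schema move works without dropping the table depends entirely on the catalog: some Spark/metastore combinations accept ALTER TABLE ... RENAME TO across schemas, while others reject it, in which case a copy (CTAS, or DEEP CLONE on Databricks) is the usual fallback. A hedged sketch with hypothetical schema and table names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Attempt an in-place move of the table's metadata; may fail on catalogs
# that do not allow renaming across schemas.
spark.sql("ALTER TABLE old_schema.my_table RENAME TO new_schema.my_table")

# Fallback if the rename is rejected (note this is the drop-and-recreate
# route the question hopes to avoid):
# spark.sql("CREATE TABLE new_schema.my_table USING DELTA AS SELECT * FROM old_schema.my_table")
# spark.sql("DROP TABLE old_schema.my_table")
```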
0
votes
1 answer

Restore delta table to previous version after creating a copy of the version with the current date in Databricks

I want to restore a previous version of a delta table by first creating a copy of it in a folder named with the copy job's run date, and then restoring the delta table using that copy. Any suggestions here? Here's what I'm trying: version_timestamp =…
exploding_data
  • 317
  • 1
  • 14
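A sketch of the two steps the question describes: materialise a chosen version into a run-date folder using time travel, then roll the live table back with RESTORE. The paths and version number are hypothetical, and RESTORE requires a reasonably recent Delta Lake release:

```python
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

table_path = "/mnt/delta/my_table"   # hypothetical live table
backup_root = "/mnt/delta/backups"   # hypothetical backup location
version = 5                          # hypothetical version to keep a copy of

# 1. Copy the chosen version into a folder stamped with the run date.
snapshot = spark.read.format("delta").option("versionAsOf", version).load(table_path)
snapshot.write.format("delta").mode("overwrite").save(f"{backup_root}/{date.today()}")

# 2. Roll the live table back to that version.
spark.sql(f"RESTORE TABLE delta.`{table_path}` TO VERSION AS OF {version}")
```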
0
votes
0 answers

Apache Zeppelin Can't Write Delta Table with Spark

I'm attempting to run the following commands using the "%spark" interpreter in Apache Zeppelin: val data = spark.range(0, 5) data.write.format("delta").save("/tmp/delta-table") Which yields this output (truncated to omit repeat…
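The error output is truncated, but this class of failure often comes down to the Delta package and SQL extensions not being configured for the interpreter. On a plain SparkSession the needed configuration looks like the sketch below (in Zeppelin the same properties go into the %spark interpreter settings); the delta-core version shown is only an example and must match your Spark/Scala build:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("zeppelin-delta")
    # Example coordinates; pick the delta-core version matching your Spark build.
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

spark.range(0, 5).write.format("delta").mode("overwrite").save("/tmp/delta-table")
```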
0
votes
2 answers

Create 1GB partitions Spark SQL

I'm trying to split my data into 1 GB files when writing to S3 using Spark. The approach I tried was to calculate the size of the DeltaTable in GB (the define_coalesce function), round it, and use that number to write to S3: # Vaccum to leave 1 week of…
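A sketch of that sizing approach: read the table's current size from DESCRIBE DETAIL, work out how many ~1 GB chunks that is, and repartition to that count before writing. The paths are hypothetical, and the resulting files are only approximately 1 GB since on-disk size depends on compression:

```python
import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

source_path = "s3://bucket/delta/my_table"      # hypothetical
target_path = "s3://bucket/delta/my_table_1gb"  # hypothetical

# Current table size in bytes from the Delta table detail.
size_bytes = (spark.sql(f"DESCRIBE DETAIL delta.`{source_path}`")
                  .select("sizeInBytes").first()[0])
num_files = max(1, math.ceil(size_bytes / (1024 ** 3)))

(spark.read.format("delta").load(source_path)
     .repartition(num_files)
     .write.format("delta").mode("overwrite").save(target_path))
```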
0
votes
1 answer

Delta table update ONLY whenever any record gets updated

I have a Databricks delta table created on data lake storage which holds data as shown below. db_name table_name location table_format table_type …
Antony
  • 970
  • 3
  • 20
  • 46
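The excerpt above is truncated, but the title suggests a merge that rewrites a row only when something actually changed. A sketch of that pattern using the columns shown; the source and target table names are hypothetical:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

incoming = spark.table("staging_table_inventory")      # hypothetical source
target = DeltaTable.forName(spark, "table_inventory")  # hypothetical target

(target.alias("t")
    .merge(incoming.alias("s"),
           "t.db_name = s.db_name AND t.table_name = s.table_name")
    .whenMatchedUpdateAll(
        condition="t.location <> s.location OR t.table_format <> s.table_format "
                  "OR t.table_type <> s.table_type")   # update only changed rows
    .whenNotMatchedInsertAll()
    .execute())
```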