Questions tagged [delta-lake]

Delta Lake is an open source project that adds ACID transactions on top of Apache Spark. It provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
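
For orientation before the questions below, here is a minimal PySpark sketch of the basic workflow those features describe (transactional write, batch read, time travel). It assumes a Delta-enabled Spark session (e.g. Databricks, or a local session built with the delta-spark package) and a hypothetical path /tmp/delta/events:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed to be Delta-enabled

# Transactional write of a DataFrame in Delta format (Parquet files + _delta_log)
spark.range(0, 5).write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Batch read of the current snapshot
current = spark.read.format("delta").load("/tmp/delta/events")

# Time travel: read an earlier snapshot by version number
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")
```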
1226 questions
10
votes
2 answers

A schema mismatch detected when writing to the Delta table - Azure Databricks

I am trying to load "small_radio_json.json" into a Delta Lake table. After this code I would create the table. When I try to create the Delta table, I get the error "A schema mismatch detected when writing to the Delta table." It may be related to the partition of the …
Kenny_I
  • 2,001
  • 5
  • 40
  • 94
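
Without the full question, a common way this error is handled is by telling Delta how to reconcile the schemas on write. A sketch, assuming a DataFrame df and a hypothetical table path /mnt/delta/small_radio:

```python
# Append new columns by letting Delta evolve the table schema
df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/mnt/delta/small_radio")

# Or replace the table with an entirely different schema
df.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save("/mnt/delta/small_radio")
```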
10
votes
1 answer

Apache Spark: impact of repartitioning, sorting and caching on a join

I am exploring Spark's behavior when joining a table to itself. I am using Databricks. My dummy scenario is: read an external table as dataframe A (the underlying files are in delta format); define dataframe B as dataframe A with only certain columns…
Dawid
  • 652
  • 1
  • 11
  • 24
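
A sketch of the scenario the question describes, with hypothetical column names id and value and a hypothetical table path; it repartitions on the join key, sorts within partitions and caches before the self-join, then checks the plan:

```python
# Dataframe A: external table backed by delta files
dfA = spark.read.format("delta").load("/mnt/delta/external_table")

prepped = (dfA
    .repartition(200, "id")          # hash-partition both join sides on the key
    .sortWithinPartitions("id")
    .cache())                        # avoid re-reading the files for each branch

dfB = prepped.select("id", "value")  # dataframe B: subset of A's columns

joined = prepped.join(dfB, on="id", how="inner")
joined.explain()                     # inspect whether a shuffle is reused or avoided
```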
10
votes
6 answers

Can AWS Glue crawl Delta Lake table data?

According to the article by Databricks, it is possible to integrate delta lake with AWS Glue. However, I am not sure if it is possible to do it outside of the Databricks platform. Has anyone done that? Also, is it possible to add Delta Lake…
gorros
  • 1,411
  • 1
  • 18
  • 29
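
One pattern sometimes used outside Databricks is to generate a symlink manifest so catalog-driven engines (Glue, Athena, Presto) only see the files belonging to the current table version. A sketch with a hypothetical S3 path:

```python
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "s3://my-bucket/delta/events")  # hypothetical path

# Writes _symlink_format_manifest/ alongside the table; external engines can then
# be pointed at (or crawl) the manifest instead of the raw Parquet files
deltaTable.generate("symlink_format_manifest")
```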
10
votes
4 answers

Trouble when writing the data to Delta Lake in Azure databricks (Incompatible format detected)

I need to read a dataset into a DataFrame and then write the data to Delta Lake. But I get the following exception: AnalysisException: 'Incompatible format detected.\n\nYou are trying to write to…
Themis
  • 139
  • 1
  • 1
  • 8
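
This error typically appears when a directory that already contains a Delta table (i.e. a _delta_log/ folder) is read or written as plain Parquet, or vice versa. A sketch of the consistent-format version, with a hypothetical path:

```python
path = "/mnt/delta/dataset"  # hypothetical path that already holds a Delta table

# Read and write with the "delta" format rather than "parquet"
df = spark.read.format("delta").load(path)
df.write.format("delta").mode("append").save(path)
```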
9
votes
2 answers

How to configure Spark to adjust the number of output partitions after a join or groupby?

I know you can set spark.sql.shuffle.partitions and spark.sql.adaptive.advisoryPartitionSizeInBytes. The former will not work with adaptive query execution, and the latter only works for the first shuffle for some reason, after which it just uses…
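
For reference, these are the knobs involved; a sketch that sets the fixed shuffle-partition count together with the AQE coalescing settings that take over when adaptive execution is enabled:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.sql.shuffle.partitions", "200")                      # used when AQE is off
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")    # AQE picks the count
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")  # target size per partition
    .getOrCreate())
```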
9
votes
2 answers

Delta table merge on multiple columns

I have a table whose primary key spans multiple columns, so I need to perform the merge logic on multiple columns: DeltaTable.forPath(spark, "path") .as("data") .merge( finalDf1.as("updates"), "data.column1 = updates.column1 AND…
Tony
  • 301
  • 3
  • 10
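
The Scala snippet in the question translates directly; a PySpark sketch of a merge whose join condition spans several key columns (column2 and the path are assumptions):

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "path")

(target.alias("data")
    .merge(
        finalDf1.alias("updates"),
        # composite primary key: AND together one equality per key column
        "data.column1 = updates.column1 AND data.column2 = updates.column2")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```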
9
votes
1 answer

Processing upserts on a large number of partitions is not fast enough

The Problem: We have a Delta Lake setup on top of ADLS Gen2 with the following tables: bronze.DeviceData, partitioned by arrival date (Partition_Date); silver.DeviceData, partitioned by event date and hour (Partition_Date and Partition_Hour). We…
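
A common mitigation is to include the partition columns in the merge condition so only the touched partitions are rewritten. A sketch with hypothetical key columns (DeviceId, EventTime) and an assumed updates DataFrame:

```python
from delta.tables import DeltaTable

silver = DeltaTable.forName(spark, "silver.DeviceData")

(silver.alias("t")
    .merge(
        updates.alias("s"),
        # partition columns first, so the merge prunes to the affected partitions
        "t.Partition_Date = s.Partition_Date AND t.Partition_Hour = s.Partition_Hour "
        "AND t.DeviceId = s.DeviceId AND t.EventTime = s.EventTime")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```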
9
votes
1 answer

What is the pyspark equivalent of MERGE INTO for databricks delta lake?

The databricks documentation describes how to do a merge for delta-tables. In SQL the syntax is MERGE INTO [db_name.]target_table [AS target_alias] USING [db_name.]source_table [AS source_alias] ON <merge_condition> [ WHEN…
Erik
  • 755
  • 1
  • 5
  • 17
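
A sketch of the DeltaTable Python API that corresponds to the SQL MERGE INTO statement; the table name, key column and source_df are assumptions:

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "db_name.target_table")

(target.alias("target")
    .merge(source_df.alias("source"), "target.id = source.id")  # ON <merge_condition>
    .whenMatchedUpdateAll()       # WHEN MATCHED THEN UPDATE SET *
    .whenNotMatchedInsertAll()    # WHEN NOT MATCHED THEN INSERT *
    .execute())
```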
8
votes
2 answers

Insert or Update a delta table from a dataframe in Pyspark

I have a pyspark dataframe from which I initially created a delta table using the code below: df.write.format("delta").saveAsTable("events") Now, since the above dataframe is populated with data on a daily basis in my requirement, for…
Tushaar
  • 167
  • 1
  • 1
  • 9
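
After the initial saveAsTable, the daily loads can be upserted with a merge instead of rewriting the table. A sketch assuming event_id is the key column of the daily DataFrame df:

```python
from delta.tables import DeltaTable

events = DeltaTable.forName(spark, "events")  # the table created with saveAsTable

(events.alias("e")
    .merge(df.alias("d"), "e.event_id = d.event_id")
    .whenMatchedUpdateAll()      # update rows that already exist
    .whenNotMatchedInsertAll()   # insert the new ones
    .execute())
```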
8
votes
5 answers

How to drop duplicates in Delta Table?

There is a function to delete data from a Delta Table: deltaTable = DeltaTable.forPath(spark, "/data/events/") deltaTable.delete(col("date") < "2017-01-01") But is there also a way to drop duplicates somehow? Like deltaTable.dropDuplicates()... I
Lossa
  • 341
  • 2
  • 3
  • 9
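
The DeltaTable API has no dropDuplicates(); one simple (if heavy) approach is to read the table, deduplicate, and overwrite it in place. A sketch with hypothetical key columns:

```python
path = "/data/events/"

deduped = (spark.read.format("delta").load(path)
           .dropDuplicates(["eventId", "date"]))   # hypothetical key columns

(deduped.write.format("delta")
    .mode("overwrite")        # rewrites the whole table as one transaction
    .save(path))
```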
8
votes
2 answers

spark delta overwrite a specific partition

So I have a dataframe which has a column, file_date. For a given run, the dataframe has data for only one unique file_date. For instance, in a run, let us assume there are about 100 records with a file_date of 2020_01_21. I am writing this…
SWDeveloper
  • 319
  • 1
  • 4
  • 14
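
Assuming the table is partitioned by file_date, an overwrite with replaceWhere rewrites only that one partition. A sketch with a hypothetical table path:

```python
(df.write.format("delta")
    .mode("overwrite")
    # replace only the rows matching the predicate, leave other partitions intact
    .option("replaceWhere", "file_date = '2020_01_21'")
    .save("/mnt/delta/my_table"))
```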
8
votes
1 answer

Create index for tables within Delta Lake

I'm new to Delta Lake, but I want to create some indexes for fast retrieval on some tables in Delta Lake. Based on the docs, the closest option is to create Data Skipping and then index the skipped portion: create DATASKIPPING…
user12264392
  • 81
  • 1
  • 1
  • 3
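
Delta has no secondary indexes; the usual substitute on Databricks (and Delta Lake 2.0+) is Z-ordering, which clusters data files so the built-in data-skipping statistics can prune files on the chosen columns. A sketch with a hypothetical table and columns:

```python
# Rewrites the data files clustered on these columns; subsequent filters on them
# can skip whole files using the min/max statistics in the transaction log
spark.sql("OPTIMIZE my_db.my_table ZORDER BY (customer_id, event_date)")
```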
7
votes
2 answers

Databricks Delta Live Tables - Apply Changes from delta table

I am working with Databricks Delta Live Tables, but have some problems with upserting some tables upstream. I know it is quite a long text below, but I tried to describe my problem as clearly as possible. Let me know if some parts are not clear. I…
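
For context, the usual upsert mechanism in Delta Live Tables is apply_changes; a sketch that only runs inside a Databricks DLT pipeline, with hypothetical table and column names:

```python
import dlt

# Target streaming table that receives the upserts
dlt.create_streaming_table("silver_customers")

# Apply CDC-style changes from an upstream table into the target
dlt.apply_changes(
    target="silver_customers",
    source="bronze_customers",     # hypothetical upstream table
    keys=["customer_id"],          # hypothetical primary key
    sequence_by="updated_at",      # column that orders the change events
    stored_as_scd_type=1)
```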
7
votes
5 answers

Change the datatype of a column in delta table

Is there a SQL command that I can easily use to change the datatype of an existing column in a Delta table? I need to change the column datatype from BIGINT to STRING. Below is the SQL command I'm trying to use, but no luck. %sql ALTER TABLE…
chaitra k
  • 371
  • 1
  • 4
  • 18
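
ALTER TABLE ... CHANGE COLUMN cannot turn BIGINT into STRING directly; the common workaround is to cast the column and overwrite the table with overwriteSchema. A PySpark sketch with hypothetical table and column names:

```python
from pyspark.sql import functions as F

df = spark.table("my_db.my_table").withColumn("my_col", F.col("my_col").cast("string"))

(df.write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")  # allow the column's type to change
    .saveAsTable("my_db.my_table"))
```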
7
votes
2 answers

On-premise delta lake

Is it possible to implement a delta lake on-premise? If yes, what software/tools need to be installed? I'm trying to implement a delta lake on-premise to analyze some log files and database tables. My current machine is loaded with ubuntu, apache…
Ajoy
  • 113
  • 1
  • 1
  • 10
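
Delta Lake is just a library plus files, so it runs on-premise with open-source Spark; a sketch of a local setup using the delta-spark pip package (the package pairing and the output path are the assumptions here):

```python
# pip install pyspark delta-spark   (the two versions must be compatible)
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (SparkSession.builder
    .appName("delta-on-prem")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog"))

spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Delta tables are ordinary files, so a local or HDFS path works on-premise
spark.range(5).write.format("delta").mode("overwrite").save("/data/delta/demo")
```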