Questions tagged [delta-lake]

Delta Lake is an open source storage layer that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
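
To illustrate a few of the features above (ACID writes, time travel, and updates/deletes), here is a minimal PySpark sketch. The delta-core version, table path, and column names are placeholders, and it assumes a Spark 3.x environment that can pull the package at session start.

    from pyspark.sql import SparkSession

    # The delta-core version is only an example; match it to your Spark/Scala build.
    spark = (SparkSession.builder
             .config("spark.jars.packages", "io.delta:delta-core_2.12:0.8.0")
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    # Import after the session exists so the delta Python module shipped with the jar is available.
    from delta.tables import DeltaTable

    path = "/tmp/delta/events"  # placeholder path

    # ACID write: create or overwrite a Delta table.
    spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"]) \
         .write.format("delta").mode("overwrite").save(path)

    # Time travel: read an earlier version of the table.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

    # Updates and deletes through the DeltaTable API.
    DeltaTable.forPath(spark, path).delete("id = 2")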
1226 questions
1 vote · 1 answer

Spark micro-batches from delta lake are very small

I'm reading appends to a delta table in Azure storage, and something strange is happening. The cluster is not under any real load, but the offset checkpoint advances very slowly. Looking at the individual offsets being written, the offset progress…
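
For reference, the Delta streaming source has rate-limiting options that bound how much data each micro-batch reads; a minimal sketch, with placeholder paths and an assumed Delta-enabled session:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # assumes a Delta-enabled session

    stream = (spark.readStream.format("delta")
              .option("maxFilesPerTrigger", 1000)   # cap on files per micro-batch
              .option("maxBytesPerTrigger", "1g")   # soft cap on bytes per micro-batch
              .load("/mnt/bronze/events"))          # placeholder source table

    (stream.writeStream
           .format("delta")
           .option("checkpointLocation", "/mnt/checkpoints/events")  # placeholder
           .start("/mnt/silver/events"))                             # placeholder sink table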
1 vote · 1 answer

delete from databricks delta table based on partition column

I have a delta table with 5 partitions, one of the partition columns being runid. When I try to delete using the runid, the underlying parquet files get deleted after running the vacuum command, but this does not remove the runid partition. If I run the…
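
For reference, a hedged sketch of a partition-level delete through the DeltaTable Python API followed by VACUUM; the table path and runid value are placeholders, and it assumes pyspark was launched with the delta-core package:

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()  # assumes a Delta-enabled session

    dt = DeltaTable.forPath(spark, "/mnt/delta/my_table")  # placeholder path

    # Logically delete one partition; Delta marks its files as removed in the transaction log.
    dt.delete("runid = 'run_2021_05_01'")  # placeholder value

    # Physically remove unreferenced files older than the retention period (default 7 days).
    dt.vacuum()

Note that VACUUM removes data files rather than directories, so an empty runid=... folder can still linger in storage afterwards.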
1 vote · 1 answer

How to access DeltaLake Tables without a Databricks Cluster running

I have created DeltaLake tables on a Databricks cluster, and I am able to access these tables from an external system/application. However, I need to keep the cluster up and running all the time to be able to access the table data. Question: Is it…
AmitG
  • 519
  • 6
  • 19
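
One approach that is often suggested (a sketch, not a confirmed answer): read the same Delta files from outside Databricks with open-source Spark plus the delta-core package. The package version, storage path, and credential setup below are placeholders:

    from pyspark.sql import SparkSession

    # Version is an example; storage credentials (e.g. for ADLS/Blob) must also be configured.
    spark = (SparkSession.builder
             .config("spark.jars.packages", "io.delta:delta-core_2.12:0.8.0")
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    # Read the Delta table directly from the underlying storage path, no Databricks cluster involved.
    df = spark.read.format("delta") \
              .load("abfss://container@account.dfs.core.windows.net/delta/my_table")  # placeholder
    df.show()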
1 vote · 0 answers

Table or view not found when reading existing delta table

I am new to Delta Lake. I was trying a simple example: create a dataframe from a csv, save it as a delta table, read it again. It works fine, and I can see the files are created in the default spark-warehouse folder. But next time I just want to read the…
Rajan
  • 31
  • 4
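
For reference, a hedged sketch of the two usual ways a saved Delta table is read back; "Table or view not found" typically means the read used a table name that was never registered in the metastore. File and table names below are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # assumes a Delta-enabled session

    df = spark.read.csv("people.csv", header=True)

    # Option 1: register the table in the catalog so spark.table()/SQL can find it later
    # (persistence across sessions requires a metastore, e.g. Hive support).
    df.write.format("delta").mode("overwrite").saveAsTable("people")
    spark.table("people").show()

    # Option 2: write to a path only; later reads must also go by path,
    # otherwise "Table or view not found" is the expected error.
    df.write.format("delta").mode("overwrite").save("spark-warehouse/people_by_path")
    spark.read.format("delta").load("spark-warehouse/people_by_path").show()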
1 vote · 1 answer

Databricks: running spark-submit job with external jar file, 'Failed to load class' error

I am trying to test the following library: https://tech.scribd.com/blog/2021/introducing-sql-delta-import.html. I want to copy data from my SQL database to a data lake in the Delta format. I have created a mount point, databases, and an empty delta…
Grevioos
  • 355
  • 5
  • 30
1 vote · 1 answer

Delta Table to Spark Streaming to Synapse Table in Azure Databricks

I need to write and synchronize our merged Delta tables to Azure Data Warehouse. We are trying to read the Delta table, but Spark Streaming doesn't allow write streaming to Synapse tables. Then I tried reading the DELTA tables in parquet file in…
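
Not a confirmed answer, but a commonly suggested pattern is to stream from the Delta table and push each micro-batch to Synapse with foreachBatch and the Databricks Synapse (sqldw) connector; the JDBC URL, tempDir, table names, and paths below are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # assumes a Databricks (Delta + Synapse connector) session

    def write_to_synapse(batch_df, batch_id):
        # Batch-write one micro-batch into a Synapse (SQL DW) table.
        (batch_df.write
                 .format("com.databricks.spark.sqldw")
                 .option("url", "jdbc:sqlserver://<server>.database.windows.net;database=<db>")  # placeholder
                 .option("tempDir", "abfss://tmp@<account>.dfs.core.windows.net/synapse")        # placeholder
                 .option("forwardSparkAzureStorageCredentials", "true")
                 .option("dbTable", "dbo.merged_table")                                          # placeholder
                 .mode("append")
                 .save())

    (spark.readStream.format("delta")
          .load("/mnt/delta/merged_table")                                     # placeholder path
          .writeStream
          .foreachBatch(write_to_synapse)
          .option("checkpointLocation", "/mnt/checkpoints/merged_to_synapse")  # placeholder
          .start())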
1 vote · 1 answer

Lineage not created when scanning Delta table in Azure Purview

A delta table is created from Databricks under the Azure Blob Storage container by providing its mount path. When it is scanned in Azure Purview using the Azure Blob Storage asset, the lineage is not generated. It would be helpful if any suggestion to…
1 vote · 2 answers

spark "delta" source not found

While using the kafka and delta_core dependencies in a Spark project I'm receiving the following warning: [WARNING] delta-core_2.12-0.7.0.jar, spark-sql-kafka-0-10_2.12-3.1.1.jar define 1 overlapping resources: [WARNING] -…
B. Bal
  • 121
  • 1
  • 2
  • 11
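
The overlapping-resources message itself is a build/shading warning, but for what it's worth, here is a hedged PySpark sketch of a session where the "delta" source is registered by pulling the packages at startup (the versions are only examples and must match your Spark/Scala build):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.jars.packages",
                     "io.delta:delta-core_2.12:0.7.0,"
                     "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1")
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    # If this fails with "Failed to find data source: delta", the source is not on the classpath
    # (for fat jars, the META-INF/services DataSourceRegister entries must be merged, not overwritten).
    spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta_smoke_test")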
1 vote · 1 answer

How to call vacuum with a dry run in Python for a Delta Lake

I can see an example on how to call the vacuum function for a Delta lake in python here. But how do I call it for only a dry run? In other words, what is the equivalent Python code for the following? %sql VACUUM delta.`dbfs:/mnt/` DRY RUN
MetallicPriest
  • 29,191
  • 52
  • 200
  • 356
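
For reference, a hedged sketch: the Python DeltaTable.vacuum() API does not appear to expose a dry-run flag, so one option is to issue the SQL form through spark.sql; the table path is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # assumes a Delta-enabled session

    # DRY RUN only lists the files that would be deleted; nothing is removed.
    files_to_delete = spark.sql("VACUUM delta.`dbfs:/mnt/<path-to-table>` DRY RUN")  # placeholder path
    files_to_delete.show(truncate=False)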
1 vote · 1 answer

How to use vacuum to delete old files created by compaction without losing ability to time travel

I am running the OPTIMIZE command for compaction. Now I want to delete the old files left over after compaction, but if I use the vacuum command, then I am not able to time travel. So, what is the better way to delete old files left over due to compaction…
Priyanshu
  • 111
  • 1
  • 12
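
For reference: VACUUM only removes files that fall outside the retention window, so time travel remains possible for versions inside that window. A hedged sketch with an assumed 168-hour (7-day) retention and a placeholder path:

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()  # assumes a Delta-enabled session

    dt = DeltaTable.forPath(spark, "/mnt/delta/my_table")  # placeholder path

    # Delete unreferenced files older than 168 hours; versions newer than that
    # stay time-travelable, older versions are given up.
    dt.vacuum(168)

    # Equivalent SQL form:
    # spark.sql("VACUUM delta.`/mnt/delta/my_table` RETAIN 168 HOURS")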
1 vote · 1 answer

How to fix corrupted delta lake table on AWS S3

I ended up manually deleting some delta lake entries (hosted on S3). Now my spark job is failing because the delta transaction logs point to files that do not exist in the file system. I came across this…
kk1957
  • 8,246
  • 10
  • 41
  • 63
1 vote · 2 answers

delta (OSS) + MERGE recreating the underlying parquet files though there is no change in the incoming data

I am using delta (OSS, version 0.7.0 with pyspark 3.0.1) and the table is getting modified (merged) every 5 mins by a micro-batch pyspark script. When I ran it for the first time it created 18 small files (numTargetRowsInserted -> 32560) and I used the…
Rak
  • 196
  • 2
  • 9
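
For reference, a hedged sketch of one commonly suggested mitigation: add a change-detection condition to the matched clause so rows (and therefore files) whose values have not changed are not rewritten; the paths, join key, and columns are placeholders:

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()  # assumes Delta 0.7+ on Spark 3.x

    target = DeltaTable.forPath(spark, "/mnt/delta/target")            # placeholder path
    updates = spark.read.format("delta").load("/mnt/delta/staging")    # placeholder source

    (target.alias("t")
           .merge(updates.alias("s"), "t.id = s.id")
           .whenMatchedUpdateAll(condition="t.value <> s.value")  # only rewrite rows that changed
           .whenNotMatchedInsertAll()
           .execute())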
1 vote · 1 answer

Set NOT NULL columns in koalas to_table

When I create a Delta table I can set some columns to be NOT NULL: CREATE TABLE [db_name.]table_name [(col_name1 col_type1 [NOT NULL], ...)] USING DELTA. Is there any way to set non-null columns with koalas.to_table?
kismsu
  • 1,049
  • 7
  • 22
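
Not a confirmed koalas feature, but one hedged workaround sketch: create the Delta table with the NOT NULL constraint in SQL first, then append into it with koalas' to_table; the database, table, and column names are placeholders:

    import databricks.koalas as ks
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # assumes a Delta-enabled session

    # Declare the constraint up front in SQL ...
    spark.sql("""
        CREATE TABLE IF NOT EXISTS db_name.table_name (
            id BIGINT NOT NULL,
            value STRING
        ) USING DELTA
    """)

    # ... then write into the existing table from koalas.
    kdf = ks.DataFrame({"id": [1, 2], "value": ["a", "b"]})
    kdf.to_table("db_name.table_name", format="delta", mode="append")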
1 vote · 0 answers

How to take a subset of parquet files to create a deltatable using deltalake_rs python library

I am using the deltalake 0.4.5 Python library to read .parquet files into a deltatable and then convert it into a pandas dataframe, following the instructions here: https://pypi.org/project/deltalake/. Here's the Python code to do this: from deltalake…
Rafiq
  • 1,380
  • 4
  • 16
  • 31
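
Unrelated to the specific error in the question, but for reference, a hedged sketch with the deltalake (delta-rs) Python package: DeltaTable exposes the list of active parquet files, so a subset of them can be read directly with pandas. The table path and the choice of files are placeholders, and the exact method names may differ between versions:

    import pandas as pd
    from deltalake import DeltaTable

    dt = DeltaTable("./my_delta_table")  # placeholder path

    # Whole table: Arrow -> pandas.
    full_df = dt.to_pyarrow_table().to_pandas()

    # Subset: take a few of the active parquet files recorded in the Delta log
    # (paths are relative to the table root) and read only those with pandas.
    some_files = dt.files()[:3]
    subset_df = pd.concat(pd.read_parquet(f"./my_delta_table/{f}") for f in some_files)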
1 vote · 0 answers

Deltalake Python library can't be installed on Amazon Linux 2 EC2 instance for Python 3.7 or 3.8

I am trying to install deltalake Python library (https://github.com/delta-io/delta-rs/tree/main/python) on an Amazon Linux 2 EC2 instance which was launched from amzn2-ami-hvm-2.0.20210318.0-x86_64-gp2 (us-east-1 region). It fails as showing…
Rafiq
  • 1,380
  • 4
  • 16
  • 31