Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
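
A minimal sketch of the core APIs listed above (ACID writes, time travel, and row-level deletes), assuming the delta-spark package is installed; the session config, path, and values are illustrative only, not the project's canonical quickstart.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Build a session with the Delta extensions enabled (illustrative config).
spark = (SparkSession.builder
         .appName("delta-quickstart")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/tmp/delta/events"  # hypothetical table location

# ACID writes: version 0, then version 1 of the same table.
spark.range(0, 5).write.format("delta").mode("overwrite").save(path)
spark.range(5, 10).write.format("delta").mode("overwrite").save(path)

# Time travel: read the first version back.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Updates and deletes through the DeltaTable API.
DeltaTable.forPath(spark, path).delete("id = 7")
```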
1226 questions
0
votes
1 answer

Change Databricks Delta table type to external

I have a MANAGED table in Delta format in Databricks and I wanted to change it to EXTERNAL to make sure that dropping the table would not affect the data. However, the following code did not change the table TYPE and just added a new table property. How…
finman
  • 73
  • 1
  • 1
  • 4
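
For the question above, a hedged sketch of one common workaround rather than a definitive answer: copy the managed table's data to an explicit storage path and register an external table over it. The table and path names are hypothetical, and `spark` is assumed to be the notebook's SparkSession.

```python
src_table = "my_db.managed_tbl"  # hypothetical managed table
external_path = "abfss://data@myaccount.dfs.core.windows.net/tables/managed_tbl"

# Copy the data out to an explicit location.
spark.table(src_table).write.format("delta").mode("overwrite").save(external_path)

# Register an external (unmanaged) table over that location; dropping it
# later removes only the metadata, not the underlying files.
spark.sql(f"""
    CREATE TABLE my_db.external_tbl
    USING DELTA
    LOCATION '{external_path}'
""")
```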
0
votes
1 answer

How to handle memory issues in Databricks/PySpark when writing data where a particular column contains a very large array in each record

I have a set of records with 10 columns. There is a column 'x' which contains an array of float values, and the length of the array can be very large (e.g. 25000000, 50000000, 80000000). I am trying to read the data and…
Divzz
  • 63
  • 6
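
For the large-array question above, a rough sketch of one mitigation: cap the number of rows per output file and spread the data over more partitions so no single task has to buffer too many giant records. Paths and numbers are illustrative assumptions.

```python
# Keep each output file to a small number of huge rows.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000)

df = spark.read.format("delta").load("/mnt/source/huge_array_table")  # hypothetical path

(df.repartition(400)               # more, smaller partitions -> smaller tasks
   .write.format("delta")
   .mode("overwrite")
   .save("/mnt/target/huge_array_table"))
```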
0
votes
1 answer

Is it OK to use a Delta table tracker based on parquet file name in Azure Databricks?

Today at work I saw a Delta Lake tracker based on file name. By delta tracker, I mean a function that defines whether a parquet file has already been ingested or not. The code would check which file (from the delta table) has not already been…
OrganicMustard
  • 1,158
  • 1
  • 15
  • 36
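
A sketch of the idea the question describes, with hypothetical paths: list the parquet files backing the current Delta snapshot and diff them against names already recorded as ingested. The ingestion-log table and its `file` column are assumptions.

```python
from pyspark.sql.functions import input_file_name

delta_path = "/mnt/lake/my_delta_table"  # hypothetical

# File names backing the current version of the table.
files_now = (spark.read.format("delta").load(delta_path)
             .select(input_file_name().alias("file"))
             .distinct())

# Hypothetical log of files already processed, one name per row in column `file`.
already_ingested = spark.read.format("delta").load("/mnt/lake/ingestion_log")

new_files = files_now.join(already_ingested, on="file", how="left_anti")
```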
0
votes
1 answer

Databricks Delta table write performance slow

I am running everything in Databricks (everything is under the assumption that the data is a PySpark dataframe). The scenario is: I have 40 files read as delta files in ADLS and then apply a series of transformation functions (through a loop, FIFO flow). At…
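
For the slow-write question above, a sketch under the assumption that the loop of transformations builds up a long lineage: checkpoint the intermediate result after each step so the final write does not replay the whole chain. The list of functions and the paths are hypothetical.

```python
spark.sparkContext.setCheckpointDir("/mnt/tmp/checkpoints")  # hypothetical dir

df = spark.read.format("delta").load("/mnt/source/input")
for transform in transformation_functions:   # hypothetical list of functions
    df = transform(df)
    df = df.checkpoint(eager=True)           # materialise and cut the lineage

df.write.format("delta").mode("overwrite").save("/mnt/target/output")
```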
0
votes
2 answers

How to write to a folder in an Azure Data Lake container using Delta?

How do I write to a folder in an Azure Data Lake container using Delta? When I run: write_mode = 'overwrite' write_format = 'delta' save_path = '/mnt/container-name/folder-name' df.write \ .mode(write_mode) \ …
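
A completed sketch of the write shown in the question, assuming the container is already mounted at /mnt/container-name and `df` is the dataframe to persist.

```python
write_mode = "overwrite"
write_format = "delta"
save_path = "/mnt/container-name/folder-name"

(df.write
   .format(write_format)
   .mode(write_mode)
   .save(save_path))

# Reading the folder back as a Delta table:
df_back = spark.read.format("delta").load(save_path)
```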
0
votes
2 answers

Performance Improvement in scala dataframe operations

I am using a table which is partitioned by the load_date column and optimized weekly with the Delta OPTIMIZE command as the source dataset for my use case. The table schema is as shown…
Antony
  • 970
  • 3
  • 20
  • 46
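
A sketch of the weekly maintenance the question mentions, assuming a Databricks runtime where OPTIMIZE/ZORDER are available; the table name and the ZORDER column are hypothetical, with load_date as the partition column per the question.

```python
spark.sql("""
    OPTIMIZE my_db.source_table
    WHERE load_date >= current_date() - INTERVAL 7 DAYS
    ZORDER BY (customer_id)
""")
```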
0
votes
1 answer

Azure Synapse Analytics

I have a need to connect to Synapse Analytics Serverless SQL Pool database using SQL Authentication. I created a serverless SQL Pool database and created a SQL User and provided db_owner access. Then created an external table below IF NOT EXISTS…
user13442358
  • 44
  • 1
  • 7
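
For the Synapse question above, a sketch of connecting to a serverless SQL pool with SQL authentication from Python, assuming pyodbc and the ODBC Driver 17 for SQL Server are installed; the server, database, user, and table names are placeholders.

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"
    "DATABASE=mydb;"
    "UID=sql_user;"
    "PWD=<password>;"
)
rows = conn.cursor().execute("SELECT TOP 10 * FROM dbo.my_external_table").fetchall()
```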
0
votes
1 answer

How to provide a condition for UPSERT in Databricks - PySpark

I have a table demo_table_one in which I want to upsert the following values data = [ (11111 , 'CA', '2020-01-26'), (11111 , 'CA', '2020-02-26'), (88888 , 'CA', '2020-06-10'), (88888 , 'CA', '2020-05-10'), (88888 , 'WA',…
John Constantine
  • 1,038
  • 4
  • 15
  • 43
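
A hedged sketch of a conditional Delta MERGE for the upsert above; only the table name demo_table_one and the sample tuples come from the question, while the column names, the dedup step, and the extra match condition are assumptions.

```python
from delta.tables import DeltaTable
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

data = [(11111, "CA", "2020-01-26"), (11111, "CA", "2020-02-26"),
        (88888, "CA", "2020-06-10"), (88888, "CA", "2020-05-10")]
updates = spark.createDataFrame(data, ["user_id", "state", "event_date"])

# Keep only the latest source row per key so MERGE sees one match per target row.
w = Window.partitionBy("user_id", "state").orderBy(col("event_date").desc())
latest = updates.withColumn("rn", row_number().over(w)).filter("rn = 1").drop("rn")

target = DeltaTable.forName(spark, "demo_table_one")
(target.alias("t")
 .merge(latest.alias("s"), "t.user_id = s.user_id AND t.state = s.state")
 .whenMatchedUpdateAll(condition="s.event_date > t.event_date")  # only newer rows win
 .whenNotMatchedInsertAll()
 .execute())
```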
0
votes
0 answers

Databricks Delta tables from json files: Ignore initial load when running COPY INTO

I am working with Databricks on AWS. I have mounted an S3 bucket as /mnt/bucket-name/. This bucket contains json files under the prefix jsons. I create a Delta table from these json files as follows: %python df =…
dwolfeu
  • 1,103
  • 2
  • 14
  • 21
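
A sketch of the COPY INTO load the question describes, with a hypothetical table name and the mount/prefix taken from the excerpt. Note that COPY INTO only skips files it loaded itself; files ingested by an earlier spark.read load are not tracked, which is the crux of the question.

```python
spark.sql("""
    COPY INTO my_db.events
    FROM '/mnt/bucket-name/jsons'
    FILEFORMAT = JSON
    FORMAT_OPTIONS ('multiLine' = 'true')
""")
```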
0
votes
1 answer

Append the "_commit_timestamp" Column to the Latest Data Version When Reading from a DeltaTable

I have data in a Delta lake WITHOUT a timestamp on each row to determine when that row was added/modified, but I only need rows that were created/modified after a specified date/time. I want the latest version of the data from the delta lake, but…
SCP
  • 23
  • 1
  • 5
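
A sketch based on the Change Data Feed, which is where the _commit_timestamp column comes from; it assumes CDF has been enabled on the table (so only changes made after enabling are captured) and uses hypothetical paths and dates.

```python
# Enable the change data feed on an existing table (one-time step).
spark.sql("ALTER TABLE delta.`/mnt/lake/my_table` "
          "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Rows changed since the cut-off, each carrying _commit_timestamp/_commit_version.
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingTimestamp", "2021-06-01 00:00:00")
           .load("/mnt/lake/my_table")
           .filter("_change_type IN ('insert', 'update_postimage')"))
```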
0
votes
1 answer

Copy data from on-premises SQL Server to Delta format in Azure Data Lake Storage Gen2

I have a copy activity that copies on-premises SQL data to parquet format in Data Lake Gen2, but I need to copy the SQL data to Delta format in the same data lake. I tried using a data flow to copy from parquet to Delta, but we have performance issues in…
Ramkumar
  • 57
  • 1
  • 8
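
For the question above, a sketch of one common pattern when the copy activity can only land Parquet: stage Parquet with the ADF copy activity, then convert to Delta in a small Spark step. The paths are placeholders.

```python
staging = "abfss://staging@myaccount.dfs.core.windows.net/sqlserver/table_a"
target = "abfss://curated@myaccount.dfs.core.windows.net/delta/table_a"

(spark.read.parquet(staging)      # files landed by the ADF copy activity
      .write.format("delta")
      .mode("append")
      .save(target))
```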
0
votes
1 answer

How to Create External Tables (similar to Hive) on Azure Delta Lake

How do I create external Delta tables on Azure Data Lake storage? I am currently working on a migration project (from Pyspark/Hadoop to Azure). I couldn't find much documentation around creating unmanaged tables in Azure Delta Lake. Here is a…
Sidd
  • 261
  • 1
  • 6
  • 24
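
A sketch of registering an unmanaged (external) Delta table over data in ADLS, analogous to a Hive external table; the abfss path, database, and table names are placeholders, and `df` is assumed to hold the data.

```python
location = "abfss://data@mystorageaccount.dfs.core.windows.net/delta/sales"

# Write (or reuse existing) Delta data at the external location ...
df.write.format("delta").mode("overwrite").save(location)

# ... then register a table whose metadata simply points at that location.
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS my_db.sales
    USING DELTA
    LOCATION '{location}'
""")
```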
0
votes
0 answers

java.lang.NoClassDefFoundError: org/apache/spark/sql/connector/catalog/TableProvider

I am trying to alter the delta table using a Spark SQL query in Java code, but when I am executing my jar on the cluster I am getting an error like: org.spark_project.guava.util.concurrent.ExecutionError: java.lang.NoClassDefFoundError:…
Yogesh
  • 89
  • 2
  • 6
0
votes
1 answer

Databricks - Find the delta tables which were recently updated

I am working on a use case in Databricks - GCP, where I am trying to find the delta tables in a schema/database in Databricks which were updated in the last day. I used DESCRIBE DETAIL and ran this command in a loop for all the table…
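
A sketch of the DESCRIBE DETAIL loop mentioned above, assuming a database name and that every listed table is a Delta table (non-Delta tables would need a try/except or a filter).

```python
from datetime import datetime, timedelta

cutoff = datetime.now() - timedelta(days=1)
recently_updated = []
for t in spark.catalog.listTables("my_db"):          # hypothetical database
    detail = spark.sql(f"DESCRIBE DETAIL my_db.{t.name}").collect()[0]
    if detail["lastModified"] and detail["lastModified"] >= cutoff:
        recently_updated.append(t.name)

print(recently_updated)
```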
0
votes
2 answers

PySpark: how to get a specific file based on date to load into a dataframe from a list of files

I'm trying to load a specific file from a group of files. Example: I have files in HDFS in the format app_name_date.csv, and I have hundreds of files like this in a directory. I want to load a csv file into a dataframe based on date. dataframe1 =…
v33ran00l
  • 31
  • 3
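
A sketch for the question above: build the file name from the wanted date and read just that file. The date format, base path, and header options are assumptions.

```python
load_date = "2021-09-01"
path = f"hdfs:///data/app_logs/app_name_{load_date}.csv"

dataframe1 = (spark.read
              .option("header", "true")
              .option("inferSchema", "true")
              .csv(path))

# A glob pattern also works if several apps share the same date:
# spark.read.csv(f"hdfs:///data/app_logs/*_{load_date}.csv", header=True)
```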