Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
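
A minimal sketch of the core APIs listed above (ACID writes, time travel, and row-level deletes), assuming the delta-spark package is installed; the session config, path, and values are illustrative only, not the project's canonical quickstart.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Build a session with the Delta extensions enabled (illustrative config).
spark = (SparkSession.builder
         .appName("delta-quickstart")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/tmp/delta/events"  # hypothetical table location

# ACID writes: version 0, then version 1 of the same table.
spark.range(0, 5).write.format("delta").mode("overwrite").save(path)
spark.range(5, 10).write.format("delta").mode("overwrite").save(path)

# Time travel: read the first version back.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Updates and deletes through the DeltaTable API.
DeltaTable.forPath(spark, path).delete("id = 7")
```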
1226 questions
0
votes
1 answer

Change Databricks Delta table type to external

I have a MANAGED table in Delta format in Databricks and I wanted to change it to EXTERNAL to make sure that dropping the table would not affect the data. However, the following code did not change the table TYPE and just added a new table property. How…
finman
  • 73
  • 1
  • 1
  • 4
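
For the question above, a hedged sketch of one common workaround rather than a definitive answer: copy the managed table's data to an explicit storage path and register an external table over it. The table and path names are hypothetical, and `spark` is assumed to be the notebook's SparkSession.

```python
src_table = "my_db.managed_tbl"  # hypothetical managed table
external_path = "abfss://data@myaccount.dfs.core.windows.net/tables/managed_tbl"

# Copy the data out to an explicit location.
spark.table(src_table).write.format("delta").mode("overwrite").save(external_path)

# Register an external (unmanaged) table over that location; dropping it
# later removes only the metadata, not the underlying files.
spark.sql(f"""
    CREATE TABLE my_db.external_tbl
    USING DELTA
    LOCATION '{external_path}'
""")
```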
0
votes
1 answer

How to handle memory issues in Databricks/PySpark when writing data where a particular column contains a very large array in each record

I have a set of records with 10 columns. There is a column 'x' which contains an array of float values, and the length of the array can be very large (e.g. 25000000, 50000000, 80000000). I am trying to read the data and…
Divzz
  • 63
  • 6
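
For the large-array question above, a rough sketch of one mitigation: cap the number of rows per output file and spread the data over more partitions so no single task has to buffer too many giant records. Paths and numbers are illustrative assumptions.

```python
# Keep each output file to a small number of huge rows.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000)

df = spark.read.format("delta").load("/mnt/source/huge_array_table")  # hypothetical path

(df.repartition(400)               # more, smaller partitions -> smaller tasks
   .write.format("delta")
   .mode("overwrite")
   .save("/mnt/target/huge_array_table"))
```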
0
votes
1 answer

Is it OK to use a Delta table tracker based on parquet file name in Azure Databricks?

Today at work I saw a Delta Lake tracker based on file name. By delta tracker, I mean a function that defines whether a parquet file has already been ingested or not. The code would check which file (from the delta table) has not already been…
OrganicMustard
  • 1,158
  • 1
  • 15
  • 36
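
A sketch of the idea the question describes, with hypothetical paths: list the parquet files backing the current Delta snapshot and diff them against names already recorded as ingested. The ingestion-log table and its `file` column are assumptions.

```python
from pyspark.sql.functions import input_file_name

delta_path = "/mnt/lake/my_delta_table"  # hypothetical

# File names backing the current version of the table.
files_now = (spark.read.format("delta").load(delta_path)
             .select(input_file_name().alias("file"))
             .distinct())

# Hypothetical log of files already processed, one name per row in column `file`.
already_ingested = spark.read.format("delta").load("/mnt/lake/ingestion_log")

new_files = files_now.join(already_ingested, on="file", how="left_anti")
```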
0
votes
1 answer

Databricks Delta table write performance slow

I am running everything in Databricks (everything is under the assumption that the data is a PySpark dataframe). The scenario is: I have 40 files read as delta files in ADLS and then apply a series of transformation functions (through a loop, FIFO flow). At…
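
For the slow-write question above, a sketch under the assumption that the loop of transformations builds up a long lineage: checkpoint the intermediate result after each step so the final write does not replay the whole chain. The list of functions and the paths are hypothetical.

```python
spark.sparkContext.setCheckpointDir("/mnt/tmp/checkpoints")  # hypothetical dir

df = spark.read.format("delta").load("/mnt/source/input")
for transform in transformation_functions:   # hypothetical list of functions
    df = transform(df)
    df = df.checkpoint(eager=True)           # materialise and cut the lineage

df.write.format("delta").mode("overwrite").save("/mnt/target/output")
```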
0
votes
2 answers

How to write to a folder in an Azure Data Lake container using Delta?

How do I write to a folder in an Azure Data Lake container using Delta? When I run: write_mode = 'overwrite' write_format = 'delta' save_path = '/mnt/container-name/folder-name' df.write \ .mode(write_mode) \ …
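
A completed sketch of the write shown in the question, assuming the container is already mounted at /mnt/container-name and `df` is the dataframe to persist.

```python
write_mode = "overwrite"
write_format = "delta"
save_path = "/mnt/container-name/folder-name"

(df.write
   .format(write_format)
   .mode(write_mode)
   .save(save_path))

# Reading the folder back as a Delta table:
df_back = spark.read.format("delta").load(save_path)
```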
0
votes
2 answers

Performance Improvement in scala dataframe operations

I am using a table which is partitioned by the load_date column and optimized weekly with the Delta OPTIMIZE command as the source dataset for my use case. The table schema is as shown…
Antony
  • 970
  • 3
  • 20
  • 46
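
A sketch of the weekly maintenance the question mentions, assuming a Databricks runtime where OPTIMIZE/ZORDER are available; the table name and the ZORDER column are hypothetical, with load_date as the partition column per the question.

```python
spark.sql("""
    OPTIMIZE my_db.source_table
    WHERE load_date >= current_date() - INTERVAL 7 DAYS
    ZORDER BY (customer_id)
""")
```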
0
votes
1 answer

Azure Synapse Analytics

I have a need to connect to Synapse Analytics Serverless SQL Pool database using SQL Authentication. I created a serverless SQL Pool database and created a SQL User and provided db_owner access. Then created an external table below IF NOT EXISTS…
user13442358
  • 44
  • 1
  • 7
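
For the Synapse question above, a sketch of connecting to a serverless SQL pool with SQL authentication from Python, assuming pyodbc and the ODBC Driver 17 for SQL Server are installed; the server, database, user, and table names are placeholders.

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"
    "DATABASE=mydb;"
    "UID=sql_user;"
    "PWD=<password>;"
)
rows = conn.cursor().execute("SELECT TOP 10 * FROM dbo.my_external_table").fetchall()
```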
0
votes
1 answer

How to provide a condition for UPSERT in Databricks - PySpark

I have a table demo_table_one in which I want to upsert the following values data = [ (11111 , 'CA', '2020-01-26'), (11111 , 'CA', '2020-02-26'), (88888 , 'CA', '2020-06-10'), (88888 , 'CA', '2020-05-10'), (88888 , 'WA',…
John Constantine
  • 1,038
  • 4
  • 15
  • 43
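
A hedged sketch of a conditional Delta MERGE for the upsert above; only the table name demo_table_one and the sample tuples come from the question, while the column names, the dedup step, and the extra match condition are assumptions.

```python
from delta.tables import DeltaTable
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

data = [(11111, "CA", "2020-01-26"), (11111, "CA", "2020-02-26"),
        (88888, "CA", "2020-06-10"), (88888, "CA", "2020-05-10")]
updates = spark.createDataFrame(data, ["user_id", "state", "event_date"])

# Keep only the latest source row per key so MERGE sees one match per target row.
w = Window.partitionBy("user_id", "state").orderBy(col("event_date").desc())
latest = updates.withColumn("rn", row_number().over(w)).filter("rn = 1").drop("rn")

target = DeltaTable.forName(spark, "demo_table_one")
(target.alias("t")
 .merge(latest.alias("s"), "t.user_id = s.user_id AND t.state = s.state")
 .whenMatchedUpdateAll(condition="s.event_date > t.event_date")  # only newer rows win
 .whenNotMatchedInsertAll()
 .execute())
```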
0
votes
0 answers

Databricks Delta tables from json files: Ignore initial load when running COPY INTO

I am working with Databricks on AWS. I have mounted an S3 bucket as /mnt/bucket-name/. This bucket contains json files under the prefix jsons. I create a Delta table from these json files as follows: %python df =…
dwolfeu
  • 1,103
  • 2
  • 14
  • 21
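
A sketch of the COPY INTO load the question describes, with a hypothetical table name and the mount/prefix taken from the excerpt. Note that COPY INTO only skips files it loaded itself; files ingested by an earlier spark.read load are not tracked, which is the crux of the question.

```python
spark.sql("""
    COPY INTO my_db.events
    FROM '/mnt/bucket-name/jsons'
    FILEFORMAT = JSON
    FORMAT_OPTIONS ('multiLine' = 'true')
""")
```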
0
votes
1 answer

Append the "_commit_timestamp" Column to the Latest Data Version When Reading from a DeltaTable

I have data in a Delta lake WITHOUT a timestamp on each row to determine when that row was added/modified, but I only need rows that were created/modified after a specified date/time. I want the latest version of the data from the delta lake, but…
SCP
  • 23
  • 1
  • 5
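
A sketch based on the Change Data Feed, which is where the _commit_timestamp column comes from; it assumes CDF has been enabled on the table (so only changes made after enabling are captured) and uses hypothetical paths and dates.

```python
# Enable the change data feed on an existing table (one-time step).
spark.sql("ALTER TABLE delta.`/mnt/lake/my_table` "
          "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Rows changed since the cut-off, each carrying _commit_timestamp/_commit_version.
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingTimestamp", "2021-06-01 00:00:00")
           .load("/mnt/lake/my_table")
           .filter("_change_type IN ('insert', 'update_postimage')"))
```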
0
votes
1 answer

Copy data from on-premises SQL Server to Delta format in Azure Data Lake Storage Gen2

I have a copy activity that copies on-premises SQL data to parquet format in Data Lake Gen2, but I need to copy the SQL data to Delta format in the same data lake. I tried using a data flow to copy from parquet to Delta, but we have performance issues in…
Ramkumar
  • 57
  • 1
  • 8
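
For the question above, a sketch of one common pattern when the copy activity can only land Parquet: stage Parquet with the ADF copy activity, then convert to Delta in a small Spark step. The paths are placeholders.

```python
staging = "abfss://staging@myaccount.dfs.core.windows.net/sqlserver/table_a"
target = "abfss://curated@myaccount.dfs.core.windows.net/delta/table_a"

(spark.read.parquet(staging)      # files landed by the ADF copy activity
      .write.format("delta")
      .mode("append")
      .save(target))
```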
0
votes
1 answer

How to Create External Tables (similar to Hive) on Azure Delta Lake

How do I create external Delta tables on Azure Data Lake storage? I am currently working on a migration project (from Pyspark/Hadoop to Azure). I couldn't find much documentation around creating unmanaged tables in Azure Delta Lake. Here is a…
Sidd
  • 261
  • 1
  • 6
  • 24
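
A sketch of registering an unmanaged (external) Delta table over data in ADLS, analogous to a Hive external table; the abfss path, database, and table names are placeholders, and `df` is assumed to hold the data.

```python
location = "abfss://data@mystorageaccount.dfs.core.windows.net/delta/sales"

# Write (or reuse existing) Delta data at the external location ...
df.write.format("delta").mode("overwrite").save(location)

# ... then register a table whose metadata simply points at that location.
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS my_db.sales
    USING DELTA
    LOCATION '{location}'
""")
```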
0
votes
0 answers

java.lang.NoClassDefFoundError: org/apache/spark/sql/connector/catalog/TableProvider

I am trying to alter the delta table using a Spark SQL query in Java code, but when I am executing my jar on the cluster I am getting an error like: org.spark_project.guava.util.concurrent.ExecutionError: java.lang.NoClassDefFoundError:…
Yogesh
  • 89
  • 2
  • 6
0
votes
1 answer

Databricks - Find the delta tables which were recently updated

I am working on a use case in Databricks - GCP, where I am trying to find the delta tables in a schema/database in Databricks which were updated in the last day. I used DESCRIBE DETAIL and ran this command in a loop for all the table…
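
A sketch of the DESCRIBE DETAIL loop mentioned above, assuming a database name and that every listed table is a Delta table (non-Delta tables would need a try/except or a filter).

```python
from datetime import datetime, timedelta

cutoff = datetime.now() - timedelta(days=1)
recently_updated = []
for t in spark.catalog.listTables("my_db"):          # hypothetical database
    detail = spark.sql(f"DESCRIBE DETAIL my_db.{t.name}").collect()[0]
    if detail["lastModified"] and detail["lastModified"] >= cutoff:
        recently_updated.append(t.name)

print(recently_updated)
```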
0
votes
2 answers

PySpark: how to get a specific file based on date to load into a dataframe from a list of files

I'm trying to load a specific file from a group of files. Example: I have files in HDFS in the format app_name_date.csv, and I have hundreds of files like this in a directory. I want to load a csv file into a dataframe based on date. dataframe1 =…
v33ran00l
  • 31
  • 3
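
A sketch for the question above: build the file name from the wanted date and read just that file. The date format, base path, and header options are assumptions.

```python
load_date = "2021-09-01"
path = f"hdfs:///data/app_logs/app_name_{load_date}.csv"

dataframe1 = (spark.read
              .option("header", "true")
              .option("inferSchema", "true")
              .csv(path))

# A glob pattern also works if several apps share the same date:
# spark.read.csv(f"hdfs:///data/app_logs/*_{load_date}.csv", header=True)
```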