Questions tagged [delta-lake]

Delta Lake is an open source storage layer that runs on top of Apache Spark. It provides ACID transactions, scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
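
For readers new to the tag, here is a minimal PySpark sketch of a few of the features listed above (ACID writes, updates/deletes, and time travel). It assumes a Spark session already configured for Delta (for example via the delta-spark package or a Databricks runtime); the table path and column names are illustrative only.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()
path = "/tmp/events_delta"  # hypothetical table location

# ACID write: the table is Parquet data files plus a _delta_log transaction log
spark.range(0, 100).withColumnRenamed("id", "event_id") \
    .write.format("delta").mode("overwrite").save(path)

# Updates and deletes through the DeltaTable API
events = DeltaTable.forPath(spark, path)
events.delete("event_id < 10")                        # remove rows transactionally
events.update(set={"event_id": "event_id + 1000"})    # rewrite the remaining rows

# Time travel: read the table as it was before the delete/update
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
print(v0.count())  # still sees the original 100 rows
```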
1226 questions
0 votes • 1 answer

delta-io/delta project compile failed

The project is delta. I've already installed sbt-1.2.8 on my Mac. However, I can't compile this project using "build/sbt compile". After I type the command, the error is below: java.lang.NoClassDefFoundError: scala/reflect/internal/Trees …
Jiayi Liao • 999 • 4 • 15
0 votes • 1 answer

Databricks Checksum error while writing to a file

I am running a job on 9 nodes. All of them are going to write some information to files, doing simple writes like the one below: dfLogging.coalesce(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation) However I am receiving this…
0 votes • 1 answer

Loading data into Delta Lake from Azure Blob Storage

I am trying to load data into Delta Lake from Azure Blob Storage. I am using the below code snippet: storage_account_name = "xxxxxxxxdev" storage_account_access_key = "xxxxxxxxxxxxxxxxxxxxx" file_location =…
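
For this kind of question, the usual pattern is to register the storage account key in the Spark configuration and then read the source file and write it back out in Delta format. A minimal sketch assuming WASB access with an account key; the account, container, and path names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

storage_account_name = "examplestoragedev"        # placeholder
storage_account_access_key = "<access-key>"       # placeholder
container = "raw"                                 # placeholder

# Make the blob container reachable from Spark
spark.conf.set(
    f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net",
    storage_account_access_key,
)

src = f"wasbs://{container}@{storage_account_name}.blob.core.windows.net/input/sales.csv"
dst = f"wasbs://{container}@{storage_account_name}.blob.core.windows.net/delta/sales"

# Read the raw file and persist it as a Delta table at the target path
df = spark.read.option("header", "true").csv(src)
df.write.format("delta").mode("overwrite").save(dst)
```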
0 votes • 1 answer

How to specify the location of a Delta Lake table in Spark Structured Streaming?

I have incoming streaming data which I am saving as a Delta Lake table using the below code: cast_dataframe.writeStream.format("delta").outputMode("append") .option("checkpointLocation",checkpointLocation) .table(databasename+"."+tablename) Here…
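
A common way to control where the Delta files for a streaming table land is to write to a path directly, or to pass an explicit path option alongside the table name. A minimal sketch using a rate source as a stand-in for the question's cast_dataframe; all paths and table names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the question's cast_dataframe: a simple rate source
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

checkpoint = "/tmp/checkpoints/my_table"   # placeholder
location = "/tmp/delta/my_table"           # placeholder

# Option 1: write the stream straight to a storage path (the table *is* that path)
q1 = (stream_df.writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", checkpoint)
      .start(location))

# Option 2 (Spark 3.1+): register a metastore table but pin its files to a path
q2 = (stream_df.writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", checkpoint + "_tbl")
      .option("path", location + "_tbl")
      .toTable("mydb.my_table"))            # hypothetical database.table name
```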
0 votes • 0 answers

How to stop concurrent writing in Delta Lake External Table?

A typical EXTERNAL table, as in Oracle, doesn't allow Insert/Update operations. But a Databricks EXTERNAL Delta table does allow Update/Insert operations. I see this as a flaw; is there any way to stop that? Example - CREATE TABLE employee USING…
Anirban Nag 'tintinmj' • 5,572 • 6 • 39 • 59
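
Delta itself has no EXTERNAL-table write block, but a commonly mentioned lever is the delta.appendOnly table property, which makes a table reject UPDATE and DELETE (it does not block new INSERTs, so fully locking the table down still relies on workspace or storage ACLs). A hedged sketch of setting it, using the employee table from the excerpt:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# delta.appendOnly = true makes the table reject UPDATE and DELETE operations;
# INSERTs still succeed, so fully blocking writes needs table/storage ACLs on top.
spark.sql("ALTER TABLE employee SET TBLPROPERTIES ('delta.appendOnly' = 'true')")
```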
0 votes • 1 answer

Can we use Scala to Perform UPDATE and DELETE operations on Databricks Delta tables?

I am able to create Databricks Delta tables using Scala and to perform append and overwrite operations on them. Is there any way I can perform DELETE and UPDATE operations using Scala, and not through the Databricks runtime? val target = Seq( …
Arun • 9 • 1
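
Outside of SQL, the Delta Lake library exposes a DeltaTable class (io.delta.tables.DeltaTable in Scala/Java, delta.tables.DeltaTable in Python) with delete, update/updateExpr, and merge methods, available with open-source Delta as well as on Databricks. A minimal Python sketch of the calls the question asks about in Scala; the path and predicates are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Attach to an existing Delta table by path (DeltaTable.forName works for metastore tables)
target = DeltaTable.forPath(spark, "/tmp/delta/target")   # placeholder path

# DELETE rows matching a predicate
target.delete("status = 'inactive'")                      # placeholder predicate

# UPDATE a column for rows matching a predicate
target.update(
    condition=expr("country = 'US'"),                     # placeholder condition
    set={"amount": expr("amount * 1.1")},                 # placeholder assignment
)
```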
-1 votes • 1 answer

Change Column name in table and delta files?

I have a folder delta_table with delta files and I have created a table called test_delta_table based on the delta files. How can I change the name of a column in the underlying delta files and in the table itself? I'm getting a Spark error if I…
AzUser1 • 183 • 1 • 14
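
Renaming a Delta column in place generally needs the column-mapping feature; once it is enabled, ALTER TABLE ... RENAME COLUMN is a metadata-only change and the underlying files are not rewritten. A hedged sketch, assuming a Delta Lake version / Databricks runtime that supports column mapping; old_name and new_name are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Column mapping 'name' mode lets Delta track columns by id, so a rename becomes
# a metadata-only change (it requires the newer reader/writer protocol versions).
spark.sql("""
    ALTER TABLE test_delta_table SET TBLPROPERTIES (
        'delta.minReaderVersion' = '2',
        'delta.minWriterVersion' = '5',
        'delta.columnMapping.mode' = 'name'
    )
""")

# Rename in the table metadata; the underlying parquet files are not rewritten
spark.sql("ALTER TABLE test_delta_table RENAME COLUMN old_name TO new_name")
```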
-1 votes • 1 answer

Bad Performance using Open Table Format

I have an existing case where the entire/full data is read daily from multiple Hive tables and is processed/transformed (join, aggregation, filter, etc.) as specified in SQL queries. These SQL queries are defined in a series of YAML files, let's say…
-1 votes • 0 answers

Failed to merge schema when using CONVERT TO DELTA on a folder with parquet files

I have parquet files stored in ADLS Gen2 with this structure: year/month/day part_*.snappy.parquet The files in the folders represent the same dataset, but their schema is changing ("evolving") over time. So, for 2023-01-15 the schema of the parquet…
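
When the parquet files under a folder have drifted ("evolving") schemas, a workaround often suggested instead of CONVERT TO DELTA is to read the whole tree with schema merging and rewrite it as a new Delta table. A rough sketch under that assumption; the ADLS paths are placeholders and the nested year/month/day layout is read with recursiveFileLookup.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

source = "abfss://container@account.dfs.core.windows.net/history/"        # placeholder
target = "abfss://container@account.dfs.core.windows.net/history_delta/"  # placeholder

# mergeSchema on read reconciles the drifting parquet schemas into one superset;
# recursiveFileLookup picks up the files nested under the year/month/day folders.
df = (spark.read
      .option("mergeSchema", "true")
      .option("recursiveFileLookup", "true")
      .parquet(source))

# Rewriting the result produces a fresh Delta table with the merged schema
df.write.format("delta").mode("overwrite").save(target)
```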
-1 votes • 1 answer

Databricks Statistics on delta table

Do you think generating column statistics makes sense in Delta Lake? Does it optimize joins & aggregations, or do only the statistics inside _delta_log count? I tried to see if statistics have influence on…
-1 votes • 2 answers

What happens when we delete Spark managed tables?

I recently started learning about Spark. I was studying Spark managed tables. As per the docs, "Spark manages both the data and metadata". Assume that I have a CSV file in S3 and I read it into a data frame like below. df =…
Ravi • 2,778 • 2 • 20 • 32
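
To the managed-table question above: a table created with saveAsTable and no explicit path is managed, so Spark owns both the metadata and the copied data files, and DROP TABLE removes both, while the original file in S3 is untouched. A small sketch illustrating that behaviour; bucket and table names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reading an external CSV does not hand its files over to Spark
df = spark.read.option("header", "true").csv("s3a://my-bucket/raw/people.csv")  # placeholder

# saveAsTable with no explicit path/LOCATION creates a *managed* table:
# Spark copies the data into its warehouse directory and owns data + metadata
df.write.format("delta").saveAsTable("people_managed")

# Dropping a managed table removes the metadata AND the warehouse data files...
spark.sql("DROP TABLE people_managed")
# ...while the source file at s3a://my-bucket/raw/people.csv is left untouched.
```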
-1 votes • 1 answer

How does a delta table support SQL queries?

If we create a delta table (let's say in Azure ADLS), then the data is stored in ADLS in the Parquet file format. How are we able to use SQL queries on the stored data? Does Spark convert SQL queries into Java code internally?
Rushikesh • 1 • 1 • 1
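
To the question above: the data stays plain Parquet plus the _delta_log, and Spark's SQL engine plans queries over it exactly as it does for any other DataFrame source (Catalyst produces a plan, which is compiled to JVM bytecode), so no separate conversion of your SQL is needed. Two common ways to query a Delta location with SQL, with the path as a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/mnt/adls/sales_delta"   # placeholder Delta table location

# 1) Query the path directly with the delta.`...` syntax
spark.sql(f"SELECT count(*) FROM delta.`{path}`").show()

# 2) Or load it as a DataFrame, register a temp view, and query that
spark.read.format("delta").load(path).createOrReplaceTempView("sales")
spark.sql("SELECT region, sum(amount) FROM sales GROUP BY region").show()  # hypothetical columns
```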
-1 votes • 1 answer

What is a Delta table in a directory as it pertains to Databricks

If I were to ask the question "What is a Delta table in a directory?", which of the following would be the correct answer: a) It is a directory containing data files b) It is a single file containing the data c) It is a subdirectory that is named based…
Patterson • 1,927 • 1 • 19 • 56
-1 votes • 1 answer

Performance Hit when writing into the partitioned Tables

Can someone please help explain why the table is taking so much time to write, when the table is very small?
Umer • 25 • 5
-1 votes • 1 answer

Writing to delta table fails with "not enough data columns"?

I was trying to execute the below spark-sql code in Databricks, which does an insert overwrite on another table; both tables have the same number of columns with the same names. INSERT OVERWRITE TABLE cs_br_prov SELECT…
venkat • 111 • 1 • 1 • 11
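
The "not enough data columns" error usually means the SELECT list yields fewer columns than the target table defines; the usual fix is to project exactly the target's columns, supplying explicit values or NULLs for anything the source lacks. A hedged sketch; the staging table and column names are invented for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The SELECT must yield one value per target column, in the target table's column order.
spark.sql("""
    INSERT OVERWRITE TABLE cs_br_prov
    SELECT
        s.provider_id,                     -- hypothetical columns
        s.provider_name,
        CAST(NULL AS STRING) AS region     -- explicit NULL for a column the source lacks
    FROM cs_br_prov_stage s                -- hypothetical staging table
""")
```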