Questions tagged [delta-lake]

Delta Lake is an open source storage layer that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel (data versioning), a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and because of the lack of transactions, data engineers have to go through a tedious process to ensure data integrity. Delta Lake brings ACID transactions to your data lakes and provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: The Delta Lake transaction log records details about every change made to the data, providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
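
For orientation, below is a minimal PySpark sketch exercising several of the features above (ACID writes, schema-enforced appends, time travel, audit history, and merge-based upserts). It is illustrative only: the path and table contents are made up, and it assumes Spark 3.x with the io.delta:delta-core_2.12 package on the classpath and the Delta SQL extension and catalog configured, roughly as in the quickstart on delta.io.

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    # Assumes io.delta:delta-core_2.12 is available to the Spark session.
    spark = (
        SparkSession.builder.appName("delta-lake-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    path = "/tmp/delta-demo"  # hypothetical location; could equally be an s3a:// or abfss:// path

    # ACID write: data is stored as Parquet files plus a JSON transaction log.
    spark.range(0, 5).write.format("delta").mode("overwrite").save(path)

    # Appends are schema-enforced; mismatched columns are rejected unless schema evolution is enabled.
    spark.range(5, 10).write.format("delta").mode("append").save(path)

    # Time travel: read the table as of an earlier version.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

    # Upserts (merge) through the DeltaTable API; update and delete are also available.
    target = DeltaTable.forPath(spark, path)
    updates = spark.range(8, 12)
    (target.alias("t")
        .merge(updates.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    # Audit history: every commit is recorded in the transaction log.
    target.history().show()
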
1226 questions
0 votes • 1 answer

Reference 'unit' is ambiguous, could be: unit, unit

I'm trying to load all incoming parquet files from an S3 Bucket, and process them with delta-lake. I'm getting an exception. val df = spark.readStream().parquet("s3a://$bucketName/") df.select("unit") //filter data! .writeStream() …
Tamás • 19 • 1 • 6

0 votes • 1 answer

How to add Delta Lake support to Zeppelin's spark interpreter?

I'm trying to add the Delta Lake support to Zeppelin. So far I've tried adding the io.delta:delta-core_2.12:0.7.0 dependency to the spark interpreter, as well as a couple other related actions within the interpreters view... but nothing has…
kylemart • 1,156 • 1 • 13 • 25

0 votes • 1 answer

Streaming data into delta lake, reading filtered results

My goal is to continuously put incoming parquet files into delta-lake, make queries, and get the results into a Rest API. All files are in s3 buckets. //listen for changes val df = spark.readStream().parquet("s3a://myBucket/folder") //write changes…
Tamás • 19 • 1 • 6

0 votes • 1 answer

Error when trying to move data from on-prem SQL database to Azure Delta lake

I am trying to move large amounts of reference data from an on-prem SQL Server to Delta Lake to be used in Databricks processing. To move this data, I am trying to use Azure Data Factory via a simple Copy data activity, but as soon as I start the pipeline I…
rsapru • 688 • 14 • 30

0 votes • 0 answers

Add new column to the existing table in Delta lake(Gen2 blob storage)

Curious to know: can we add a new column to an existing Delta Lake table stored in Gen2 blob storage? Based on the business use case, I will need to add 3 more columns to one of the tables in the delta lake. ALTER TABLE didn't work for…
chaitra k • 371 • 1 • 4 • 18

0 votes • 1 answer

Why Databricks Delta is copying unmodified rows even when merge doesn't update anything?

When I run the following query: merge into test_records t using ( select id, "senior developer" title, country from test_records where country = 'Brazil' ) u on t.id = u.id when matched and (t.id <> u.id) then -- this is just to be sure that nothing…
Jakub Troszok • 99,267 • 11 • 41 • 53

0 votes • 1 answer

Migrate (on-prem) SQL data to Azure with Databricks (JDBC)

Is it possible to use the JDBC connector https://docs.databricks.com/data/data-sources/sql-databases.html in order to get data from a local SQL Server (and export it to a delta lake)? Using: jdbcUrl = "jdbc:mysql://{0}:{1}/{2}".format(jdbcHostname,…
0 votes • 2 answers

ad-hoc slowly-changing dimensions materialization from external table of timestamped csvs in a data lake

Main question: how can I ephemerally materialize a slowly changing dimension type 2 from a folder of daily extracts, where each csv is one full extract of a table from a source system? Rationale: We're designing ephemeral data…
0 votes • 1 answer

Deltalake error- MERGE destination only supports Delta sources

I am trying to implement scd-type-2 in delta lake and I am getting the following error: "MERGE destination only supports Delta sources". Below is the code snippet I am executing. MERGE INTO stageviews.employeetarget t USING ( …
Kiran A • 179 • 1 • 2 • 7

0 votes • 2 answers

why pyspark find local hive metastore only from root directory?

I have a question on Hive metastore support for delta lake. I've defined a metastore on a standalone Spark session with the following configuration: pyspark --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"…
0 votes • 1 answer

Merge into delta table not working with java foreachbatch

I have created a delta table and now I'm trying to insert data into that table using foreachBatch(). I've followed this example. The only difference is that I'm using Java and not in a notebook, but I suppose that should not make any difference? My…
RudyVerboven • 1,204 • 1 • 14 • 31

0 votes • 2 answers

pyspark delta lake optimize - fails to parse SQL

I have a delta table created using Spark 3.x and delta 0.7.x: data = spark.range(0, 5) data.write.format("delta").mode("overwrite").save("tmp/delta-table") # add some more files data = spark.range(20,…
Georg Heiler • 16,916 • 36 • 162 • 292

0 votes • 0 answers

Databricks Delta Table Merge Example with Comparison with Update

I have found a ton of examples showing how to merge data using Databricks Delta Table Merge to load data into a SQL DB. However, I'm trying to find examples where loading data into a SQL DB without Databricks Delta Merge fails. This is because I'm…
Carltonp • 1,166 • 5 • 19 • 39

0 votes • 1 answer

List views in delta-lake

I'm trying to figure out a way to list the views in a given delta-lake database. Is there anything available that's equivalent to SQL Server's INFORMATION_SCHEMA, or something obvious that I'm missing? I have tried the following without any…
Manish Karki • 473 • 2 • 11

0 votes • 4 answers

Confusion About Delta Lake

I have tried to read a lot about Databricks Delta Lake. From what I understand, it adds ACID transactions to your data storage and accelerates query performance with a delta engine. If so, why do we need other data lakes which do not support ACID…
user13128577